1 Introduction

Data structures play important role in several domains. They are represented by graphs as well as by more general data structures like hypergraphs [1, 2] and ultragraphs [3].

One domain the data structures mentioned above are extensively used is software engineering, which explores different data structure representations when both the metamodels for modeling languages and the models, which are compliant to the metamodels are to be constructed.

Other domains constitute very interesting application fields of the data structure representations as well. A good example of such a domain is knowledge mod0eling, which is classified as a part of Artificial Intelligence (AI).. Knowledge modeling requires and explores extensively both data representations (not necessarily data structures representation) and data themselves for the purpose of representing knowledge. Knowledge representation is, in turn required by expert systems also well known in AI domain.

This paper is an extension of the conference paper [4] and is dedicated to the presentation of the latest research results led by the authors in the domain of data representation. A new approach to data structures representation was introduced in [5] and then extended in [4]. The subjected data representation named Extended Generalized Graph (EGG) is defined precisely and completely in this paper. Another important approach to knowledge representation, the Resource Description Framework (RDF) is explored in the paper and is compared to the EGG data representation. The EGG data structures representation is also used in the paper for an illustration of its applicability for knowledge representation. There are some commonly used approaches to data representation, which are traditionally used in knowledge modeling. Thus, the EGG-based approach constitutes an alternative concept of data representation in knowledge modeling domain. A shortened version of the application case study presented in [4] is extended in this paper. Another case study, which is related directly to the knowledge modeling domain is introduced in the paper as well to illustrate the features and possible advantages of the EGG when applied to this AI domain.

1.1 Application of the metamodels for data modeling

Metamodel is a key notion in software engineering domain. The metamodel term refers to a representation (a model) of a modeling language. However, this term may be also related to several data representation languages, which must have their metamodels. A discussion of several metamodels and the data modeling languages was presented in [4]. Such the data modeling languages like Extensible Markup Language (XML), Relational Database Model (RDBM), Unified Modeling Language™ (UML) (when applied to class/object models) and Associaton-Oriented Metamodel (AOM) were taken into account. The set of these metamodels is extended in the paper by the RDF which is also used to represent data. Moreover, it is worth noticing that the metamodels constitute the data models as well.

1.2 Motivation for representing metamodels in terms of the EGG

There are several data modeling languages known from the software engineering domain. The UML [6] is one best known. Its characteristic feature is that the metamodel structure has the form of a graph. However, this language allows to create models of more complex structures. Nevertheless, extensional data structures in UML are very simple and described very generally in the specification. Formally, and within UML’s abstract syntax it is possible to model higher-level graph structures, but in practice it is hard to find a suitable semantics for such the structures, therefore the modellers utilize that very rarely.

Despite the generality of graphs they are not general enough to model a complex reality correctly. Some simplifications are made in such the case what results in the reality deformations. One example of such the simplifications is the reification of the n-ary relationships [7]. There are two main reasons of applying graphs for a more complex problem in software engineering. One results from the limited competences in the modeling domain. Another one results from the limits of the contemporary modeling languages. The most critical limit is the unavailability of the hypergraphs [8] and/or the ultragraphs (a.k.a. the ubergraphs) [9, 10]. These graph generalizations are known from the graph theory but are rarely applied when modeling. The redundancy of the modeling language constructs being the result of the more complex metamodels makes it possible to optimize models and to fit them to the real needs better.

Some extensions and generalizations of the graph structures, named EGG were introduced in [5]. Some more complex graph-like structures with data associated with these structures can be created with the help of the EGG notion in a uniform way. Modeling languages are applied to order reality but, paradoxically the metamodels, even as complex as the UML, are out of the scope of such ordering efforts. That is why the EGG concept was introduced.

1.3 Research problem

One important feature of knowledge representation used in AI domain is its universality and flexibility. Both features are achieved through the simplicity of data representation and the lack of applying a general enough representation of their structure. Data is usually represented by a flat structure of 3-tuples (subject, predicate and object) to represent knowledge.

The authors argue that the lack of a more complex structural data representation results from the lack of the general enough and configurable data representation model. The research problem that has been undertaken is as follows. There is no model for structure for data and knowledge representation that is highly configurable and which has varying level of expressiveness. Thus, data metamodels which are designed under different assumptions are hard do compare in terms of data structures that they allow to model. The authors have proposed an extensible and general network-like structure which could be configured to express complex data. The proposed solution offers such a universal and a uniform approach to represent data.

Thus, the EGG has a chance to be widely used in the AI discipline for knowledge representation.

1.4 Contributions

There are several contributions of the paper. One are the implementations of EGG’s abstract syntax in terms of the AOM and the UML modeling languages as well as definitions of EGG’s concrete syntaxes - a symbolic and a graphical ones. The previously published definition and the metamodel implementations [4, 5] are completed in the paper. Other contribution is a description of EGG categories semantics. An illustrative discussion of the applicability of the EGG for social networks (extended from [4]) and knowledge modeling (introduced in the paper) are also novel. The set of the analysed modeling languages presented in [4, 5] is extended in the paper by the RDF as a knowledge modeling language with its metamodel. The RDF categories are mapped to the EGG categories and the evaluation of data complexity in the RDF data metamodel in relation to the EGG categories is led, which is also a novel element of the paper. Another new concept presented in the paper is the approach to measuring the expressiveness of a particular metamodel through a model created in this metamodel, transformation of this model to the one expressed in terms of the EGG metamodel and then measuring the distance between both models.

Section 3.1 contains more traditional and less traditional implementations of EGG abstract syntax. Definitions of two EGG concrete syntaxes: more universal symbolic concrete syntax (Section 3.2.1) and more readable graphical concrete syntax (Section 3.2.2) are introduced after implementations of EGG abstract syntax. Several languages are mapped in Section 4 to the EGG in order to show how the categories from these languages can be expressed in terms of the very general EGG categories. Then, some case studies are presented – in Section 5.1 a way of using the EGG for a social training group (in relation to the case study from [5]) is illustrated and in Section 5.2 the application of the EGG for representing knowledge is shown. In the rest of the text the complexity of the chosen modeling languages is analysed. The complexity is related for each language to the usage level of all possible categories offered by the EGG. Finally, the conclusions arising from the achieved research results are presented.

2 Related work

There are several domains and threads in the scientific literature, which are related to the subject of the paper.

The fundamental question regarding the approach to representing the data structures, which is used in the paper is formulated in [11]. The question is if the mathematical models or metamodels should be applied to define the data structures in computer science (informatics). The paper shows the fact, which is supported by the examples that the metamodels are more useful in the IT community. Because the metamodels not only better fit well-established solutions in computer science, e.g. the programming languages, but also thanks to their high readability and support with graphic notations, they better meet the needs of a wide group of users. In the cited work, a number of approaches used to define data structures is given.

The authors of this publication share the opinions contained in the work [11] – therefore, after defining the EGG with the help of some set-theoretic concepts, they consistently apply an approach based just on metamodeling. Moreover, in contrast to the work [11], the authors propose a far-reaching generalization of a simple definition of a graph as an alternative approach to the considerations and the comparisons presented in [11].

There are also several publications dedicated to the different graph representations, including some generalizations of graphs, like hypergraphs [1, 2] and ultragraphs [3], which were already mentioned in Section 1. A different and a very general approach to the graph-like representations was proposed by the authors in [4, 5]. Instead of defining the special cases of even very general representations the mentioned approach relies on defining a general data model together with some mechanisms of limiting its generality and extended character to the needs required by a user of the approach.

Another thread of publications is related to the data modeling problems. One important publication is [12], which may constitute a reference for the elaboration of the tools and the standards originated by the EGG-related approach. An interesting problem connected to the transformations of data models expressed in different metamodels is analysed in [13]. This problem can be related to the data model representation introduced by the EGG as one application of the presented concept of the data representation.

Large group of publications is connected to the particular application domains, also mentioned in Section 1, where data modeling is explored and plays an important role for these domains.

One important domain where the data representation plays the key role is defining the modeling languages and the models in these languages, which is characteristic for software engineering and can be used for the model-driven automated generating of the software systems [14]. There are some commonly known standards in the software engineering domain, which are managed and developed by the Object Management GroupTM (OMG). Some standards, like the Meta-Object FacilityTM (MOF), UML and several other ones support the approach to the software development processes according to the Model Driven Architecture®; (MDA) [15]. All standards mentioned above and especially their metamodels are based on the basic version of the graph notion. This fact limits the possibility of full use of the modeling potential. There are two kinds of the modeling languages known from the software engineering domain, namely the Domain-Specific Modeling Languages (DSML) explored in [16, 17] as well as the General Purpose Modeling Languages (GPML) mentioned in [18]. Application of the EGG for both the DSML and the GPML gives a chance to break the limits of these languages and introduce a basis for a common approach to constructing them.

A domain of special interest in the paper is knowledge modeling as a branch of artificial intelligence. One important knowledge representation technique is semantic networks. Usually, in semantic networks the binary edges are assumed according to the RDF standard discussed in the paper. However, there are the hypergraph generalizations known in the semantic network models [19]. Semantic networks are well established in computer science and information systems [20, 21] and are still subject of interest [22].

Data models of the higher expressive power, such as the hypergraph models are used also in network analysis [23], scientometrics [24], natural language processing [25].

It should be underlined that the importance of data modeling is confirmed by extensive standardization efforts already mentioned above. In this paper we address the limits of the standards, which result in the difficulties when the domains relying on the standards are explored and developed.

Taking into consideration all the state-of-the-art solutions mentioned above, they (explicitly or implicitly) implement the graph-like structures with different expressive power. In contrast to the state-of-the-art papers, our proposed solution can be perceived as a single formalism which has the superset of graph-like properties and encapsulates them in the form of the generalizations and the extensions. To describe a graph or a hypergraph, or even an ultragraph structure one can use EGG formalism and configure its properties in order to be well-suited to the modelled system. Due to this fact, the EGG is featured by universality in terms of capturing the graph orientation with the different properties. On the other hand, the specific solutions are restricted to their own expressive power and cannot cross beyond them.

3 Definitions of the extended generalized graph (EGG)

The EGG definition was introduced, in its original version in [5] and then it was slightly expanded in [4]. A characteristic feature of these definitions was to focus more on the definition of the graph itself than on the overall concept of the EGG. In this article, the above-mentioned definitions have been extended to include just the EGG notion as a whole. The new concepts introduced to the EGG are expressed in the paper in the form of the AOM and the UML implementations of the EGG metamodel.

The very idea of the EGG concept is based on certain assumptions, which are summarized below. The key such an assumption is that each category of a metamodel should have a function (it can be expected of any metamodel, not only just the EGG one). These functions may be characterized for the particular EGG categories as follows:

  • the interrelating categories:

    • composition – a lifetime dependence and exclusive ownership relationship;

    • reference – no lifetime dependence and no exclusiveness relationship;

  • the interrelated categories:

    • Container is a category physically COMPOSING (in an extensional sense) all other categories, that is it owns all other categories through the composition relationship;

    • Egg is a GROUPING category, i.e. it is something like a set, it semantically groups the categories on the basis of a reference relationship;

    • Edge is a CONNECTING category, and its semantics focuses on the fact that it represents the n-ary relationship. It has a referential relationship nature. Within the EGG concept, it can also be a connected category;

    • V ertex is a CONNECTED category, semantics focuses on the fact that it can participate in the reference relationships, it cannot connect other categories;

    • Label is a DESCRIBING category, i.e., semantically speaking, it is a place for information that aims to associate any information with Egg, Edge, V ertex. Label cannot reference or be referenced (in terms of reference relationship) apart from the one that semantics is about to describe, and therefore provide additional information in any form.

The formal definition of the concept of the EGG characterized above is presented in Section 3.1.

3.1 EGG abstract syntax definitions

EGG abstract syntax in version focused on the graph structure itself has been defined in the previous publications [4, 5] in terms of the set theory. In this paper the enriched metamodel is specified from the perspective of two metamodel implementations. First, the AOM-based implementation and then the UML-based implementation are introduced.

3.1.1 AOM implementation of EGG abstract syntax

The AOM is a novel data metamodel, which has been proposed by Krótkiewicz as a unified platform for data modeling with high semantic capacity. This metamodel covers a symbolic formalism called the Associaton-Oriented Formal Notation (AFN), used to model definition. Both syntax and semantics of the AFN can be found in [26].

First, the AOM-based implementation (Fig. 1). The EGG was implemented in the AOM using the AFN as follows:

(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
$$ \Box Connection \left\langle +navigability : ascii \right\rangle ; $$
(9)
Fig. 1
figure 1

The AOM implementation of the EGG concept metamodel

All roles in the AOM model are referential, i.e. they have the references to the objects. The association-side composition ♢Container was used in + egg, + vertex, + edge and + label roles as the physical instances container to implement the Container. Each such the instance belongs to exactly one ♢Container. The ♢Element abstract association implements the polymorphism mechanism in relation to ♢Egg, ♢V ertex and ♢Edge, which allows ♢Egg to aggregate these instances using the + aggr role. In addition, the ♢Edge association is able to connect the listed instance categories via the ♢Connection association.

3.1.2 UML implementation of EGG abstract syntax

The UML implementation of the EGG consists of both the UML class diagram expressed in Fig. 2 and the Object Constraint Language (OCL) specification of the constraints for the model presented on the diagram.

Fig. 2
figure 2

The UML 2.5.1 and the OCL 2.4 implementation of the EGG concept metamodel

The diagram from Fig. 2 is adorned by the following constraints which are expressed in the natural language first and then are specified formally via the OCL expressions:

  • there is exactly one instance of the Container

    context Container inv UniqueContainer: self.oclType().allInstances() -> size() = 1

  • there is at least one Egg which is contained in the Container

    context Container inv MinCountEgg: self.element.graph -> size() > 0

  • there is exactly one Egg (the root Egg), which is not contained in other Egg

    context Egg inv RootEgg: Egg.allInstances() -> select(graph -> size() = 0 ) -> size() = 1

  • Egg cannot be self contained,

    context Egg inv NotSelfEgg: not self.aggr -> includes(self )

  • Egg cannot be cyclic in terms of the aggr aggregation,

    context Egg inv AcyclicEgg: self.graph -> closure(graph ) -> excludes(self )

  • each V ertex must be contained in at least one Egg

    context Vertex inv ContainedVertex: self.graph->size() > 0

  • each Edge must be contained in at least one Egg

    context Edge inv ContainedEdge: self.graph->size() > 0

In order to systematize terminology related to EGG abstract syntax, the following terms are introduced, which are used later in the text:

  • EGG concept – the whole data structures modeling concept based on EGG abstract syntax,

  • EGG metamodel – a definition of the EGG concept, which is composed of the EGG abstract syntax and semantics,

  • EGG structure – a term referring to the data structure represented by EGG equivalent to the EGG metamodel instance,

  • EGG category – any category of EGG abstract syntax, that is Egg, V ertex, Edge, Label, Element,

  • EGG category instance – the concretized Egg, V ertex, Edge, Label categories,

  • EGG element – any subcategory of Element, that is Egg, V ertex, Edge,

  • reference – a no lifetime dependence and no exclusivity relationship,

  • composition – a lifetime dependence and an exclusive ownership relationship,

  • aggregation – a kind of the reference relationship,

  • connection – a kind of the reference relationship,

  • description – a kind of the reference relationship.

3.2 EGG concrete syntax definitions

This section contains a presentation of two concrete syntaxes (notations) of the EGG. Both concrete syntaxes are compliant to abstract syntax defined in Section 3.1. First, symbolic concrete syntax and then graphical concrete syntax are presented.

3.2.1 Symbolic EGG concrete syntax

The EGG-based structures are extensional and, as such, they have no constraints beyond those defined in the EGG abstract syntax definition.

The symbols and the rules used:

  • The metamodel category instance identifier is unique within the model. The identifier may have any symbolic form.

  • The symbols 〈⋅〉 represent a comma-separated list of components, e.g. 〈a,b,c〉.

  • The symbols \(\left \lbrace \cdot \right \rbrace \) are brackets that group items of the same category, such as {a,b,c}. The curly brackets can also be used to group the items connected with the same type of connection.

  • The references for each category are denoted as follows:

  • If several consecutive instances have the same category, then the category designation can be drawn in front of a curly bracket delimiting a set of the instances.

    Some examples:

  • If several consecutive instances are connected by the same kind of a connection, the connection designation can be: , drawn in front of a curly bracket that delimits a set of instances.

    Some examples:

  • An instance of a given category is marked with an underlined symbol of the category: together with an identifier and a symbol of the list 〈⋅〉, in which there are the components of the instance.

    Some examples:

    • – an Egg category instance with the g identifier,

    • – a V ertex category instance with the v identifier,

    • – an Edge category instance with the e identifier.

    In the case of the Label s, the list symbol is not added as it does not aggregate any EGG categories.

    An example: – a Label category instance with the l identifier.

    The labels are connected with the ⋯ symbol, which is intended only to visually distinguish the annotating relationship of the labels from the other relationships. It is only the so-called syntactic sugar, as the labels cannot be aggregated by the Egg structures and formally there is no need to use a special notation for the descr relationship.

    Some examples

    • – an Egg category instance with the g identifier is described by the labels with the l1 and the l2 identifiers,

    • – a V ertex category instance with the v identifier is described by the labels with the l1 and the l2 identifiers,

    • – an Edge category instance with the e identifier is described by the labels with the l1 and the l2 identifiers.

  • The Label values can be optionally specified after the colon symbol.

    An example

A detailed specification of mapping concrete EGG symbolic syntax with EGG abstract syntax is presented in the Tables 12345.

Table 1 Concrete syntax of the navigability category
Table 2 Concrete syntax of the Egg category
Table 3 Concrete syntax of the Edge category
Table 4 Concrete syntax of the V ertex category
Table 5 Concrete syntax of the Label category

The symbols in the context of concrete syntax (Tables 67 and 8).

id :

– an identifier (unique in the scope of the whole EGG structure)

3.2.2 Graphical EGG concrete syntax

The root Egg may be omitted in graphical concrete syntax.

4 Mappings between selected data metamodels and EGG

A proposal of a set of possible mapping schemata between some metamodels and the EGG metamodel is contained in this section. Different mappings may be also introduced but the one chosen reflects the characteristic features of each metamodels category well.

One shall notice, that the EGG is a purely extensional data structure. It means, that it does not cover any behavior – and thus gives freedom to define the structural constraints on data. The mapping that the authors have proposed implements gives a possible interpretation of the EGG concepts in terms of the specific data metamodels. The following metamodels are discussed in this section: RDF, XML, RDBM, UML and AOM.

Table 6 Concrete syntaxes of the Egg category
Table 7 Concrete syntaxes of the Edge category
Table 8 Concrete syntaxes of the V ertex category

The RDF is a graph-based representation of the objects in the web. If we think of the XML we analyze the pure XML-based hierarchical structures of the documents. The RDBM covers the relational databasemodel. The UML is a modeling languages, which also facilitates creating a data representation on an object diagram containing InstanceSpecification s. The AOM is a novel data metamodel, both conceptual and physical, which has two distinguished parts: the extensional the intensional. Here, we focus on the extensional part only. Giving the semantic interpretation of metamodel constructs into the EGG categories makes it possible to designate the EGG configuration for a specific metamodel and further to compare the corresponding data structures. The process extensively uses the reification in terms of representing the higher level relationships if the specific data metamodel is solely data-oriented.

Several models were created in the RDF, the XML, the RDBM, the UML and the AOM to present the mappings inference mechanism and the obtained mapping results. Next, the applied constructs were recognized in terms of the EGG. As the result of this approach it was possible to identify which metamodel categories are actually required by the models in terms of their extensional layer. Thus, the metamodel categories which represent instances (model elements) were mapped to the EGG metamodel categories.

Table 9 The RDF to the EGG metamodel categories mapping schema

The Tables 9101112 and 13 of the succeeding subsections contain the results of mapping each model category to the EGG metamodel category. The first column of each table contains the categories of a particular metamodel while the second one – the categories of the EGG metamodel.

Mappings presented in this chapter show how the specific categories from a subsequent models can be perceived in terms of the EGG. The mappings proposed might extend the semantics of the mapping result. This means that the resulting EGG category instances might have broader meaning than they have had before mapping.

Table 10 The XML to the EGG metamodel categories mapping schema
Table 11 The RDBM to the EGG metamodel categories mapping schema

4.1 Resource description framework (RDF)

The RDF is a simple graph-based model proposed by the W3C, which is used to represent information in the web. The RDF graphs are the sets of the triples, each of which consists of three elements, namely a subject, a predicate and an object. The RDF specification also covers three types of the RDF data that may occur in the triples. They consist of the IRI, the literal or the blank node. The IRIs are the identifiers which identify some resources. The literals are basic values, which are associated with a specific datatype. The blank nodes are nodes without a global identifier.

The mapping of the RDF to the EGG involves juxtaposition of their graphical structures, as shown in the Table 9.

Table 12 The UML to the EGG metamodel categories mapping schema
Table 13 The AOM to the EGG metamodel categories mapping schema

4.2 Extensible markup language (XML)

The XML is a markup language focused on storing data in both a textual and a hierarchical forms [27]. These features make it intriguing for data modeling and for analysing it from the more general and novel EGG point of view. The XML categories are identified directly from the XML files (models) without any general definitions of the XML structure (e.g. XML Schema). This decision is motivated by the fact that the analysis in this paper refers to the extensional but not the intensional layer. The results of the identified mapping between the XML category instances and the EGG are presented in the Table 10.

It can be seen in the Table 10 that both Egg and Edge categories are applied with some limits.

4.3 Relational database model (RDBM)

The RDBM constitutes a formal approach to represent the databases. The key notions in the RDBM are relation and tuple. Originally, it has been defined by Edgar Frank Codd [28] in terms of the set theory. The mapping between data contained in the relational databases (models) and the categories of the EGG category instances is contained in the Table 11.

Relational data are conceptually very simple. The extensional model of data consists of the relations, which aggregate the tuples (records). The tuples consist of the order values. Each tuple has a distinguished set of the values, which stand for the tuple’s primary key. To show the relationships between data, referential integrity constraints are often used. Using the latter, one can constrain the domain of the value in such a way that it models a reference to the other value, most often fulfilling the role of the primary key in other tuple.

It is clearly seen in the Table 11 that the Edge category is applied in the RDBM with some limits.

4.4 Unified modeling language (UML)

The UML [6] is a modeling language, which has been standardized and managed by the OMG for many years. The language is intended for modeling both the structural and the behavioral aspects of the software-intensive systems. This language was taken into account because of its very rich semantic capacity when applied to data modeling. The extensional part, however, seems to be rather simple and straightforward in its inner construction.

The results of the comparative study focused on mapping the EGG category instances to the UML models are presented in the Table 12. The extensional part of the UML structural models is covered by the object diagram. The object diagrams are the instances of the class diagrams and are used to show the snapshot of a system at a specific point in time. The InstanceSpecifications are used to show the instantiated classifiers (i.a. classes and associations). These are covered by the objects and the links, respectively. The fact, that an object has some value for a specific attribute is depicted with the slots with ValueSpecifications.

It can be noticed in the Table 12 that the Egg metamodel category is the only one the UML categories cannot be mapped to. The mapping of the other categories is clear and all their features can be mapped directly.

4.5 Association-oriented metamodel (AOM)

The AOM is a data metamodel the characteristic features of it are: implementability, semantic unambiguity, high semantic capacity and high expressiveness. The fundamental notion in the AOM is Association, which is a first-class category [29] according to [30]. Association consist of Roles, which might be played by Collections or by another Associations. The AOM enables the very complex means for data abstraction in terms of the data and relationships polymorphism [26]. The AOM’s extensional part must follow the structural constraints of the model strictly. Collections, Associations and Roles are instantiated as interlinked Objects. The identified mapping between the AOM category and the EGG categories can be found in the Table 13.

The Table 13 shows that the AOM metamodel contains good decomposition of the responsibilities of the metamodel categories in terms of the EGG concept. In the consequence the AOM categories can be easily mapped to all EGG categories. Thus the AOM categories map to all EGG categories in contrast to the UML.

5 Case studies

There are two case studies presented in this section both related to the usage of the EGG data structures and knowledge representation definitions. The first example concerns domain-specific data, namely a structure of an EGG, which represents a social network. The second example is dedicated to illustrating the EGG features like hyper and ultra. Another purpose of the second example is presenting an option of nesting one EGGs in other EGGs. This option is achieved through defining some patterns for simplified structures of the knowledge representation patterns together with a domain-specific example making use of these patterns. Both case studies will be used for further considerations, also in terms of the possibility of their implementation in the selected data metamodels, taking into account the mapping schemes presented in Section 4.

It is clear from the case studies (the extensional level) presented in this section that the EGG concept was introduced to break some limits of the different metamodels already identified in Section 4 on the intensional level.

5.1 Social training group

Social network is a structure consisting of the social actors (e.g. individuals, organizations) and the binary connections between them, which represent their interactions. To improve the semantic capacity [31] of such the network, it might be beneficial to introduce other information patterns regarding the types of the relationships, e.g. the n-ary connections [32], the relationships as participants, grouping, etc. We have employed the EGG to model the structure of the social interactions within a training group.

The structure holds following semantics regarding the training social group: The domain TrainingDomain contains a training group TrainingGroup of its members who interact with each other at the martial arts training. These members are represented as V ertices. Assume that if specific people train with each other this fact will be represented as trains with edge, which connects subsequent vertices. Information about the people’s age is associated with the vertices using the data blocks prefixed with age:. Some people might be the supervisors; they are introduced inside a shared nested Egg Supervisors (thus the supervisors are the members and also a part of the training group). Let the fact, that a specific supervisor set is responsible for a specific supervision be represented as the supervision edge, which connects this set to a specific trains with edge. Some people might be tutors and tutor other people (or even themselves). Let us represent this fact as a binary, directed edge, connection for the role of tutor shall be directed towards the edge and the opposite connection towards the node respectively. Moreover, the tutor edge should be attributed with data containing temporal information about this tutoring, prefixed with year:.

The EGG structure in the Fig. 3 has been created in order to express all possible extensions and generalizations which can be defined within the EGG structures. This means that these structures hold the properties of an \(EGG_{FDN}^{HUMS}\) and the proposed EGG is generalized by the following features: Hyper, Ultra, Multi, Shared Aggregation and is extended by the following features: First Class, Data, and Navigability. Detailed description of the above EGG features has been presented in the paper [5].

Fig. 3
figure 3

An example of the \(EGG^{HUMS}_{FDN}\) model structure for the domain of a social training group

Below one can find a specification of the same EGG within a symbolic notation.

List of Eggs:

(10)

List of V ertices:

(11)

List of Edges:

(12)

5.2 Semantic network

One well-known group of methods for knowledge representation is semantic networks. Such the networks are built upon the graph-like structure by using vertices and edges in order to represent a declarative knowledge (i.a. definitional or assertional). The vertices in a semantic network may represent terms or concepts respectively. The edges are often directed, thus they might be indicted. Moreover, they can have various semantics and can represent different relationships, e.g. meronymy, hyponymy, hypernymy [33].

The classical models of semantic network consider the binary edges solely. However, there are models which are hypergraphic, due to n-ary hyperedges [19]. The applications of semantic networks comprise thesauri definition [34], knowledge visualization (e.g. financial knowledge [35]), natural language processing [36]. Semantic networks have been in the interest of the researchers in computer science and information systems since a dozen decades [21, 37]. There have been proposed many extensions, i.a. partitioned semantic networks [38] and operator-operand networks proposed by Krótkiewicz et al. [22]. Partitioned networks are objectifying the networks themselves in such a way that they could participate in edges.

In the Fig. 4 we have presented an exemplary semantic network defined within the EGG concepts. This network represents the definitional relationships between a few concepts in order to show the simplified relationships and the emotions in a family context. The following schematic constraints have been assumed:

  • Egg s representing networks and facts (represented as shared, partitioned subnetworks),

  • edges representing predicates, which are asserting linkages of elements in a network,

  • vertices representing concepts which participate in a semantic networks,

Fig. 4
figure 4

An example of the \(EGG^{HUS}_{FN}\) model structure for an exemplary semantic network

The following predicate types have been distinguished:

  • n-ary predicate IS A(subtype,supertype1, \( supertype_{2}, \dots , supertype_{n} )\); these predicates connect the specific elements and assert that the subtype element is less specific than the other elements, which are more specific within this relationship; the hyperedges are directed, the ElementToEdge direction denotes the subtype, the EdgeToElement direction denotes the supertypes;

  • n-ary predicate INSTANCE OF(instance,type1, \( type_{2}, \dots , type_{n})\); these predicates connect the elements being the instances of another elements being the types; the hyperedges are directed, the ElementToEdge direction denotes the instance, the EdgeToElement direction denotes the types;

  • n-ary predicate \(\mathit {LIKES} (subject, object_{1}, object_{2}, {\dots } ,\) objectn); representing the statement that an actor associated with the subject element likes a series of other elements; it is required for this type of predicate to have directed connections, the ElementToEdge direction denotes the subject, the EdgeToElement direction denotes the objects;

  • binary predicate \(\mathit {PARENT-OF} \left (parent, child \right )\); representing the family bonds between two actors, the edge has to have directed connections, the ElementToEdge direction denotes the parent, the EdgeToElement direction denotes the child;

This semantic network encodes the following data on the information and the knowledge layers.

5.2.1 Defining family members

Specific human entities have been represented as vertices. In our semantic network example, we have distinguished three people called John, Paul and Mary respectively. This fact requires the following vertices to be asserted:

(13)

5.2.2 Defining gender

In the semantic network three vertices representing gender (two for concepts representing specific genders and one for the gender itself) have been distinguished:

(14)

The fact that ∙ male and ∙ female are specific kinds of ∙ gender is represented by the ternary edge with IS-A semantics:

(15)

5.2.3 Assigning gender (instantiation)

Representation of the fact, that a specific human entity has a gender is modeled as the INSTANCE-OF directed edge which connects specific vertices (representing human and gender):

(16)

5.2.4 Parent-child relationship

Analogously to the instantiation, one can distinguish the binary predicates representing the facts on the parent-child relationship:

(17)

5.2.5 Liking

Another n-ary predicate has been used to model liking. In order to represent a fact that John likes that he is parent of Paul and that Paul is parent of Mary one need to construct a connection to the Egg, which represents objectified facts, as shown below:

(18)

In the case when objectification is not needed, one could refer to the edge itself, as the target of the connection. The structure below represents the fact that Mary likes Paul being male.

(19)

6 Data complexity of metamodels

The cohesion of the data modeling approaches presented in Section 4 to the data modeling based on the EGG structures is analysed. Due to the fact that that the EGG structures are graph-like at the same time being more general and more extensive than the remaining ones the methodology applied in this section is as follows.

Two case studies are examined: social training group (presented in Fig. 3) and semantic network example (depicted in Fig. 4). The first one stands for a small and comprehensive EGG which has been modeled in such a way that it forms a fully-featured EGG, i.e. covers all extensions and generalizations. The second one strives with solving the knowledge representation problem in a schema-less structure with abundance of the semantic predicates. The examination covers translation of these structures into the EGG structures which are consistent with the subsequent data metamodels.

The EGG structures have been elaborated in relation to the mappings described in Section 4. These EGG structures hold EGG-specific semantics presented in Section 3, which results from the approach. Another part of semantics has metamodel-specific form, which uses only EGG expressions having reasonable meaning in a particular data metamodel. For example, in the case of the data metamodels with the reference-based relationships, the edges might be binary and navigated in a strictly defined manner.

6.1 Resource description framework (RDF)

The Figs. 5 and 6 show the EGG representations containing the proper case study models expressed in the RDF.

Fig. 5
figure 5

The EGG structure resulting from the mapping of the social training group model to the RDF

Fig. 6
figure 6

The EGG structure resulting from the mapping of the semantic network example model to the RDF

The representations of the RDF expressions are simple because they are based on the triples of the Subject-Object-Predicate (SOP). Therefore, all relationships in the original EGG structures of a different arity had to be mapped to the binary edges. Edges of a different nature were reified to nodes. Moreover, the Subject-Predicate-Object (SPO) structures enforce a navigability, so a navigability was added in the case the model contained anon directed connection. Thus the RDF in the mapping to EGG has a Navi property, which is downright forced. The RDF data also has the Data property - when an element has a label, this is reflected by representing the node as a literal or by attaching an IRI to it. Additionally, it should be noted that it is possible to map the Multi property, but its occurrence in the social training groups model was not mapped due to the reification of the edge that represented such the relationship. The properties First Class, Shared Aggregation, Hyper, or Ultra cannot be represented in the RDF.

6.2 Extensible markup language (XML)

The Figs. 7 and 8 show the EGG representations containing the respective case study models expressed in the XML.

Fig. 7
figure 7

The EGG structure resulting from the mapping of the social training group model to the XML

Fig. 8
figure 8

The EGG structure resulting from the mapping of the semantic network example model to the XML

The mapping of the structures social training group and semantic network example in the XML was done in a way that reflects the specificity of the tree representation of the documents. The entire document is represented by the EGG, which lists the V ertices with Labels (representing the attributes). Vertices from the original structure have been represented by Egg, if they have the substructure in terms of the tree-based representation. Then, Edges implementing binary, unidirectional relationships, characteristic of the tree-based structure of XML have been reified and their semantics has been transferred to the grouping aspect of Egg. Vertices forming leaves in this aspect have been mapped directly to vertices. The remaining Edges were represented by the reference mechanism defined in the XML schemata. The directionality has been used to represent which attribute refers to the other.

The XML allows for the representation of the Hyper generalization thanks to the XMLREFS capability, allowing to refer to the multiple elements within one relationship representation. It is also possible to make the Multi generalization, as there is no limitation to referring to a given element multiple times within a single relationship representation. In terms of the extensions, all available EGG extensions are realizable with the XML. Due to the fact that some XML nodes are represented by Egg, this representation enables the property of First Class, i.e. it is possible to refer to it as a reference. Attributes give the opportunity to map Label. The Navi property is strictly defined regarding the direction of individual reference elements – but it is not possible to model navigable references in a bidirectional manner.

6.3 Relational database model (RDBM)

The diagrams on the Figs. 9 and 10 show the EGG representations containing the case study models expressed in terms of the Relational Database (RDB).

Fig. 9
figure 9

The EGG structure resulting from the mapping of the social training group model to the RDBM

Fig. 10
figure 10

The EGG structure resulting from the mapping of the semantic network example model to the RDBM

The mappings of the EGG structures into the EGG which have semantics and constraints of the RDBM have been based on the representation of both V ertices and Edges. Labels are represented as the attributes of these tuples. In order to keep a strong structure and a common set of the attributes for each tuple , data blocks containing NULL values have been generated. Aggregating tuples into the relationships also implies the need to group them. It was mapped into additional Eggs instances, which group structurally the homogeneous elements.

The RDBM allows the representation of the relatively simple EGG structures. The relationships realized by the referential integrity constraints are binary directed, and this direction is always from the referring element to the referred element. This implementation makes the Edges in the RDBM analogous to the RDF. Therefore, the RDBM does not support the Hyper property and has a Navigability property with the strongly defined rules. Since the relationships are not the first-class categories and therefore they cannot participate in other relationships, the Ultra property is not supported. Attributionality of tuples provides the Data property. The relation category instances mapped to Egg are used to show semantics of grouping, so Shared aggregation and First class do not exist in the RDBM.

6.4 Unified modeling language (UML)

The diagrams on Figs. 11 and 12 show EGG representations containing the case study models expressed in the UML.

Fig. 11
figure 11

The EGG structure resulting from the mapping of the social training group model to the UML

Fig. 12
figure 12

The EGG structure resulting from the mapping of semantic network example model to the UML

Implementation of structure mappings in the UML was based on an object diagram, within which a number of InstanceSpecification was considered compliant to the model based on a class diagram, classifying nodes as classes and edges as associations. It was assumed that the nodes with the EGG representing people will be matched by the instances of the class with the attribute representing data. The fact of distinguishing a subset of nodes performing the subsumption was taken into account as a subclass. The edges described with data were implemented as the specifications of an association class instance.

The internal structure of the UML instance is quite complex. UML satisfies the Data extension due to the possibility of specifying for the specific slot in InstanceSpecification. Due to the n-ary links, the instance models are able to have Hyper and Multi properties. The model specification also does not describe the limitations due to the classification of the instance specifications in terms of combining them – therefore the models have the Ultra property. It should be noted, however, that class diagrams that classify UML instance models in the form of object diagrams do not in any way support the ultragraph relationships. Due to the lack of a direct equivalent of the EGG, the properties directly related to it, i.e. First class and Shared aggregation, are not supported.

6.5 Association-oriented Metamodel (AOM)

The diagrams on Figs. 13 and 14 show the EGG representations containing the case study models expressed in the AOM.

Fig. 13
figure 13

The EGG structure resulting from the mapping of the social training group model to the AOM

Fig. 14
figure 14

The EGG structure resulting from the mapping of the semantic network example model to the AOM

The extensional part of an Association-Oriented Metamodel is a direct consequence of the inherent intensional part of the metamodel, expressing a structure based on the collections and the associations. This part of the metamodel operates on the categories such as Object, Role Object, and Association Object. The nodes represent Objects, while the edges represent Association Objects. Role objects, on the other hand, are mapped as connections. The fact that Objects have an internal structure, which is represented in the form of values is mapped through labels. Edges requiring data annotation were mapped as instances of the BACT [39] association-oriented structure pattern. Additionally, it should be noted that the Association-Oriented intensional categories such as Collection and Association have an extensional aspect, i.e. they group the individual object instances, therefore they were also mapped in the resulting Egg structure as the grouping Egg.

The extensional models of the Association-Oriented metamodel allow all generalizations and EGG extensions to be expressed except First class and Shared aggregation. They assume, in turn, the participation of Egg representation in the relationships and in the possibility of sharing the grouped category instances. Such properties of the extensional structures are not directly expressible in the Association-Oriented metamodel.

6.6 Possible implementations

The models and the considerations presented above apply to the case studies expressed in terms of the EGG. These structures map to the target metamodels that reflect the data capabilities that are consistent with the selected data metamodels. In order to talk about actual data structures, these EGG structures should be implemented in the form of the expressions in the specific metamodels. Below the implementations of the social training group (Listing 1) and the example of semantic network (Fig. 15) models for XML and the AOM are presented.

Listing 1
figure at

An XML implementation of the Social training group

Fig. 15
figure 15

An Association-Oriented implementation of the Semantic network example – the intensional part

Table 14 The EGG features configuration

The Listing 1 shows one possible implementation of the document that encodes the EGG structure expressed in the Fig. 7. The Figs. 15 and 16 show a possible implementation of the semantic network model in the AOM. The intensional part concerns the data structure, which allows for the expression of data compatible with it. The extensional model contains the direct translation of the EGG structure expressed in the Fig. 14.

Fig. 16
figure 16

An Association-Oriented implementation of the Semantic network example – the extensional part

6.7 Summary

The EGG features in terms of the analysed data metamodels are summarized in the Table 14. Presence of a specific feature in the extensional layer of a metamodel is marked by the black circle, while its absence - by the dash. The existence of a specific EGG feature in a metamodel but with some syntactical constraints thus enforcing some limitations is denoted by the white circle. The EGG can be used as a reference data metamodel when evaluation of other metamodel expressiveness and semantic capacity [39] is considered. As the result of such the analysis a right metamodel for a specific modeling issue may be selected better. This approach is appropriate if the EGG structure representation complexity of the domain problem is considered.

7 Evaluation of research results

The evaluation of the research results should be related to the measure notion. There are several approaches to defining a measure, which can be applied for the EGG concept:

  • a syntax-related

    • a one based on the distances between models created in the different metamodels in relation to the model in the EGG,

    • a one based on the distances between these parts of the metamodels, which were used in a particular model related to the EGG metamodel,

    • a one based on the distances between the complete metamodels related to the EGG metamodel,

  • a semantics-related.

There are also several measures defined for graphs, which are known from minimal-distance structural pattern recognition methods. There are also measures from the edit distance group, which are based on expressing the distance in terms of the total minimal cost of the addition or the removal of the graph nodes. The mentioned measures are interrelated.

The semantics-related approach only is taken into account in the presented paper due to the high conceptual and computational complexity of the syntax-related approaches, which are planned to be explored in the succeeding publications.

In order to evaluate the research results obtained we have proposed the measures to compute semantic preservation of the original EGG in the resulting EGG structures. The methodology of the evaluation is as follows. We have examined each EGG category instance of the resulting EGG structures in all data metamodels under consideration. Each category instance has been thoroughly evaluated and classified into one of the three groups:

  • Translated (T) – the category instances that are in the resulting structure and are direct mapping of the category from the original EGG,

  • Altered (A) – the category instances that are in the resulting structure however their function has been altered during mapping, i.e. via reification,

  • Forced (F) – the category instances that are in the resulting structure and have been forced by the resulting data metamodel due to its nature.

Moreover, there is the 4th category Lost (L) which categorizes the category pseudo-instances that are absent in the resulting EGG structure but were present in the original one.

Table 15 The results of the Social training group EGG case study evaluation

Taking into consideration this grouping, the following metrics have been proposed.

Regarding the Egg category:

$$ Eg_{T} = \! \frac{\text{number of} \textit{Egg} \text{instances classified as} \textit{T}}{\text{total number of} \textit{Egg} \text{instances and pseudo-instances}} $$
(20)
$$ Eg_{A} = \! \frac{\text{number of} \textit{Egg} \text{instances classified as} \textit{A}}{\text{total number of} \textit{Egg} \text{instances and pseudo-instances}} $$
(21)
$$ Eg_{F} = \! \frac{\text{number of} \textit{Egg} \text{instances classified as} \textit{F}}{\text{total number of} \textit{Egg} \text{instances and pseudo-instances}} $$
(22)
$$ Eg_{L} = \! \frac{\text{number of} \textit{Egg} \text{instances classified as} \textit{L}}{\text{total number of} \textit{Egg} \text{instances and pseudo-instances}} $$
(23)

Regarding the V ertex category:

$$ V_{T} = \! \frac{\text{number of} \textit{Vertex} \text{instances classified as }\textit{T}}{\text{total number of} \textit{Vertex} \text{instances and pseudo-instances}} $$
(24)
$$ V_{A} = \! \frac{\text{number of} \textit{Vertex} \text{instances classified as }\textit{A}}{\text{total number of} \textit{Vertex} \text{instances and pseudo-instances}} $$
(25)
$$ V_{F} = \! \frac{\text{number of} \textit{Vertex} \text{instances classified as} \textit{F}}{\text{total number of} \textit{Vertex} \text{instances and pseudo-instances}} $$
(26)
$$ V_{L} = \! \frac{\text{number of} \textit{Vertex} \text{instances classified as} \textit{L}}{\text{total number of} \textit{Vertex} \text{instances and pseudo-instances}} $$
(27)

Regarding the Edge category:

$$ E_{T} = \! \frac{\text{number of} \textit{Edge} \text{instances classified as} \textit{T}}{\text{total number of} \textit{Edge} \text{instances and pseudo-instances}} $$
(28)
$$ E_{A} = \! \frac{\text{number of} \textit{Edge} \text{instances classified as} \textit{A}}{\text{total number of} \textit{Edge} \text{instances and pseudo-instances}} $$
(29)
$$ E_{F} = \! \frac{\text{number of} \textit{Edge} \text{instances classified as} \textit{F}}{\text{total number of} \textit{Edge} \text{instances and pseudo-instances}} $$
(30)
$$ E_{L} = \! \frac{\text{number of} \textit{Edge} \text{instances classified as} \textit{L}}{\text{total number of} \textit{Edge} \text{instances and pseudo-instances}} $$
(31)

Regarding the Label category:

$$ L_{T} = \! \frac{\text{number of} \textit{Label} \text{instances classified as} \textit{T}}{\text{total number of} \textit{Label} \text{instances and pseudo-instances}} $$
(32)
$$ L_{A} = \! \frac{\text{number of} \textit{Label} \text{instances classified as} \textit{A}}{\text{total number of} \textit{Label} \text{instances and pseudo-instances}} $$
(33)
$$ L_{F} = \! \frac{\text{number of} \textit{Label} \text{instances classified as} \textit{F}}{\text{total number of} \textit{Label} \text{instances and pseudo-instances}} $$
(34)
$$ L_{L} = \! \frac{\text{number of} \textit{Label} \text{instances classified as} \textit{L}}{\text{total number of} \textit{Label} \text{instances and pseudo-instances}} $$
(35)

All above metrics are normalized in the range \(\left \langle 0, 1 \right \rangle \). The metrics have been applied to the Social training group case study. The results obtained have been shown in the Table 15.

The results show that in this case study the XML metamodel has been the most expressive in terms of Egg mapping, because has the highest value for EgT. This means that this metamodel served as the best one for realizing the nested structures. The direct application of node nesting and mapping the parent vertices to the Egg helped to achieve the high level of this value. The highest values of EgF are obtained by the RDB and the AOM. This stems from the fact, that those metamodels strictly follow the data schemata in order to implement the structural databases, thus require manifestations of the additional Egg instances. VT is the highest for the XML and the UML, which shows strict correspondence of EGG V ertex category instances to its mappings. Metamodels such as the RDF and the AOM need additional vertices, so they have non-zero VF values. This happens in RDF due to the triple nature, which requires that describing with Label creates artificial V ertex. For the AOM this is the result of the additional collection creations, because the associations cannot be data-labeled. The highest metrics for the translated edges ET was obtained by the AOM, due to the high semantic capacity of the associations and the roles. The lowest is for the RDF and the RDB, since they do not convey a lot of information in edge representations. For the RDF, most of semantics is determined by the RDFS which is located on the intensional level and thus is out of the scope of this paper. The RDB does not have the direct relationships thus they have been mapped to the referential integrity constraints (the foreign key constraints). Regarding the Label metrics, almost all data metamodels obtained the maximum value. The metamodels with value 0.9 force the additional labels, since strict structuring requires them to introduce neutral NULL-like values.

We have aggregated the obtained results using the following aggregation function:

$$ Agg = \frac{\sum\limits_{\mu \in \{Eg, V, E, L\}} w_{\mu} \cdot \frac{\mu_{T}}{\mu_{T} + \mu_{A} + \mu_{F} + \mu_{L}}}{\sum\limits_{\mu \in \{Eg, V, E, L\}}w_{\mu}} $$
(36)

where w denotes the weight. The weight vector contains the number of the specific category instances in the original Egg. The aggregation results along with the partial aggregations have been presented in the Table 16. Results show that the UML and the XML realize the original Egg in the most data semantics preserving way. The RDB and the RDF are the opposite.

Table 16 The aggregated results of the Social training group EGG case study evaluation

8 Conclusions

The EGG was defined to create the data models and due to its generality it forms a concept, which may be used for a comparative studies of the features of the other data modeling oriented metamodels. As the result, the EGG constitutes a reference metamodel and the modeling language when the data modeling problems are investigated. Its referential nature was demonstrated in Section 4 for mapping between the EGG and selected metamodels as well as in Section 6 for identifying the EGG features availability in the selected metamodels. The set of the metamodels taken into account can be extended to the other existing metamodels. Alternatively, the proposed mechanism can be applied for identifying the features of the newly discovered metamodels in future.

The EGG structures may also be used in various applications in computer science and machine learning tasks e.g. for computational problems which rely on network-like structures (Bayesian networks [40] or neural networks [41]) and for complex methods for graph signal processing [42].

The paper introduces a novel concept of measuring the expressiveness of metamodels in comparison to the EGG metamodel expressiveness. The measure is semantics-related one, so it compares the metamodels expressiveness via comparison of particular models created in the original model with the model transformed to EGG abstract syntax. This measure may help decide which metamodel is sufficient for modeling a particular reality. Other measures, the syntax-related ones may be also applied for leading comparative studies, what is planned to show in the succeeding publications.

In order to achieve the challenge of measuring the data metamodels expressiveness complete EGG abstract syntax was introduced in the paper. It was done through its implementations in the AOM and the UML modeling languages. Then two concrete syntaxes were defined and EGG categories semantics was specified. All these aspects together form the strong and the well defined mechanisms for working with the EGG. They make it possible to apply the EGG concept to very wide application domains. Its applicability is shown on the example of two case studies. One case study is an extension of the case study shown in the previous publication. The second one is dedicated to knowledge modeling, so it shows that the EGG is applicable for AI domain as well.