1 Introduction

Knowledge Graphs (KGs) are used to represent relationships between different entities such as people (e.g., Tom Hanks), places (e.g., Rome), events (e.g., Pitchfork Festival), and so on [24]. They organize knowledge in graph structures where the meaning of the data is encoded alongside the data in the graph. RDF is a data model for representing KGs that comes with an ecosystem of languages and protocols to foster interoperable data management. In RDF, graph nodes represent entities, identified by URIs, or literals (e.g., strings, numbers, etc.); edges represent relations between entities or between entities and literals, and are identified by RDF properties. Entities and literals are associated with types (classes, e.g., dbo:City, or datatypes, e.g., xsd:integer). The sets of possible types and properties are organized into ontologies, which specify the meaning of the used types and properties through logical axioms. Intuitively, ontologies provide the schema of the KG, but, remarkably, data and schema are loosely coupled in KGs, with potential mismatches and diverging evolution over time. In addition, KGs are often very large and evolve over time. The classical example of this evolution is the Linked Data Cloud, which had grown to roughly 1255 datasets as of February 2021.

KGs support several data-intensive tasks related to data management, information integration, natural language processing, and inference in research and industry [31]. Eventually, they feed, often in combination with machine learning methods, an increasing number of downstream applications such as recommender systems [21] and question answering interfaces [12]. Existing edges can also be used to train soft inference models, e.g., based on knowledge graph embeddings, so as to predict missing or probable arcs based on latent features. These models have been used for biomedical applications [29, 30], e.g., to predict drug–target interactions. For downstream applications, and, especially, for applications that combine KGs and machine learning, it is important to support domain experts with a clear picture of the content, structure and quality of the input KG. Low-quality or misinterpreted input data may lead to unreliable output models, as expressed by the well-known colloquial motto “garbage in–garbage out.”

ABSTAT is a data profiling approach [46] and tool [35] introduced to let users explore the content and structure of large KGs and also inspect potential quality issues. ABSTAT takes an RDF dataset and, optionally, an ontology used in the dataset as input, and computes a semantic profile. The semantic profile consists of a summary, which provides an abstract but complete description of the dataset content, and some statistics.

The informative units of ABSTAT’s summaries are Abstract Knowledge Patterns (AKPs), named simply patterns in the following, which have the form \((\texttt {subjectType, pred,} \texttt { objectType})\).

Patterns represent the occurrence of triples < sub, pred, obj> in the data, such that subjectType is the most specific type of the subject and objectType is the most specific type of the object [46].

For example, the pattern \((\texttt {dbo:SoccerPlayer,} \texttt {dbo:team, dbo:SoccerClub})\) represents the occurrence of triples that represent entities of type dbo:SoccerPlayer linked with entities of type dbo:SoccerClub through the property dbo:team. The types dbo:SoccerPlayer and dbo:SoccerClub are the most specific types for the respective entities, which may have also more generic types such as dbo:Athlete or dbo:Person and dbo:SportsClub or dbo:Organisation.

The most specific type is computed with the help of the ontology. Such a choice allows ABSTAT to have a compact but complete summary, by excluding several more generic redundant patterns using the ontology.

ABSTAT profiles can be explored by users through its web interface or by machines using APIs. Rich profiles such as the ones computed by ABSTAT support automatic feature selection for semantic recommender systems [15, 36] and vocabulary suggestions for data annotation, as in [40], and help in the detection of quality problems [45, 46].

When processing large KGs, as is often the case for downstream machine learning tasks, it is critical to reduce the latency between processing the graph and the availability of the results. Such latency is often a result of platform start-up costs (e.g., MapReduce [34]) or the complexity of graph processing algorithms.

Sometimes the time needed to compute the results may reach several hours or even days, up to the eventual failure of the computation. According to a recent survey on the challenges of large graph processing [39], scalability is the most pressing challenge faced by all participants, who reported problems in processing very large graphs efficiently. The reported limitations include inefficiencies in loading, updating, and performing computations on large graphs. Processing real-world graphs often surpasses the capability of single computers.

A solution to these challenges is switching to the distributed computing paradigm and deploying large graph processing algorithms on a collection of computing nodes, whose configuration can fit storage and computing resources according to the end users’ requirements. However, it has been reported that many algorithms at the core of the existing techniques are not ready to be implemented on top of today’s graph processing infrastructures, which rely on horizontal scalability [37], with many algorithms being inherently sequential and difficult to parallelize.

Several approaches have been proposed to adapt models and frameworks for graph processing to the distributed computing paradigm [2, 32, 33, 50].

In the domain of KG management and profiling, Sansa is the most notable example of a natively distributed solution. It provides a unified framework for several downstream tasks such as link prediction, knowledge base completion, querying, reasoning and, also, profiling [25]. Similarly to ABSTAT, it has a modular architecture, provides the end user with 32 RDF statistics (such as the number of triples, RDF terms, properties per entity, and usage of vocabularies across datasets), and applies quality assessment in a distributed manner. However, Sansa profiling is not based on a summarization model; ABSTAT profiling digs deeper into semantic features of the KG and makes use of ontologies associated with the data.

Among the statistics calculated by ABSTAT, cardinality descriptors are among the most distinctive and the hardest to compute. Cardinality descriptors provide information about the relationships between subjects and objects of the triples at the pattern level, which reveals valuable information about the data. For example, cardinality statistics for the pattern \((\texttt {dbo:Company}, \texttt {dbo:keyPerson}, \texttt {owl:Thing})\) in DBpedia 2015-10 reveal that 5263 different entities of type dbo:Company are connected with a single entity of type owl:Thing through the property dbo:keyPerson [45].

We identified this entity as dbr:Chief_executive_officer, which in DBpedia does not have a type, thus defying purely syntactic vocabulary-based statistics. This generic entity is used as a placeholder for 5263 companies with unspecified CEO, including dbr:Kodak, dbr:Telefónica, dbr:Allianz, and many others. In semantic-aware ABSTAT profiles, such an entity is associated with the default upper type owl:Thing and featured as an outlier in the cardinality estimated for the pattern, which leads to spotting the anomaly. What would have happened if we had trained a link predictor using these triples as positive samples? For large KGs, cardinality descriptors, which have also been proved useful in recommender systems [15, 36], could be computed only with extremely large execution times or could not be computed at all.

In this paper, we present ABSTAT-HD, that is, Highly Distributed ABSTAT. The framework supports the distributed computation of ontology-based summaries and profiles of very large KGs, thus overcoming scalability problems of our previous solution. To the best of our knowledge, this is the first KG profiling approach that supports summarization on a distributed computing infrastructure. The framework supports full ABSTAT profiles that include novel features such as cardinality descriptors, which were used in previous work [15] but never properly defined within our model. In addition, the framework supports property-based minimalization, pattern inference, and instance count statistics, new features that complete the strategy adopted in ABSTAT to remove redundant patterns and better count the data represented by the patterns. In conclusion, we can summarize the main contributions of this paper with respect to the previous work as follows:

  • A formal and complete definition of the summarization model, which is the backbone of the ABSTAT tool.

  • A new algorithm based on the relational model for calculating the summary model.

  • ABSTAT-HD, a highly distributed and scalable tool for processing and producing profiles for very large RDF graphs.

  • A set of experiments that show the scalability of ABSTAT-HD with respect to the previous version of ABSTAT.

  • A report about quality issues found in the very large Microsoft Academic Knowledge Graph, to provide more qualitative insights into the informativeness of our profiles.

This paper is organized as follows: Sect. 2 formally introduces the ABSTAT summarization model, while in Sect. 3 we present the process of profile construction and in Sect. 4 the architecture of ABSTAT-HD. Section 5 discusses a large set of experiments that evaluate the scalability of ABSTAT-HD using existing large KGs under different controlled hardware configurations. Section 6 discusses related work on profiling tools for KGs and tools for graph processing, while in Sect. 7 we draw conclusions and outline future work.

2 Profiling model

We first introduce some preliminary definitions needed to explain ABSTAT profiles, then present the summarization and profiling models used in the paper.

2.1 Preliminaries: datasets, assertions, and terminologies

ABSTAT is developed to profile RDF data, which natively represent KGs. In the rest of the paper we use the term “dataset” as equivalent to “KG.” In fact, our profiling model is formalized to be applicable to any KG that can be interpreted as a set of triples \(<subject,predicate,object>\), and where entities (individuals) and literals are associated with types. Ontologies, which formally specify the terminology used to describe the entities, are leveraged to make profiles more compact. Ontologies for RDF data are usually specified with axioms of the RDFS and OWL2 languages, which are interpreted as Description Logics (DLs) axioms [47].

We define a dataset (equivalently, a KG) by borrowing the definition of Knowledge Base in DLs, i.e., as consisting of a terminology (TBox in DLs; intuitively, the schema) and a set of assertions about individuals (ABox in DLs; intuitively, the actual data).

Definition 1

Dataset: A dataset \(\varDelta =({\mathcal {T}},{\mathcal {A}})\) is a pair, where \({\mathcal {T}}\) is a set of terminological axioms, and \({\mathcal {A}}\) is a set of assertions.

We first define \({\mathcal {A}}\) in more detail, the actual data being the focus of our profiles, and then \({\mathcal {T}}\), which supports the profiling process. Since DLs are tractable fragments of the well-known First-Order Logic (FOL), we find it more convenient to present our model using a FOL notation for axioms in \({\mathcal {A}}\) and \({\mathcal {T}}\).

We use symbols like C, D to denote types (unary predicates in FOL), symbols like P, Q to denote properties (binary predicates in FOL), and symbols like a, b to denote named individuals or literals (constants in FOL).

Assertions in \({\mathcal {A}}\) are of two kinds: typing assertions having the form \(C(a)\), and relational assertions having the form \(P(a,b)\), where a is a named individual (or, simply, individual) and b is an individual or a literal. We denote the sets of typing and relational assertions by \({\mathcal {A}}^{C}\) and \({\mathcal {A}}^{P}\), respectively. We consider \(C(a)\in {\mathcal {A}}\) whenever we find RDF triples of the form \(<a,\texttt {rdf:type},C>\), where a and C are URIs, or \(<a,P,b^{\tiny {\wedge \wedge }} C>\), where b is a literal and C its datatype. In addition, we assign to each untyped individual or literal occurring in \({\mathcal {A}}\) a default type, that is, respectively, \(\texttt {owl:Thing}\) or \(\texttt {rdfs:Literal}\). A literal occurring in a triple can have at most one type (because typing is implicitly encoded in triples like \(<a,P,b^{\tiny {\wedge \wedge }} C>\)). Conversely, an individual can have many types. A relational assertion \(P(a,b)\) is any triple \(<a,P,b>\) such that \(P\notin M^P\), where \(M^P\) is a set of predicates that are reserved for modeling purposes. In this set we include rdf:type, all the predicates used to model the terminology (e.g., rdfs:subClassOf, rdfs:domain), and any other predicate that is not considered relevant for the profile.
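The separation of typing and relational assertions can be illustrated with a minimal Python sketch. It is not the ABSTAT implementation: the `Literal` class, the contents of `RESERVED` (standing in for \(M^P\)), and the tuple encodings `(C, a)` and `(P, a, b)` are assumptions made only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Literal:
    """Hypothetical literal node: a lexical value plus an optional datatype."""
    value: str
    datatype: Optional[str] = None


# Assumed stand-in for M^P, the predicates reserved for modeling purposes.
RESERVED = frozenset({
    "rdf:type", "rdfs:subClassOf", "rdfs:subPropertyOf",
    "rdfs:domain", "rdfs:range",
})


def split_assertions(triples):
    """Return (typing, relational): sets of (C, a) and (P, a, b) tuples."""
    typing, relational = set(), set()
    for s, p, o in triples:
        if p == "rdf:type":
            typing.add((o, s))                      # typing assertion C(a)
        elif p not in RESERVED:
            relational.add((p, s, o))               # relational assertion P(a, b)
            if isinstance(o, Literal):
                # A literal has at most one type: its datatype or the default.
                typing.add((o.datatype or "rdfs:Literal", o))
    # Untyped individuals receive the default type owl:Thing.
    typed = {x for _, x in typing}
    for _, a, b in relational:
        for x in (a, b):
            if x not in typed:
                typing.add(("owl:Thing", x))
    return typing, relational
```

Triples whose predicate is in `RESERVED` (other than `rdf:type`) are simply skipped here; in a full pipeline they would feed the terminology instead.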

The terminology \({\mathcal {T}}\) may contain an arbitrary set of axioms, but our profiling model uses only axioms specifying that C is a subtype of D (subtype axioms) and P is a subproperty of Q (subproperty axioms), which can be expressed by the FOL formulas \(\forall x (C(x)\rightarrow D(x))\) and \(\forall x,y (P(x,y)\rightarrow Q(x,y))\), respectively. We apply a completion of \({\mathcal {T}}\) inspired by OWL2 semantics. We inject the types and properties that occur in \({\mathcal {A}}\) into \({\mathcal {T}}\) and specify their upper types and properties: all named classes and datatypes occurring in \({\mathcal {A}}\) are subtypes, respectively, of \(\texttt {owl:Thing}\) and \(\texttt {rdfs:Literal}\); all the properties that have some object that is an individual are subproperties of \(\texttt {owl:TopObjectProperty}\) and all the properties that have some object that is a literal are subproperties of \(\texttt {owl:TopDataProperty}\). Therefore, if an ontology is not associated with a dataset to be profiled, we can still consider a default \({\mathcal {T}}\) built from \({\mathcal {A}}\).

With \(V^{{\mathcal {T}}}\) we refer to the terminology-level vocabulary of a dataset, which consists of a set \(\texttt {N}^{C}\) of types (which always includes \(\texttt {owl:Thing}\) and \(\texttt {rdfs:Literal}\)) and a set \(\texttt {N}^{P}\) of properties (which always includes \(\texttt {owl:TopObjectProperty}\) and \(\texttt {owl:TopDataProperty}\)).

Fig. 1 A small graph representing a dataset

2.2 Ontology-based summarization

Abstract Knowledge Patterns (AKPs), equivalently referred to as patterns in the rest of the paper, represent schema-level patterns used to model assertions about individuals in a given domain. In particular, we consider patterns that model the existence of entities with certain properties and can be formalized by existentially quantified formulas in FOL.

Definition 2

Patterns: A pattern is a triple \((C,P,D)\), such that C and D are types and P is a property, which is interpreted by the FOL formula \(\exists x \exists y (C(x) \wedge D(y) \wedge P(x,y))\).

Intuitively, an existential pattern states that there are individuals of type C that are linked to individuals or literals of a type D by a predicate P.

Our goal is to summarize a dataset, and, more specifically, the assertions \({\mathcal {A}}\), by defining a set of patterns that represent the full content of \({\mathcal {A}}\) in a compact way. Profiles will add statistics about the assertions represented by each pattern to the summaries.

Definition 3

Patterns and represented assertions: A pattern \((C,P,D)\) represents a relational assertion \(P(a,b) \in {\mathcal {A}}\) iff there exists a set \(\{C(a),D(b),P(a,b)\}\subseteq {\mathcal {A}}\).

We denote with \(\varPi ^{{\mathcal {A}}}\) the set of patterns that represent all the relational assertions in \({\mathcal {A}}\). To make summaries compact, we observe that many of the patterns that represent a relational assertion can be inferred from a small subset of more specific patterns if we consider constraints between types and properties specified in the terminology \({\mathcal {T}}\). Consider the example graph depicted in Fig. 1, where typing assertions and subclass/subproperty constraints are depicted as arcs between nodes, and upper level types and properties are omitted (namely, \(\texttt {owl:Thing}\), \(\texttt {rdfs:Literal}\), \(\texttt {owl:TopObjectProperty}\) and \(\texttt {owl:TopDataProperty}\)). The assertion \(P(a,b)\) is represented by many different patterns: for instance, \(\exists x \exists y (C(x) \wedge F(y) \wedge P(x,y))\) and \(\forall x (C(x)\rightarrow A(x))\) obviously imply \(\exists x \exists y (A(x) \wedge F(y) \wedge P(x,y))\), as well as \(\exists x \exists y (A(x) \wedge D(y) \wedge P(x,y))\) and so forth via subtype axioms. Based on this principle, the model used in basic ABSTAT summaries and presented in previous work [46] would consider \((C,P,F)\) as the most specific representative pattern for \(P(a,b)\).

Here we complete the model by resolving the unbalanced treatment of types and properties: properties, too, can have dependencies specified by subproperty axioms. In the example, the pattern \((C,Q,F)\) is even more specific than \((C,P,F)\) because \(\forall x,y (Q(x,y)\rightarrow P(x,y))\), which also lets us infer \(P(a,b)\) from \(Q(a,b)\), i.e., \(P(a,b)\) is redundant in the set of assertions based on the terminology. Therefore, we need to generalize our previous model by defining a subpattern relation over the patterns to represent that, in a pair of patterns, one is more specific than the other.
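As an illustration of Definition 3, the following Python sketch enumerates every pattern that represents each relational assertion by pairing all asserted types of the subject with all asserted types of the object. The tuple encodings `(C, a)` and `(P, a, b)` and the function name are assumptions made for this example, not part of ABSTAT.

```python
from collections import defaultdict
from itertools import product


def representing_patterns(typing, relational):
    """Map each pattern (C, P, D) to the relational assertions it represents.

    typing:     set of (C, a) tuples, i.e., typing assertions C(a)
    relational: set of (P, a, b) tuples, i.e., relational assertions P(a, b)
    """
    types = defaultdict(set)
    for c, x in typing:
        types[x].add(c)
    patterns = defaultdict(set)
    for p, a, b in relational:
        # (C, P, D) represents P(a, b) iff C(a) and D(b) are both asserted.
        for c, d in product(types[a], types[b]):
            patterns[(c, p, d)].add((p, a, b))
    return patterns
```

With types `C` and `A` asserted for `a` and type `F` asserted for `b`, the single assertion `P(a, b)` is represented by both `(C, P, F)` and `(A, P, F)`, which shows why the unminimized pattern set grows quickly.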

For simplicity, we introduce a terminology graph \(G^{{\mathcal {T}}}\) as a proxy that represents relations among types/properties specified by axioms in \({\mathcal {T}}\). By posing \(|\sim \) as a relation between a terminology and subtype/subproperty relations derived from it, we define a terminology graph as follows.

Definition 4

Terminology Graph: A terminology graph is the disjoint sum [48] of two posets: a type poset \((\texttt {N}^C,\preceq ^{G^{{\mathcal {T}}}})\) such that \(\texttt {N}^C\) is a set of types and for all \(C,D \in \texttt {N}^C\), \(C \preceq ^{G^{{\mathcal {T}}}} D\) iff \({\mathcal {T}}|\sim C \preceq ^{G^{{\mathcal {T}}}} D\); a property poset \((\texttt {N}^P,\preceq ^{G^{{\mathcal {T}}}})\) such that \(\texttt {N}^P\) is a set of properties and for all \(P,Q \in \texttt {N}^P\), \(P \preceq ^{G^{{\mathcal {T}}}} Q\) iff \({\mathcal {T}}|\sim P \preceq ^{G^{{\mathcal {T}}}} Q\).

To specify the relation \(|\sim \) we can rely either on explicit or on inferred axioms in \({\mathcal {T}}\). We prefer the first strategy for practical reasons: some web ontologies may have unintended inferences that can mess up the intended type and property hierarchies. Similar reasons also suggest ignoring equivalence relations between named classes and properties, which frequently introduce counter-intuitive inferences (e.g., node collapse), as further discussed in previous work [46].
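A minimal Python sketch of one poset of the terminology graph is given below. It assumes that only explicit subtype (or subproperty) axioms are used, given as `(child, parent)` pairs, and it materializes the reflexive-transitive closure as a dictionary from each element to its ancestors. The function names are hypothetical.

```python
from collections import defaultdict


def ancestors_of(sub_edges):
    """Reflexive-transitive closure of explicit (child, parent) axioms.

    Returns a dict mapping each element to the set containing itself and
    all of its ancestors.
    """
    parents = defaultdict(set)
    nodes = set()
    for child, parent in sub_edges:
        parents[child].add(parent)
        nodes.update((child, parent))
    closure = {}
    for n in nodes:
        seen, stack = {n}, [n]
        while stack:                       # depth-first walk towards the top
            for p in parents[stack.pop()]:
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        closure[n] = seen
    return closure


def leq(closure, x, y):
    """True iff x is below or equal to y; unknown elements relate only to themselves."""
    return y in closure.get(x, {x})
```

The same function can be applied separately to subclass and subproperty axioms, which mirrors the disjoint sum of a type poset and a property poset.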

Now we can transfer the order relation \(\preceq ^{G^{{\mathcal {T}}}}\) from the terminology graph to the patterns, by defining a product partial order \((\texttt {N}^C \times \texttt {N}^P \times \texttt {N}^C ,\preceq ^{G^{{\mathcal {T}}}})\) that can be interpreted as a subpattern relation as defined below.

Definition 5

Subpatterns: A pattern \((C,P,D)\) is a subpattern of a pattern \((C',Q,D')\) with respect to a terminology graph \(G^{\mathcal {T}}\), denoted by \((C,P,D) \preceq ^{G^{{\mathcal {T}}}} (C',Q,D')\), iff \(C \preceq ^{G^{{\mathcal {T}}}} C'\), \(D \preceq ^{G^{{\mathcal {T}}}} D'\) and \(P\preceq ^{G^{{\mathcal {T}}}} Q\).

Observe that this poset has, by definition, two upper level patterns: \((\texttt {owl:Thing},\texttt {owl:TopObjectProperty},\texttt {owl:Thing})\) and \((\texttt {owl:Thing},\texttt {owl:TopDataProperty},\texttt {rdfs:Literal})\). The subpattern relation is eventually used to select, for some input relational assertion, those patterns that are most specific, i.e., minimal in the subpattern poset, among the patterns that represent it.

We observe that some relational assertions can be inferred from other relational assertions and the property poset \((\texttt {N}^P,\preceq ^{G^{{\mathcal {T}}}})\), e.g., \(P(a,b)\) can be inferred from \(Q(a,b)\) whenever \(Q \preceq ^{G^{{\mathcal {T}}}} P\).

Let us consider the strict order relations \(\prec ^{G^{{\mathcal {T}}}}\) that are the irreflexive counterparts of the posets induced by \(\preceq ^{G^{{\mathcal {T}}}}\), where \(X \prec ^{G^{{\mathcal {T}}}} Y\) implies that \(X \ne Y\) whatever X and Y are (types, properties or patterns). We say that a relational assertion \(P(a,b)\in {\mathcal {A}}\) is redundant (based on \(G^{{\mathcal {T}}}\)) if and only if there exists some relational assertion \(Q(a,b)\in {\mathcal {A}}\) such that \(Q\prec ^{G^{{\mathcal {T}}}}P\). Since the property poset is finite, there are minimal properties that ensure that, given a redundant assertion \(P(a,b)\), we can always define a set of relational assertions from which \(P(a,b)\) can be inferred. We refer to this set of non-redundant assertions from which \(P(a,b)\) can be inferred as the \(G^{{\mathcal {T}}}\)-inference base of \(P(a,b)\).
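The subpattern relation can be checked componentwise, following the product order introduced above, so that a subpattern is the more specific pattern in each position. The sketch below assumes that the type and property posets are available as dictionaries mapping each element to the set of itself and its ancestors; this encoding and the function names are illustrative assumptions.

```python
def leq(closure, x, y):
    """x is below or equal to y in a poset given as element -> {self + ancestors}."""
    return y in closure.get(x, {x})


def is_subpattern(sub, sup, type_closure, prop_closure):
    """True iff pattern `sub` is a subpattern of pattern `sup` (product order)."""
    (c, p, d), (c2, q, d2) = sub, sup
    return (leq(type_closure, c, c2)
            and leq(prop_closure, p, q)
            and leq(type_closure, d, d2))
```

Under this reading, every object-property pattern is a subpattern of the upper level pattern built from `owl:Thing` and `owl:TopObjectProperty`, provided the completed terminology places these elements at the top of the closures.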
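Redundancy of relational assertions can be sketched in Python as follows. The example assumes that the property poset is given as a dictionary from each property to the set of itself and its super-properties, and that assertions are `(P, a, b)` tuples; both are encodings chosen for illustration only.

```python
from collections import defaultdict


def non_redundant(relational, prop_closure):
    """Drop every P(a, b) for which some Q(a, b) with Q strictly below P exists."""
    by_pair = defaultdict(set)
    for p, a, b in relational:
        by_pair[(a, b)].add(p)
    keep = set()
    for p, a, b in relational:
        redundant = any(
            q != p and p in prop_closure.get(q, {q})   # q is strictly below p
            for q in by_pair[(a, b)]
        )
        if not redundant:
            keep.add((p, a, b))
    return keep
```

Grouping the properties by `(a, b)` pair keeps the check local to each pair of nodes, which is also what makes this step easy to express as a grouped operation in a relational or distributed setting.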

Definition 6

Minimal Patterns: A pattern \(\pi \) is a minimal pattern for a relational assertion \(P(a,b)\in {\mathcal {A}}\) and a terminology graph \(G^{\mathcal {T}}\) iff one of the two following conditions applies: (1) \(P(a,b)\) is not redundant, \(\pi \) represents \(P(a,b)\), and there does not exist a pattern \(\pi '\) that represents \(P(a,b)\) such that \(\pi ' \prec ^{G^{{\mathcal {T}}}} \pi \); (2) \(P(a,b)\) is redundant, \(\pi \) represents some assertion \(Q(a,b)\) in the \(G^{{\mathcal {T}}}\)-inference base of \(P(a,b)\), and there does not exist a pattern \(\pi '\) that represents \(Q(a,b)\) such that \(\pi ' \prec ^{G^{{\mathcal {T}}}} \pi \).

Observe that in the first case a minimal pattern will have the form \((C,P,D)\), while in the second case it will have the form \((C,Q,D)\) with \(Q \prec ^{G^{{\mathcal {T}}}}P\).

In the rest of the paper we use the following expressions: a pattern \(\pi \) minimally represents a relational assertion \(P(a,b)\in {\mathcal {A}}\) (under a terminology graph \(G^{{\mathcal {T}}}\)) iff \(\pi \) is a minimal pattern for \(P(a,b)\) and \(G^{{\mathcal {T}}}\). Conversely, we say that \(P(a,b)\) is minimally represented (under a terminology graph \(G^{{\mathcal {T}}}\)) by all patterns that minimally represent it. By applying pattern minimalization to a set of relational assertions (with an input terminology), we obtain the set of patterns that minimally represent all of them, also referred to as its Minimal Pattern Base (MPB). A summary consists of a terminology graph and an MPB for an input dataset \(\varDelta =({\mathcal {T}},{\mathcal {A}})\).

Definition 7

Minimal Pattern Base: A minimal pattern base for a set of assertions \({\mathcal {A}}\) under a terminology graph \(G^{\mathcal {T}}\) is a set of patterns \(\varPi ^{{\mathcal {A}},{\mathcal {T}}}\) such that \(\pi \in \varPi ^{{\mathcal {A}},{\mathcal {T}}}\) iff \(\pi \) minimally represents some \(\phi \in {\mathcal {A}}^{P}\) under \(G^{{\mathcal {T}}}\).

Definition 8

Summary: A summary of a dataset \(\varDelta =({\mathcal {T}},{\mathcal {A}})\) is a pair \(\varSigma =(G^{{\mathcal {T}}},\varPi ^{{\mathcal {A}}, {\mathcal {T}}})\) such that \(G^{{\mathcal {T}}}\) is a terminology graph derived from \({\mathcal {T}}\) and \(\varPi ^{{\mathcal {A}},{\mathcal {T}}}\) is a minimal pattern base for \({\mathcal {A}}\) under \(G^{{\mathcal {T}}}\).

Observe that different patterns can be extracted for an assertion \(P(a,b)\) if a and/or b have more than one minimal type. However, minimalization is able to exclude many patterns that can be entailed following the \(\preceq ^{G^{{\mathcal {T}}}}\) relation and that do not minimally represent any \(P(a,b)\).

For example, the MPB for the dataset in Fig. 1 includes the patterns \((E,Q,F)\), \((E,R,T)\), \((C,Q,F)\), \((C,R,T)\), that is, only four of the twenty-four patterns that represent the assertions (in this count we excluded patterns including upper types and properties, which are omitted in the figure). The MPB excludes patterns like \((B,Q,D)\) and \((C,Q,D)\), but also \((C,P,F)\), as a result of considering properties in the minimalization process and extending the relation \(\preceq ^{G^{{\mathcal {T}}}}\) over \(\texttt {N}^P\).

Although very few ontologies make intensive use of subproperty relations, we believe that minimalization with respect to both the type and the property hierarchies is important to generalize the model and to provide a more robust summarization mechanism for future scenarios. However, to provide users with flexible configuration choices, minimalization over properties is optional and can be disabled, keeping only type-based minimalization.

Definition 7 extends the definition of minimal patterns that considers type-based minimalization [46]. Observe that when we minimalize over properties, redundant relational assertions become irrelevant for including patterns in the MPB: the patterns that minimally represent redundant relational assertions are patterns that also minimally represent some non-redundant assertions. A similar approach based on the identification of redundant assertions can also be applied to typing assertions, some of which can be inferred from \({\mathcal {T}}\) and \(G^{{\mathcal {T}}}\) and thus considered redundant. We define redundant typing assertions similarly to redundant relational assertions. Observe that, also for a redundant typing assertion \(C(a)\), it is always possible to track the set of non-redundant assertions from which \(C(a)\) is inferred based on \(G^{{\mathcal {T}}}\). The computation of the minimal pattern base will use this intuition and prune redundant relational and typing assertions from \({\mathcal {A}}\) so as to compute the patterns that represent the non-redundant assertions (recall that, based on Definition 3, a pattern \((C,P,D)\) represents a relational assertion \(P(a,b)\) in an assertion set \({\mathcal {A}}\) if and only if \(\{C(a),P(a,b),D(b)\}\subseteq {\mathcal {A}}\)).

Let \({\mathcal {A}}^{-} = {\mathcal {A}} \setminus \{\phi \mid \phi \in {\mathcal {A}} \text { and } \phi \text { is redundant based on } G^{{\mathcal {T}}}\}\); we refer to \({\mathcal {A}}^{-}\) as the non-redundant counterpart of \({\mathcal {A}}\). Then the following equivalence can be proved (see the Appendix for the proof).

Theorem 1

An MPB \(\varPi ^{{\mathcal {A}},{\mathcal {T}}}\) for a set of assertions \({\mathcal {A}}\) under a terminology graph \(G^{\mathcal {T}}\) is equivalent to the set \(\varPi ^{{\mathcal {A}}^{-}}\) of patterns that represent the relational assertions in \({\mathcal {A}}^{-}\).

The above theorem is also useful to better explain how ABSTAT-HD can be adapted to compute summaries incrementally and deal with changes in the ABox by updating profiles and statistics locally. The key idea is that, if changes in the ABox concern redundant assertions, profiles and statistics do not change; if changes affect new or non-redundant assertions, the profiles are updated after tracking the assertions that can be inferred from the ones affected by changes (e.g., some assertions may change their status from redundant to non-redundant or vice versa). However, in this paper we focus on the implementation and validation of batch profiling, leaving some additional details about incremental profiling to the Appendix.
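Theorem 1 suggests a direct way to compute an MPB: prune redundant typing and relational assertions, then collect the patterns that represent what is left. The Python sketch below follows this reading on in-memory sets; it is not the distributed ABSTAT-HD algorithm. It assumes that default types have already been assigned to every node and that each poset is encoded as a dictionary from an element to the set of itself and its ancestors.

```python
from collections import defaultdict
from itertools import product


def strictly_below(closure, x, y):
    """x is strictly below y: y is a proper ancestor of x."""
    return x != y and y in closure.get(x, {x})


def minimal_pattern_base(typing, relational, type_closure, prop_closure):
    """Patterns representing the non-redundant counterpart of the assertions.

    typing:     set of (C, a) tuples; every node is assumed to be typed
    relational: set of (P, a, b) tuples
    """
    types = defaultdict(set)
    for c, x in typing:
        types[x].add(c)
    # Non-redundant typing assertions: keep only the minimal types of each node.
    min_types = {
        x: {c for c in cs
            if not any(strictly_below(type_closure, d, c) for d in cs)}
        for x, cs in types.items()
    }
    props = defaultdict(set)
    for p, a, b in relational:
        props[(a, b)].add(p)
    mpb = set()
    for (a, b), ps in props.items():
        for p in ps:
            # Skip redundant relational assertions.
            if any(strictly_below(prop_closure, q, p) for q in ps):
                continue
            for c, d in product(min_types[a], min_types[b]):
                mpb.add((c, p, d))
    return mpb
```

A node with two incomparable minimal types contributes one pattern per type, which matches the observation that several minimal patterns can be extracted for the same assertion.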

2.3 Profiles and statistics

A profile extends a summary of a dataset by associating statistics with its patterns and with the elements of its vocabulary, referred to as \(V^\varSigma \).

Definition 9

Profile: A profile of a dataset \(\varDelta =({\mathcal {T}},{\mathcal {A}})\) is a pair \((\varSigma ^{{\mathcal {A}}, {\mathcal {T}}},S)\) such that \(\varSigma ^{{\mathcal {A}}, {\mathcal {T}}}\) is a summary with a minimal pattern base \(\varPi ^{{\mathcal {A}},{\mathcal {T}}}\), and S is a set of functions \(s: \varPi ^{{\mathcal {A}},{\mathcal {T}}} \cup V^\varSigma \rightarrow {\mathbb {R}} \).

Pattern statistics are computed on the patterns that are in the summary (i.e., which minimally represent some relational assertion in the dataset) by considering the assertions that they minimally represent or represent. While some basic statistics like pattern frequency [46] can be computed by processing one assertion at a time (yet, with scalability problems for very large datasets), new statistics like cardinality descriptors, defined below, require methods to group assertions by their representative patterns. As a consequence, they can hardly be computed without the distributed profiling solution described in this paper, even for datasets of relatively small size.

Pattern frequency count. The frequency of a pattern \(\pi \) is defined as the number of non-redundant relational assertions it minimally represents. Note that the frequency of a pattern is always a value between one and the number of non-redundant relational assertions it represents.
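Pattern frequency can be sketched as a simple counter over the non-redundant relational assertions. The inputs below are assumptions made for the illustration: `min_types` maps every node to its minimal types (defaults included), and `relational_nr` contains only non-redundant `(P, a, b)` tuples.

```python
from collections import Counter
from itertools import product


def pattern_frequency(min_types, relational_nr):
    """Count, for each minimal pattern, the assertions it minimally represents."""
    freq = Counter()
    for p, a, b in relational_nr:
        for c, d in product(min_types[a], min_types[b]):
            freq[(c, p, d)] += 1
    return freq
```

Because an assertion whose subject or object has several minimal types is counted once for each of its minimal patterns, the frequencies of all patterns may add up to more than the number of assertions.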

Pattern instances count. Let us first define what we mean with instances of a pattern \(\pi \):

$$\begin{aligned} inst(\pi ) = {\left\{ \begin{array}{ll} \{P(a,b) \in {\mathcal {A}}^{-} | \pi \;\text {represents} \; P(a,b) \} &{} \text {if}\ sub(\pi )=\{\pi \} \\ \displaystyle \bigcup _{\forall \rho \in sub(\pi )} inst(\rho ) &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

where \(sub(\pi ) = \{\rho \in MPB| \rho \preceq ^{G^{{\mathcal {T}}}}\pi \}\) is the set of subpatterns of \(\pi \). We therefore define the number of instances for a pattern \(\pi \) as the number of relational assertions in \(inst(\pi )\), that is, the number of non-redundant relational assertions represented by \(\pi \) or its subpatterns.

Values for this statistic are always positive; moreover, \(\rho \in sub(\pi )\) implies that the number of instances for \(\pi \) is greater than or equal to the number of instances for \(\rho \). Note that inst is defined \(\forall \pi \in MPB\), but it can be easily extended to \(\forall \pi \in \{N^C \times N^P \times N^C\}\), i.e., the set of every possible pattern in \({\mathcal {T}}\). Such an extension would admit zero values, as there may exist some pattern \(\psi \) whose \(inst(\psi )\) is empty because it is too specific for the dataset. Moreover, extending this statistic would enable further analysis at many levels of abstraction (e.g., patterns external to the MPB).

Type occurrence count. The number of occurrences for a type C is the number of entities a such that \(C(a) \in {\mathcal {A}}^{C}\).

Property occurrence count. The number of occurrences for a property P is the number of relational assertions \(P(a,b) \in {\mathcal {A}}^{P}\).
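One possible reading of the instance count is sketched below in Python: the instances of a pattern are collected as the union of the non-redundant assertions attached to the pattern itself and to its subpatterns in the MPB. The input `minimally_represented` (a dictionary from MPB patterns to the assertions they minimally represent) and the closure encoding are assumptions of this example, not structures defined in the paper.

```python
def leq(closure, x, y):
    """x is below or equal to y in a poset given as element -> {self + ancestors}."""
    return y in closure.get(x, {x})


def instance_count(pattern, minimally_represented, type_closure, prop_closure):
    """Number of distinct assertions attached to `pattern` or its subpatterns."""
    c2, q, d2 = pattern
    inst = set()
    for (c, p, d), assertions in minimally_represented.items():
        # (c, p, d) is a subpattern of `pattern` under the product order.
        if (leq(type_closure, c, c2)
                and leq(prop_closure, p, q)
                and leq(type_closure, d, d2)):
            inst |= assertions
    return len(inst)
```

Since the union is taken over sets of assertions, an assertion reachable through several subpatterns is counted only once, and the count of a pattern is never smaller than that of any of its subpatterns.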

Pattern cardinality descriptors. Cardinality descriptors are divided into direct cardinality descriptors and inverse cardinality descriptors. Given a pattern \((C,P,D)\), the maximum (minimum, average) direct cardinality is the maximum (minimum, average) number of distinct entities of type C (in subject position) linked to a single entity of type D through the predicate P. Similarly, the maximum (minimum, average) inverse cardinality is the maximum (minimum, average) number of distinct entities of type D (in object position) linked to a single entity of type C through the predicate P. Intuitively, these descriptors tell us how the assertions represented by a pattern \(\pi =(C,P,D)\) are balanced in terms of links between individuals in subject position and individuals/literals in object position, in both directions through P.
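The descriptors can be sketched in Python by grouping the assertions of one pattern by object and by subject. The function below assumes that the input is a non-empty collection of `(subject, object)` pairs already selected for a single pattern; the function name and the output layout are illustrative.

```python
from collections import defaultdict
from statistics import mean


def cardinality_descriptors(pairs):
    """Direct and inverse (max, min, avg) cardinalities for one pattern.

    pairs: non-empty iterable of (subject, object) pairs for the pattern.
    """
    subjects_per_object = defaultdict(set)
    objects_per_subject = defaultdict(set)
    for s, o in pairs:
        subjects_per_object[o].add(s)
        objects_per_subject[s].add(o)
    # Direct: distinct subjects linked to a single object.
    direct = [len(v) for v in subjects_per_object.values()]
    # Inverse: distinct objects linked to a single subject.
    inverse = [len(v) for v in objects_per_subject.values()]
    return {
        "direct": (max(direct), min(direct), mean(direct)),
        "inverse": (max(inverse), min(inverse), mean(inverse)),
    }
```

A maximum direct cardinality that is far above the average, as with a placeholder object shared by thousands of subjects, is the kind of outlier that the dbo:keyPerson example in the Introduction illustrates.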

3 Profiling process

In this section, we describe the profiling process. First, we present the workflow to construct the summary and the respective statistics for each dataset. Second, we provide the profile creation using the relational model.

Fig. 2 Profiling workflow

3.1 Profile creation

The profiling workflow of ABSTAT is depicted in Fig. 2. In a first preprocessing step, the assertion set is extracted, the set \({\mathcal {A}}^C\) of typing assertions is singled out from the set of relational assertions \({\mathcal {A}}^{{\mathcal {P}}}\), and the terminology graph is created using the input terminology. We then perform three operations: type minimalization (over \({\mathcal {A}}^C\)) and property minimalization (over \({\mathcal {A}}^{{\mathcal {P}}}\)), to compute a minimal type set for each entity and remove property redundancy, and type inference to infer all the entity types. We extract minimal patterns and statistics and compute cardinality descriptors. We also use \({\mathcal {A}}^{{\mathcal {P}}}\) and inferred types to infer patterns along the subpattern relation and compute statistics that require inference.

Core-profiling consists of the preprocessing, type minimalization, and pattern calculation steps; full-profiling also includes all the other steps.

Fig. 3 From preprocessing to pattern calculation

Preprocessing. Preprocessing is explained with an example in Step 1 of Fig. 3.

Observe that relational assertions are further classified based on the type of the object in the assertion. Assertions with a named entity as object are called object relational assertions (e.g., < Cher genre Disco>), while assertions with a literal as object are called datatype relational assertions (e.g., < Cher alias "Cher Bono">). The terminology graph \(G^{\mathcal {T}}\) is built starting from the \(\texttt {rdfs:subClassOf}\) and \(\texttt {rdfs:subPropertyOf}\) relations specified in the terminology and is handled with a library for managing OWL2 ontologies.

The graph is then completed with external types, that is, types asserted in \({\mathcal {A}}^C\) and not included in \({\mathcal {T}}\), when computing the minimal types.

Type minimalization and property minimalization. For each individual x, we compute the set \(M_x\) of minimal types with respect to the terminology graph \(G^{\mathcal {T}}\) as exemplified in Fig. 3. Given x, we select all the typing assertions \(C(x) \in {\mathcal {A}}^C\) and form the set \({\mathcal {A}}^C_x\) of typing assertions about x. Please refer to our previous paper [46] for more details on the type minimalization algorithm.

In Step 2, ABSTAT performs type minimalization. Since Cher has two types, MusicalArtist and Artist, and MusicalArtist is a subtype of Artist, only the former is included in the patterns. If minimalization on properties is enabled, we also remove redundancies from \({\mathcal {A}}^P\).
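The Cher example can be reproduced with a small sketch. It is not the algorithm published in [46]; it only assumes that the terminology graph is available as a dictionary mapping each type to its direct supertypes (the names ancestors, minimal_types and subclass_of are illustrative).

```python
def ancestors(t, parents):
    """Return all strict supertypes of t reachable in the subclass graph."""
    seen, stack = set(), list(parents.get(t, ()))
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(parents.get(s, ()))
    return seen

def minimal_types(asserted, parents):
    """Keep the asserted types that are not a supertype of another asserted type."""
    asserted = set(asserted)
    redundant = set()
    for t in asserted:
        redundant |= ancestors(t, parents) & asserted
    return asserted - redundant

# Subclass edge taken from the running example: MusicalArtist is a subtype of Artist.
subclass_of = {"MusicalArtist": ["Artist"]}
m_cher = minimal_types({"MusicalArtist", "Artist"}, subclass_of)
```

Types that are not comparable in the graph are all kept, so an entity may have more than one minimal type.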

Consider Step 3 in Fig. 3. Since \(\texttt {alias} \preceq ^{G^{{\mathcal {T}}}} \texttt {alternativeName}\), the triple < Cher alternativeName "Cher Bono"> is considered redundant because < Cher alias "Cher Bono"> is also present in \({\mathcal {A}}^P\); therefore, it is removed.
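A possible reading of this step is sketched below, assuming the subproperty relation is given as a dictionary from each property to its direct superproperties. The helper names are hypothetical and the code is only meant to mirror the example, not ABSTAT's implementation.

```python
def superproperties(p, parents):
    """Return all strict superproperties of p in the subproperty graph."""
    seen, stack = set(), list(parents.get(p, ()))
    while stack:
        q = stack.pop()
        if q not in seen:
            seen.add(q)
            stack.extend(parents.get(q, ()))
    return seen

def minimize_properties(assertions, parents):
    """Drop P(x, y) whenever Q(x, y) is present and Q is a strict subproperty of P."""
    present = set(assertions)
    redundant = set()
    for (s, p, o) in present:
        for sup in superproperties(p, parents):
            redundant.add((s, sup, o))
    return present - redundant

sub_property_of = {"alias": ["alternativeName"]}
triples = [
    ("Cher", "alias", "Cher Bono"),
    ("Cher", "alternativeName", "Cher Bono"),
    ("Cher", "genre", "Disco"),
]
minimized = minimize_properties(triples, sub_property_of)
```

An alternativeName assertion with a different subject or object would be kept, because redundancy is checked per (subject, object) pair.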

Minimal pattern base. We then iterate over each relational assertion \(P(x,y) \in {\mathcal {A}}^P\) and get the minimal type sets \(M_x\) and \(M_y\). Finally, \(\forall C, D \in M_x, M_y\) a pattern \((C,P,D)\) is added to the minimal pattern base. Step 4 in Fig. 3 takes minimal types and relational assertions as input and computes the patterns. The MPB for the example in Fig. 3 is reported in the bottom box.
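The iteration described above can be sketched as follows, under the assumption that minimal types have already been computed and are available as a dictionary from each entity or literal to its set of minimal types. The type names MusicGenre and xsd:string and the function name are illustrative assumptions.

```python
from collections import Counter
from itertools import product

def minimal_pattern_base(relational_assertions, minimal_types):
    """Map each minimal pattern (C, P, D) to the number of assertions it represents."""
    patterns = Counter()
    for (x, p, y) in relational_assertions:
        # One pattern for every combination of a minimal subject type and object type.
        for c, d in product(minimal_types[x], minimal_types[y]):
            patterns[(c, p, d)] += 1
    return patterns

minimal = {
    "Cher": {"MusicalArtist"},
    "Disco": {"MusicGenre"},      # assumed type name
    "Cher Bono": {"xsd:string"},  # assumed datatype for the literal
}
assertions = [("Cher", "genre", "Disco"), ("Cher", "alias", "Cher Bono")]
mpb = minimal_pattern_base(assertions, minimal)
```

The counter values play the role of the pattern frequency: two assertions with the same minimal types and property increment the same pattern.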

Pattern inference. ABSTAT computes the subpattern relation by inferring the patterns that are more generic than the patterns included in the MPB.

Algorithm 1 presents the pseudocode for computing pattern inference. We start by initializing \(Inf_{Patterns}\) to \(\emptyset \) (line 1); then, for each relational assertion P(x, y), we calculate every inferable type for x and y and every inferable property for P (lines 3–5). Notice that at this point \(Inf_x\) includes \({\mathcal {A}}^C_x\), \(Inf_y\) includes \({\mathcal {A}}^C_y\) and \(Inf_p\) includes P. Finally, \(\forall C', D', Q \in Inf_x, Inf_y, Inf_p\) a pattern \((C',Q,D')\) is added to \(Inf_{Patterns}\) (lines 6–13). We keep track of the number of times a pattern is added to the \(Inf_{Patterns}\) set to obtain the number of instances.

In Fig. 4, Step 5 shows each entity with its inferred types until \(\texttt {Thing}\) (or \(\texttt {Literal}\) for literals) is reached. For example, \(\texttt {Funk}\) passes through \(\texttt {Genre}\) and \(\texttt {TopicalConcept}\) before it reaches \(\texttt {Thing}\). Note that, unlike in type minimalization, here we want to extract all the possible types of each entity with the support of \(G^{\mathcal {T}}\). In Step 6, for each assertion in \({\mathcal {A}}^P\) we get the entities’ inferred types and extract the superproperties for each property using \(G^{\mathcal {T}}\) to finally generate the inferred pattern set along with the number of instances. For example, <Cher genre Disco> generates 5 × 4 × 2 = 40 patterns (5 types for Cher, 4 types for Disco and 2 properties). Despite the huge number of patterns generated through pattern inference, we still keep only minimal patterns in our summaries, enriching their statistics with the number of instances calculated in this phase.
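The counting in the <Cher genre Disco> example can be checked with a short sketch. The hierarchy below is an assumption chosen only to give 5 types for Cher, 4 for Disco and 2 properties (the intermediate names Person, Agent, MusicGenre and relatedTo are hypothetical), and the code is an illustration rather than Algorithm 1 itself.

```python
from collections import Counter
from itertools import product

def closure(node, parents):
    """Return node together with everything reachable through the parent relation."""
    seen, stack = {node}, [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def infer_patterns(assertions, asserted_types, sub_class_of, sub_property_of):
    """Count how many assertions each inferred pattern (C', Q, D') represents."""
    counts = Counter()
    for (x, p, y) in assertions:
        inf_x = set().union(*(closure(t, sub_class_of) for t in asserted_types[x]))
        inf_y = set().union(*(closure(t, sub_class_of) for t in asserted_types[y]))
        inf_p = closure(p, sub_property_of)
        for pattern in product(inf_x, inf_p, inf_y):
            counts[pattern] += 1
    return counts

# Assumed hierarchy, for illustration only.
sub_class_of = {
    "MusicalArtist": ["Artist"], "Artist": ["Person"], "Person": ["Agent"],
    "Agent": ["Thing"], "MusicGenre": ["Genre"], "Genre": ["TopicalConcept"],
    "TopicalConcept": ["Thing"],
}
sub_property_of = {"genre": ["relatedTo"]}
types = {"Cher": {"MusicalArtist", "Artist"}, "Disco": {"MusicGenre"}}
inferred = infer_patterns([("Cher", "genre", "Disco")], types, sub_class_of, sub_property_of)
```

Each key of inferred is a (subject type, property, object type) triple, and its value is the number of instances contributed by the processed assertions.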

Fig. 4 Pattern inference, instances count and cardinality descriptors calculation workflow

Cardinality descriptors. Algorithm 2 takes as argument the set triples_AKP\(_{i}\), which contains the relational assertions that have AKP\(_i\) as minimal pattern. We start by creating two maps, subjects and objects, that will contain the counts for each subject and object, respectively (lines 1–2). Then, for each assertion P(x, y), we count subjects and objects and store this information in the subjects and objects maps (lines 3–10). Each entry in subjects tells us the number of distinct objects associated with the respective key; similarly, each entry in objects tells us the number of distinct subjects associated with the respective key. For direct cardinality descriptors, we calculate the maximum, minimum, and average of the values of the objects map (lines 11–13). For inverse cardinality descriptors, the maximum, minimum and average are calculated over the values of the subjects map (lines 14–16). We can think of cardinality descriptors as grouping \({\mathcal {A}}^P\) assertions by their minimal patterns, as depicted in Fig. 4 (Step 7). For each pattern, we can now extract statistics (Step 8) on subjects and objects and thus obtain the cardinality descriptors.
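A compact sketch of this computation follows. It is written from the definition of direct and inverse cardinality given in Sect. 2 rather than transcribed from Algorithm 2, and it assumes a non-empty list of (subject, property, object) triples that all share the same minimal pattern; the sample triples are invented.

```python
from collections import defaultdict
from statistics import mean

def cardinality_descriptors(triples):
    """Return (max, min, avg) direct and inverse cardinalities for one pattern."""
    subjects = defaultdict(set)  # subject -> distinct objects linked to it
    objects = defaultdict(set)   # object  -> distinct subjects linked to it
    for (s, _p, o) in triples:
        subjects[s].add(o)
        objects[o].add(s)
    # Direct: distinct subjects per object. Inverse: distinct objects per subject.
    direct = [len(v) for v in objects.values()]
    inverse = [len(v) for v in subjects.values()]
    return {
        "direct": (max(direct), min(direct), mean(direct)),
        "inverse": (max(inverse), min(inverse), mean(inverse)),
    }

# Invented assertions sharing one (assumed) minimal pattern.
triples_akp = [
    ("Cher", "genre", "Disco"),
    ("Cher", "genre", "Pop"),
    ("Cher", "genre", "Rock"),
    ("Madonna", "genre", "Pop"),
]
descriptors = cardinality_descriptors(triples_akp)
```

In this toy input, Pop is linked to two subjects while Cher is linked to three objects, which is what makes the direct and inverse maxima differ.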

3.2 Profile creation via relational model

The algorithms for profile creation shown in the previous section have linear complexity, which makes computing a profile for a very large dataset nearly impossible. For this reason, we adopt a relational model approach, which makes it possible to implement the algorithms by means of a highly scalable engine such as Spark SQL [4].

Let D(t, s, p, o, d) be the original dataset where each triple is enriched with two attributes: attribute t, which specifies the type of assertion (typing assertion, object relational assertion, or datatype relational assertion), and attribute d, which specifies the datatype of a given literal. In the preprocessing phase, we first create three new relations \(T_a,O_a,D_a\) where

$$\begin{aligned} T_a&=\sigma _{D.t="typing"}(\varPi _{e,t}(D)) \nonumber \\ O_a&= \sigma _{D.t="object"}(\varPi _{s,p,o}(D)) \nonumber \\ D_a&=\sigma _{D.t="datatype"}(\varPi _{s,p,o,d}(D)) \end{aligned}$$

where e, t, s, p, o, d represent entity, type, subject, property, object and datatype, respectively.
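As an illustration of this preprocessing step, the sketch below enriches raw triples with the t attribute and then splits them into the three relations. It assumes that a typing assertion is a triple whose property is rdf:type, that a literal object is recognized by a non-empty datatype, and that the (e, t) columns of \(T_a\) are the subject and object of the typing triple; these conventions and the function names are assumptions, not ABSTAT's code.

```python
RDF_TYPE = "rdf:type"

def enrich(s, p, o, d=None):
    """Attach the assertion type t to a raw triple (d is the literal datatype, if any)."""
    if p == RDF_TYPE:
        t = "typing"
    elif d is not None:
        t = "datatype"
    else:
        t = "object"
    return (t, s, p, o, d)

def preprocess(D):
    """Split the enriched dataset D(t, s, p, o, d) into T_a, O_a and D_a."""
    T_a = [(s, o) for (t, s, p, o, d) in D if t == "typing"]
    O_a = [(s, p, o) for (t, s, p, o, d) in D if t == "object"]
    D_a = [(s, p, o, d) for (t, s, p, o, d) in D if t == "datatype"]
    return T_a, O_a, D_a

D = [
    enrich("Cher", RDF_TYPE, "MusicalArtist"),
    enrich("Cher", "genre", "Disco"),
    enrich("Cher", "alias", "Cher Bono", "xsd:string"),
]
T_a, O_a, D_a = preprocess(D)
```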

Then, we apply the property minimalization UDF \(PM^{\mathrm{UDF}}\) to both \(O_a\) and \(D_a\), obtaining two new relations \(O^m_a, D^m_a\). The \(PM^{\mathrm{UDF}}\) function removes an assertion P(x, y) as redundant if there exists an assertion Q(x, y) with \(Q \preceq ^{P} P\).

The minimize UDF \(M^{\mathrm{UDF}}\) takes as input an entity e along with its types, calculates the minimal types mt, and generates a new relation \(T^m_a(e,mt)\). The explode UDF \(E^{\mathrm{UDF}}\) creates a new row for each type in the type attribute (see Fig. 5).

Minimal patterns are then calculated. For the sake of brevity, we describe the relational queries for object relational assertions only, as the process for datatype relational assertions is similar. The minimal patterns relation MP for the object relational assertions is defined by Queries 2 and 3.


In Query 2, the left outer join is applied between the object relational assertions \(O_a\) (subject attribute) and the minimal types table \(T_a^m\) (e attribute). This Cartesian product generates the new st attribute.

A similar procedure is applied to the object attribute. While joining the data, a projection removes the subject and object attributes. Next, we rename the minimalType in subjectType as st, and the minimalType in objectType as ot. This produces a relation where each tuple represents a pattern with subject type st, predicate p and object type ot. This relation will certainly contain duplicates. Observe that it already contains the minimal patterns; hence, given a tuple, the number of its duplicates corresponds to the number of assertions it represents.

In Query 3 the group-by operator \(\varGamma \) groups duplicate patterns (same subjectType st, property p and objectType ot) and then counting is performed. The result is the frequency freq of a given pattern.

$$\begin{aligned} \hbox {MP}&= \rho _{\mathrm{freq} \leftarrow \mathrm{COUNT}(st, p, ot)} (\varPi _{st, p, ot, \mathrm{COUNT}(st, p, ot)}\nonumber \\&\quad (\varGamma _{st,p,ot}(q2))) \end{aligned}$$

Fig. 5 The minimal types calculation
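To make the join and group-by steps of Queries 2 and 3 concrete, the sketch below runs an equivalent SQL statement on a toy instance. It uses Python's built-in sqlite3 module as a stand-in for Spark SQL, so it shows the relational logic only; the table name T_a_m (for \(T^m_a\)), the sample rows, and the MusicGenre type are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE O_a (s TEXT, p TEXT, o TEXT);   -- object relational assertions
    CREATE TABLE T_a_m (e TEXT, mt TEXT);        -- one row per (entity, minimal type)
    INSERT INTO O_a VALUES
        ('Cher', 'genre', 'Disco'),
        ('Cher', 'genre', 'Pop'),
        ('Madonna', 'genre', 'Pop');
    INSERT INTO T_a_m VALUES
        ('Cher', 'MusicalArtist'), ('Madonna', 'MusicalArtist'),
        ('Disco', 'MusicGenre'), ('Pop', 'MusicGenre');
""")

# Join subject and object with their minimal types, drop s and o, then count duplicates.
minimal_patterns = conn.execute("""
    SELECT subj.mt AS st, a.p AS p, obj.mt AS ot, COUNT(*) AS freq
    FROM O_a AS a
    LEFT OUTER JOIN T_a_m AS subj ON a.s = subj.e
    LEFT OUTER JOIN T_a_m AS obj ON a.o = obj.e
    GROUP BY subj.mt, a.p, obj.mt
""").fetchall()
conn.close()
```

Because the joins are left outer joins, assertions whose subject or object has no entry in the minimal types table are still counted, with a NULL type in the corresponding column.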

Queries 4 and 5 show the calculation of the cardinality descriptors, where x stands for the “MIN,” “MAX,” and “AVG” operators. Let \(\hbox {AKP}_{\mathrm{SO}}(\hbox {AKP},s,o)\) be a relation schema with a subject attribute, an object attribute, and an AKP attribute. The AKP attribute is obtained by string concatenation of the subject type, predicate and object type fields extracted from the MP relation created in Query 3.

$$\begin{aligned} \hbox {AKP}^o&= \rho _{\mathrm{count} \leftarrow \mathrm{COUNT}(o)}(\varPi _{\mathrm{AKP},o, \mathrm{COUNT}(o)}\nonumber \\&\quad (\varGamma _{\mathrm{AKP},o}(\mathrm{AKP}_{\mathrm{SO}})))\nonumber \\ \hbox {AKP}^o_{x}&= \rho _{x_o \leftarrow X(\mathrm{count})}((\varPi _{\mathrm{AKP},o, X(\mathrm{count})}\nonumber \\&\quad (\varGamma _{\mathrm{AKP}}(\hbox {AKP}^o)))) \end{aligned}$$

As for direct cardinality descriptors, let us consider a pattern \(\pi \) and the set \({\mathcal {A}}^{P}_{\pi }\) of relational assertions it minimally represents. Please note that every assertion in \({\mathcal {A}}^{P}_{\pi }\) has the same predicate. For each object of \({\mathcal {A}}^{P}_{\pi }\), the number of distinct subjects linked through the same predicate is calculated, and afterward the maximum, minimum and average are computed (\(\hbox {AKP}^s_{\mathrm{min}},\hbox {AKP}^s_{\mathrm{max}},\hbox {AKP}^s_{\mathrm{avg}}\)). The inverse cardinality is calculated in a similar way: the group-by occurs on the subject and counts the number of linked distinct objects. From this, we extract the maximum, minimum and average inverse cardinality.

$$\begin{aligned} \hbox {AKP}^s&= \rho _{\mathrm{count} \leftarrow \mathrm{COUNT}(s)}(\varPi _{\mathrm{AKP},s, \mathrm{COUNT}(s)}\nonumber \\&\quad (\varGamma _{\mathrm{AKP},s}(\hbox {AKP}_{\mathrm{SO}})))\nonumber \\ \hbox {AKP}^s_{x}&= \rho _{x_s \leftarrow X(\mathrm{count})}((\varPi _{\mathrm{AKP},s, X(\mathrm{count})}\nonumber \\&\quad (\varGamma _{\mathrm{AKP}}(\hbox {AKP}^s)))) \end{aligned}$$

Query 6 shows the join operations between the relations calculated in Queries 4 and 5, which create the final relation containing all cardinality descriptors associated with all patterns.

$$\begin{aligned} C&= \hbox {AKP}^s_{\mathrm{min}} \bowtie \hbox {AKP}^s_{\mathrm{max}} \bowtie \hbox {AKP}^s_{\mathrm{avg}} \bowtie \hbox {AKP}^o_{\mathrm{min}} \nonumber \\&\quad \bowtie \hbox {AKP}^o_{\mathrm{max}} \bowtie \hbox {AKP}^o_{\mathrm{avg}} \end{aligned}$$
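The sketch below illustrates the relational computation of cardinality descriptors on a toy \(\hbox {AKP}_{\mathrm{SO}}\) instance. It follows the prose definition (distinct subjects per object for the direct cardinality, distinct objects per subject for the inverse one) and uses Python's sqlite3 module in place of Spark SQL; the pattern key, the sample rows, and the column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE AKP_SO (akp TEXT, s TEXT, o TEXT);
    INSERT INTO AKP_SO VALUES
        ('MusicalArtist|genre|MusicGenre', 'Cher', 'Disco'),
        ('MusicalArtist|genre|MusicGenre', 'Cher', 'Pop'),
        ('MusicalArtist|genre|MusicGenre', 'Cher', 'Rock'),
        ('MusicalArtist|genre|MusicGenre', 'Madonna', 'Pop');
""")

row = conn.execute("""
    WITH per_object AS (   -- distinct subjects linked to each object
        SELECT akp, o, COUNT(DISTINCT s) AS cnt FROM AKP_SO GROUP BY akp, o
    ), per_subject AS (    -- distinct objects linked to each subject
        SELECT akp, s, COUNT(DISTINCT o) AS cnt FROM AKP_SO GROUP BY akp, s
    ), direct_card AS (
        SELECT akp, MIN(cnt) AS d_min, MAX(cnt) AS d_max, AVG(cnt) AS d_avg
        FROM per_object GROUP BY akp
    ), inverse_card AS (
        SELECT akp, MIN(cnt) AS i_min, MAX(cnt) AS i_max, AVG(cnt) AS i_avg
        FROM per_subject GROUP BY akp
    )
    SELECT d.akp, d_min, d_max, d_avg, i_min, i_max, i_avg
    FROM direct_card AS d JOIN inverse_card AS i ON d.akp = i.akp
""").fetchone()
conn.close()
```

The final join on the pattern key plays the role of the natural joins in Query 6, producing one row per pattern with all six descriptors.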

Finally, the pattern inference and instances count step is very similar to the minimal types and pattern calculation, but instead of computing the Cartesian product between minimal types, relational assertions, and minimal types, the Cartesian product is computed between inferred types, relational assertions and inferred types. A UDF \(I^{\mathrm{UDF}}\) uses a terminology graph. Each type encountered will become one of the inferred types, including the "seed" types. The result of the \(I^{\mathrm{UDF}}\) function is then exploded by means of \(E^{\mathrm{UDF}}\), producing the I(e, it) relation where, for each entity e, the inferred type it is reported.


Following the same approach for pattern calculation, we have Queries 7 and 8 with the difference that the pattern generated will not be minimal anymore. In this case the frequency statistic that is calculated coincides with the number of instances statistic.

$$\begin{aligned} PI&= \rho _{\mathrm{freq} \leftarrow \mathrm{COUNT}(st, p, ot)}\nonumber \\&\quad (\varPi _{st, p, ot, \mathrm{COUNT}(st, p, ot)}\nonumber \\&\quad (\varGamma _{st,p,ot}(q4))) \end{aligned}$$

3.3 Complexity

In this section, we first calculate the time complexity of the different stages of the workflow (Fig. 2) and then estimate the global complexity by considering the contribution of each stage to the whole workflow. We assume that the workflow is implemented using the standard Spark SQL library provided by the Apache Spark distributed processing engine (more implementation details are given in Sect. 4.2.3). Consequently, for the complexity of select–project–join operations, we adapt the cost model described in [6, 27], with some simplifications because we use a purely cloud-based solution.

The first operation is the creation of the relations described in Query 1, where three selection–projection sub-queries are performed. Let n be the number of triples stored in the relation D, and w be the number of workers (i.e., agents performing the query in a distributed fashion); the time complexity of Query 1 is \(\varTheta (n)\), i.e., the computation time depends linearly on the number of triples.

\(M^{\mathrm{UDF}}\) has a time complexity \(\varTheta (e)\) as it needs to populate the relation \(T^m_a(e,mt)\), while \(E^{\mathrm{UDF}}\) features a complexity \(\varTheta (\frac{mt}{e})\) because the function creates a new row if an entity e has more than one minimal type. Thus, the complexity for populating the relation \(T^m_a\) is equal to \(\varTheta (mt)\).

Query 2 comprises two left outer join queries. Subquery q1 is implemented in Spark SQL as a SortMergeJoin to limit memory consumption. According to [27], this join implementation has a time complexity equal to:

$$\begin{aligned} \varTheta (|O_a|,|T^m_a|)&= \varTheta _{s}(|O_a|)*\varTheta _{s}(|T^m_a|)*\nonumber \\&\quad \log \left( \frac{|O_a|}{w}\right) *\log \left( \frac{|T^m_a|}{w}\right) \nonumber \\&\approx \varTheta _{s}\left( \frac{n}{w}\right) *\varTheta _{s}\left( \frac{mt}{w}\right) *\log \left( \frac{n}{w}\right) *\log \left( \frac{mt}{w}\right) \end{aligned}$$

where \(\varTheta _s\) is the time complexity of the shuffle operation. Notice that the cardinality of q2 is the same as that of q1 (\(|q1|=|O_a|\)); thus, its complexity is given by (9) as well.

As described in Sect. 3.2, minimal patterns are computed by means of Query 3, which uses a group-by operator. Based on the analysis in [6], we can assume that its time complexity is \(\varTheta (\frac{n}{w})\).

To calculate pattern cardinality descriptors, we use Queries 4 and 5, which are aggregate and group-by queries over the relation \(\hbox {AKP}_{\mathrm{SO}}\). This relation has the same cardinality as the relation \(\hbox {MP}\) (\(|\hbox {MP}|\)). As a consequence, the time complexity for both queries is \(\varTheta (\frac{|\hbox {MP}|}{w})\).

The final step in the profiling workflow is the computation of inferred patterns and instance count. Queries 7 and 8 have a similar structure; thus, their complexity is \(\varTheta _{s}(\frac{n}{w})\varTheta _{s}(e)\log (\frac{n}{w})\log (\frac{e}{w})\).

When the dataset is very large, that is, when \(n\gg mt; n \gg e; n \gg |\hbox {MP}|\) we can conclude that the complexity of the overall profiling algorithm is:

$$\begin{aligned} \varTheta \left( \frac{n}{w}\log \left( \frac{n}{w}\right) \right) \end{aligned}$$
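To get a feeling for what this asymptotic term implies when workers are added, the following sketch evaluates \(\frac{n}{w}\log (\frac{n}{w})\) for a fixed n and a few values of w. It ignores constant factors and the cost of shuffling, so the ratios are only an indication of the trend of the formula and not a prediction of measured running times; the function name and the chosen n (roughly the 566M triples of the dbp-2014 dataset used in Sect. 5) are illustrative.

```python
import math

def relative_cost(n, w):
    """Evaluate (n / w) * log(n / w), i.e., the asymptotic term up to a constant."""
    per_worker = n / w
    return per_worker * math.log(per_worker)

n = 566_000_000  # approximate number of triples, used only as an example
baseline = relative_cost(n, 1)

# Ratio between the single-worker term and the w-worker term.
speedups = {w: baseline / relative_cost(n, w) for w in (1, 2, 4, 5)}
```

Since the logarithmic factor also shrinks with w, the ratio for w workers is slightly larger than w in this idealized model.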

4 ABSTAT: highly distributed

This section presents the architecture of ABSTAT-HD. We first describe the logical architecture behind the profiling process and then provide an analysis of the scalability issues that served as motivation for the distributed version. Then, we present the new distributed builder and its deployment.

4.1 Architecture

The diagram reported in Fig. 6 is to be considered a minimal representation of the ABSTAT logical architecture; thus, inessential components like the ones in charge of authentication and authorization activities are not reported. ABSTAT architecture is modular so that it can benefit from the Service Oriented Architecture model and the main components are:

  • Viewer, which provides a graphical user interface to interact with ABSTAT functionalities. A configuration wizard drives the user in the choice/upload of datasets/ontologies along with a configuration setup for further processing. Once the execution has ended, the user can explore computed profiles using the interface for constrained queries (requesting, for instance, the desired subject with or without a predicate and object) and full-text search. In addition, controls for the managing of profiles, datasets, and ontologies are provided.

  • Builder, which is the core module that executes the profiling algorithms. It takes as input a dataset (in N3 format) and possibly an ontology (in OWL format) along with the user’s profile configuration. The configuration received by the Builder contains all the user choices about which steps to execute in the profiling pipeline. Both input and output profiles are saved in the Data Lake.

  • Data Loader, whose main task is to feed the storage engines intended for user consultation. After a semantic profile is computed, the Data Loader reads the profile from the Data Lake (internal data model) and maps it in a suitable way for uploading into databases (e.g., MongoDB) or indexing into search engines (currently exploiting Apache Solr). Furthermore, it also creates a copy of the input datasets in a Virtuoso triple store.

  • Explorer, which exposes a collection of APIs to support profile exploration requests from Viewer or authenticated third-party applications. Examples of such APIs include the Browse API, which provides a subject (predicate, object) constraint consultation of the profiles, the Search API for full-text search functionality over patterns, concepts, and properties, the Autocomplete API for concept/predicate suggestion based on our patterns and, finally, the Validate API, which allows the user to inspect pattern instances with possible data quality issues.

Fig. 6 ABSTAT architecture

The modularity of this architecture enhances the flexibility of the components during deployment and their maintainability, which are central features for further extensions.

4.2 ABSTAT-HD builder

In the following, we first discuss the main issues posed by the previous centralized architecture; then, we present the new distributed Builder and the Big Data framework adopted to support its deployment and execution.

4.2.1 Scalability issues

In its current state, ABSTAT can compute profiles for small datasets and allows users, even on commodity hardware, to compute profiles for their confidential data. However, the complexity of the statistics to be calculated (especially the new ones, viz. cardinality and inference) makes ABSTAT unsuitable for processing complex and large datasets with many millions of triples. These considerations, combined with the awareness that the size and complexity of the LOD Cloud assets are continuously growing, led us to redefine the system with the aim of achieving horizontal scalability.

Fig. 7 ABSTAT-HD: architecture and deployment

By inspecting the ABSTAT architecture, we identified two components that primarily influence the system scalability, namely, the Builder and the Data Loader. The former is involved in creating profiles; the latter writes and reads data from the Data Lake. The Data Loader scalability issues relate to the data ingestion process; they are not addressed in this work, as there are plenty of production-grade Big Data solutions for efficiently moving large amounts of data (e.g., Apache Flume).

As for the Builder and the related computation bottleneck issue, we worked to overcome the limitations of the centralized approach.

The Builder implements the workflow in Fig. 2. Each element of this pipeline is implemented separately and in a multi-threaded manner (but with centralized synchronization points); moreover, the code has been optimized as much as possible over the years. We discarded the option of reimplementing the ABSTAT codebase in a more efficient programming language (like C) in order to reuse as much of the available code as possible. We also experimented with parallelizing the file scan (dividing it into chunks) to eliminate the bottleneck due to sequential disk access. Still, the results were not satisfactory due to severe disk contention. The use of increasingly powerful machines did not solve resource saturation for larger datasets, either. For this reason, we decided to re-design the Builder component according to the manager–workers model and execute it in a distributed fashion on a collection of machines, where the workers perform the computation in parallel while the manager supervises the execution. To implement this approach, we preferred a mature, well-regarded, and actively developed Big Data solution that guarantees maintainability, flexibility, security, and high-level languages to describe the processing pipeline. In particular, these tools offer an off-the-shelf replicated and distributed data lake, the management of computational resources, seamless data shuffling, automatic application deployment, data locality aware task scheduling, and a workflow optimization mechanism. More details on the Big Data environment underpinning ABSTAT-HD are reported in Sect. 4.2.3.

Fig. 8 An overview of ABSTAT-HD Builder’s internals

4.2.2 Extended components

ABSTAT-HD (whose general architecture and deployment are depicted in Fig. 7) addresses and solves the limitations of the previous version. As mentioned above, we decided to exploit a Big Data environment to manage a cluster of machines and run the computation in a distributed fashion. Such an environment consists of an ecosystem of several interacting components; for simplicity, we will only mention the Data Lake, the Resource Manager, and the Application Framework. Data Lake is in charge of partitioning, distributing, and managing datasets to increase locality and reduce skewness during processing. Resource Manager is responsible for managing, partitioning, and assigning cluster resources to the applications that require them. Finally, Application Framework consists of a set of classes and libraries to create applications compatible with the particular platform, i.e., communicating with Resource Manager to request the necessary resources and control the program execution flow. Consequently, we adapted some components (namely, Builder and Data Loader) and created a new one (Submitter) to interact with Resource Manager. Other architectural components, like Viewer and Explorer, are kept unchanged. A brief description of those components follows:

  • Submitter is a new component. It is a service that implements the interface of the old Builder. Its job is to receive requests from Viewer and submit the Builder executable to the cluster through Resource Manager (green arrow 1 in Fig. 8). Resource Manager distributes the Builder code to the cluster nodes (green arrows 2 in Fig. 8) and executes it. Submitter also checks on the Builder status and exposes it via an API.

  • Builder has been completely reworked; it is no longer a long-running service but a manager–worker distributed application that executes only for the time strictly necessary to calculate the profile of a single dataset. The main component of the new Builder architecture (Fig. 8) is the Application Master, which is in charge of task management. The worker component, called Agent, in turn receives and executes tasks on the dataset. In more detail, the Builder Application Master interacts with Resource Manager (blue lines in Fig. 8) to negotiate access to resource containers (RAM and CPU shares) on the cluster nodes. Within each container, a Builder Agent is then executed. Each Agent receives from the Application Master a set of tasks to perform on specific dataset chunks; partial and final results are stored in a cache structure, but they can be spilled to disk if necessary. The Application Master consists of a set of modules, the most prominent of which is the Driver, which is in charge of managing the interface with Resource Manager and Agents. It implements the summarization and statistics calculation algorithms by exploiting a data frame abstraction and an SQL-like engine over it (both provided by the Application Framework); thus, a dataset is viewed as a relational table that can be manipulated using relational-algebra operators (Expressions (1) to (8)). A query optimizer module manipulates relational queries (using techniques such as filter pushdown, indexes, bucketing, and join type selection, among others) to execute them more efficiently. The resulting query is then compiled into tasks forming a Directed Acyclic Graph (DAG); this structure is analyzed to identify tasks that can be performed in parallel (stages) and data shuffle operations. Eventually, groups of tasks are sent to the Builder Agents by the TaskScheduler module for execution (purple lines in Fig. 8). A data shuffle operation is performed at the end of each stage; this is done by sorting local data and distributing them to the other agents according to a partition key (red lines in Fig. 8). Ultimately, the results are returned to the Builder Application Master or persisted by Data Lake. Finally, note that the Application Master is forced to run on a node separate from those of the Agents, co-located with Submitter and Data Loader, to reduce skewness.

  • Resource Manager is a component provided by the Big Data environment. It is in charge of managing the available resources of the underlying cluster, considering several factors ranging from data locality to cluster-level load balance. An application running into the cluster has to interact with this component to access the required resources.

  • Data Loader is made compatible with the distributed Data Lake (e.g., HDFS) to both write the datasets and read the profiles to be indexed.
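The data shuffle mentioned in the Builder description (sorting local data and distributing them according to a partition key) can be pictured with the toy sketch below. It is not how the underlying framework implements shuffling: the hash function, the function name, and the sample rows are assumptions chosen to keep the example deterministic and self-contained.

```python
import zlib
from collections import defaultdict

def shuffle(local_rows, key_fn, num_agents):
    """Sort the local rows and route each one to the agent selected by its key."""
    outgoing = defaultdict(list)
    for row in sorted(local_rows, key=key_fn):
        # A stable hash, so that equal keys always reach the same agent.
        target = zlib.crc32(key_fn(row).encode("utf-8")) % num_agents
        outgoing[target].append(row)
    return dict(outgoing)

# Invented rows of the form (pattern key, subject).
rows = [
    ("MusicalArtist|genre|MusicGenre", "Cher"),
    ("MusicalArtist|alias|xsd:string", "Cher"),
    ("MusicalArtist|genre|MusicGenre", "Madonna"),
]
partitions = shuffle(rows, key_fn=lambda r: r[0], num_agents=3)
```

Routing by key is what allows a subsequent group-by on that key to run independently on each agent, because all rows of a group end up in the same place.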

4.2.3 Big data environment

Our choice for the ecosystem/framework to achieve horizontal scalability fell on Apache Hadoop with Apache Spark as processing engine and application framework. The Apache Hadoop stack enables the distributed processing of large datasets across computer clusters using high-level programming models. It is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers. The data layer of Apache Hadoop is the Hadoop Distributed File System (HDFS). HDFS splits files into large blocks and distributes them across the cluster nodes. The Hadoop resource manager, YARN, then transfers the application code to the nodes for parallel execution. This approach is data locality aware, that is, nodes mainly manipulate the data they have direct access to.

Apache Spark has been selected as a distributed computing framework since, unlike the default compute engine Apache MapReduce, it uses node memory better, reducing disk spill and achieving reduced computation times.

Moreover, since the RDF data model can be easily mapped onto a relational table (data frame) with columns “subject,” “predicate,” “object” and optionally “datatype” (as described in Sect. 3.2), the Spark application framework comes in handy, as it allows the application to process a dataset in a (distributed) relational fashion, greatly simplifying the development of the Builder component. In particular, a very convenient component of Spark is the Spark SQL API [4], which offers an SQL interface over (semi)structured data. Adopting a relational approach for profile creation enabled us to implement the Builder logic using nothing but SQL expressions (derived from (1) to (8)). On the one hand, this yields a robust, maintainable, and efficient application that complies with best practices (highly optimized code is generated by the Catalyst optimizer [4]); on the other hand, it spares us the headaches typical of distributed data processing systems, such as code generation/optimization, data locality aware task distribution, skewness, and resilience management.

5 Evaluation

This section presents the experiments to evaluate ABSTAT-HD performance. First, we introduce the experimental setup, which comprises the datasets, the environment for the experiments, and the workload. Second, we present the performance analysis considering different configurations of the experimental setup and, finally, we discuss the results for each configuration along with a detailed report on the potential errors detected in the Microsoft Academic Knowledge Graph.

5.1 Experimental setup

The experimental setup of ABSTAT-HD is in line with the setup used in the only other state-of-the-art approach that distributes the computation of knowledge graph profiling, namely, DistLODStat [43], where the scalability of the distributed and centralized versions of the same system are compared.

5.1.1 Datasets

For our experiments, we use two very large and well-known datasets: DBpedia and Microsoft Academic Knowledge Graph. DBpedia is one of the most important datasets of the Semantic Web community, as it contains real, large-scale data and is complex enough, with 760 types and 2865 properties. It has a documented schema that can be easily downloaded. For DBpedia we consider three versions of different sizes: dbp-2014\(_{566M}\) (full dataset), dbp-2015\(_{47M}\) (the following chunks: types, mapping-based literals, objects and properties about person data only), and dbp-2016\(_{2.75B}\) (all available chunks except for those with the label *sorted).

The second dataset is Microsoft Academic Knowledge Graph (makg in the following). We considered this dataset as it is a very large KG, containing information about scientific publications and related entities, such as authors, institutions, journals, and fields of study. It contains 8 types and 57 properties; thus, its schema is not as complex as that of DBpedia. In order to have datasets of different sizes but the same complexity, we created the following samples: makg\(_{2.11B}\) (including the following chunks: Authors and Paper Authors Affiliations), makg\(_{2.39B}\) (including only Papers), makg\(_{6.36B}\) (all chunks except for Abstracts, URLs and Paper References) and makg\(_{7.74B}\) (all chunks except for Abstracts and URLs). Finally, makg\(_{8.19B}\) represents the full dataset.

The list of used datasets and their respective statistics about size in terms of GB and number of triples and number of types/entities is shown in Table 1.

Table 1 Evaluation datasets

5.1.2 Experimental setting

The experiments reported in this paper have been performed by deploying the ABSTAT-HD Builder component on a Microsoft Azure Virtual Machine (VM) cluster. In particular, the cluster consists of Standard_D13_v2 VMs featuring 8 virtual CPUs and 56 GiB of RAM; one VM (replicated for availability) acts as a manager node, while the number of worker nodes is varied from 1 to 5 (with autoscale disabled). The cluster runs the Azure HDInsight platform, based on Apache Hadoop 3.1 and Apache Spark 2.4. Regarding the data store, the cluster uses Azure Blob Store, which also implements the HDFS API. The resource manager is Apache YARN, which has been configured to run a single queue of jobs; in this way, a job can use all cluster resources. Every other configuration parameter has been left at its default value. In addition, a VM with the Standard_D13_v2 configuration and a 1 TB HDD was used for comparisons with the original ABSTAT builder.

The campaign of experiments aimed at proving the scalability of ABSTAT-HD has been carried out considering all datasets, the two workloads (core-profiling and full-profiling), and a number of worker nodes varying in the set \(\{1,2,4,5\}\). Each experiment, identified by the triple (# nodes, core/full-profiling, dataset), has been repeated three times and the average time calculated.
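The experimental campaign described above can be sketched as a simple grid of configurations. The following is only an illustrative sketch: the dataset labels are shorthand for the samples introduced earlier, and `run_once` is a hypothetical callable (not part of ABSTAT-HD) that would launch one profiling job and return its elapsed time in minutes. Not every configuration of the grid necessarily completes on the real cluster.

```python
from itertools import product
from statistics import mean

# Factors taken from the text; dataset labels are shorthand for the samples.
WORKER_NODES = [1, 2, 4, 5]
WORKLOADS = ["core-profiling", "full-profiling"]
DATASETS = [
    "dbp-2015_47M", "dbp-2014_566M", "dbp-2016_2.75B",
    "makg_2.11B", "makg_2.39B", "makg_6.36B", "makg_7.74B", "makg_8.19B",
]
REPETITIONS = 3  # each experiment is repeated three times


def experiment_grid():
    """Enumerate every (nodes, workload, dataset) triple of the campaign."""
    return list(product(WORKER_NODES, WORKLOADS, DATASETS))


def average_runtime(run_once, config, repetitions=REPETITIONS):
    """Run one configuration several times and average the elapsed minutes.

    `run_once` is a hypothetical function taking (nodes, workload, dataset)
    and returning the elapsed time in minutes for a single run.
    """
    return mean(run_once(*config) for _ in range(repetitions))
```

With these factors the grid contains 4 x 2 x 8 = 64 configurations, each executed three times.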

Besides proving the scalability of the new Builder, we also show the importance of pattern minimalization on types and properties. To this end, we compare a full-profiling process with and without minimalization in terms of execution time and number of patterns generated. Furthermore, a short experiment has been carried out on the original ABSTAT Builder implementation to assess its scalability.

5.1.3 Workload

Our experiments aim to evaluate the performance of ABSTAT-HD along three main orthogonal dimensions: (i) size, (ii) complexity, and (iii) profiling type.

The dataset size is considered a critical property in determining the performance of any profiling tool. Typically, large datasets need more time to be processed than small ones. Despite continuous debates and efforts, there is still no agreed definition of what constitutes a small dataset. In this paper, we do not give a definition of dataset size (i.e., we do not categorize datasets as small, medium, and large) but consider datasets with an increasing number of triples. The smallest dataset with respect to the number of triples is a subset of DBpedia 2015 with only \(\sim 47M\) triples, while the biggest dataset is the full version of makg with \(\sim 8.2B\) triples (Table 1).

The second dimension considered in our experiments is the complexity of the dataset. For the purpose of the summarization approach, the complexity of a dataset is measured in terms of different features that affect different phases of the generation of the profile. For this dimension we consider: (i) the number of distinct entities, which affects the computation of the cardinality descriptors, (ii) the number of types per entity, which affects the size of the minimal pattern base, and (iii) ontology features such as the number of subclass and subproperty relations, which affect the calculation of minimal types and of inference. Table 2 outlines these features for all datasets.

Table 2 Evaluation ontologies

Notice that the number of types per entity for each dataset is very useful to evaluate the performance of ABSTAT-HD with respect to the complexity of the dataset being profiled. Observe that the versions of DBpedia are more complex than those of makg. The terminology graph of DBpedia has more types and properties, and the average length of its taxonomy branches is greater than in the makg terminology graph. In fact, the makg ontology has no subtype or subproperty relations. Moreover, the overall number of types and properties is almost 40 times smaller for makg. For all these reasons, the DBpedia datasets are more complex.

Finally, the third dimension considers the processing load, which is addressed by including different sets of profiling features to compute. The whole profiling pipeline is presented in Fig. 2. We consider two profiling configurations: (i) core-profiling, which includes features such as minimalization on types and frequency statistics for types, predicates and patterns, and (ii) full-profiling (the whole set of features). In this way, we created two processing loads that require different efforts.

5.2 Performance analysis

In this section, we analyze how the scalability of ABSTAT-HD is affected by the three considered dimensions (size, complexity, and profiling type).

Figures 9 and 10 show how the size of the dataset affects the scalability. In both figures each curve corresponds to a different sample of the two selected KGs. The figures show how ABSTAT-HD performance changes as a function of the number of worker nodes. It is clear that the time required to complete the computation roughly halves when the number of worker nodes doubles. In Fig. 9, dbp-2014 requires \(\sim 96.3\) min to be profiled using one worker node, \(\sim 47\) min on 2 worker nodes, \(\sim 25.3\) min on 4 worker nodes and, finally, \(\sim 20\) min on 5 worker nodes. This behavior is maintained regardless of the type of profiling selected and the complexity of the dataset. The same trend is shown in both core-profiling (Fig. 10) and full-profiling (Fig. 11) plots for both DBpedia and makg. These experimental results confirm the complexity analysis shown in Sect. 3.3.

Fig. 9 Scalability on worker nodes for DBpedia (core-profiling)

Fig. 10 Scalability on worker nodes for MAKG (core-profiling)

Fig. 11 Scalability on worker nodes for MAKG (full-profiling)

The comparison of the performance of the same sample of makg considering the two different types of profiling is shown in Fig. 12; it is evident that the full-profiling requires more time than the core one. This behavior can be explained by considering that in the full-profiling a bigger set of statistics is calculated for each minimal pattern according to Queries 5 and 8.

Fig. 12 Difference of profiling types

To assess the scalability of ABSTAT-HD as a function of the dataset complexity, we consider dbp-2016\(_{2.75B}\) and makg\(_{2.39B}\). These two samples have approximately the same number of triples but very different complexity (Table 2). Figure 13 reports the scalability of the two datasets as a function of the number of worker nodes. Notice that the complexity of the dataset affects the slope of the curve: the more complex the dataset, the worse the performance, but in any case the curve still shows good scalability. As a last remark, for big and complex datasets such as DBpedia it is impossible to compute any type of profile in a non-distributed way. Our experiments show that at least 4 worker nodes are needed to profile this type of dataset.

Fig. 13 Complexity of datasets and scalability

Fig. 14 Size scalability for DBpedia and makg (core-profiling)

Figure 14 shows the performance of ABSTAT-HD with respect to the size of the datasets for core-profiling computed on 5 worker nodes. The time grows linearly with the size dimension. Moreover, despite the complexity of DBpedia (the blue curve), the time needed grows linearly as it does for makg. This means that the relevance of complexity declines when working on many nodes, as already discussed in Sect. 3.3.

To conclude, the analysis about the scalability shows that ABSTAT-HD scales well considering all three dimensions (size, complexity, and type of summarization).

Another important question for a better understanding of ABSTAT-HD is which of the three dimensions are the most time-consuming in the summarization process. By analyzing the results reported in Figs. 9 and 10, it is possible to confirm that:

  1. Concerning the size of the dataset, given a fixed number of worker nodes (e.g., 4 or 5), there is a linear correlation between the number of triples to analyze and the time required to process them.

  2. The complexity of the dataset seems to have only a limited impact on performance. In fact, considering the samples makg\(_{2.39B}\) and dbp-2016\(_{2.75B}\) (see Table 2), despite the fact that the latter has an ontology that is 40 times bigger, the time needed for its full-profiling with 5 worker nodes is only about two times greater than the corresponding time for makg\(_{2.39B}\).

  3. Independently of the size and complexity dimensions, the full-profiling of a given dataset requires up to 3 times more time than the core-profiling. This is due to the fact that a greater set of statistics is computed for the full-profiling (Queries 5 and 8).

5.3 ABSTAT-HD versus related work

This section analyzes the impact of minimalization on execution time and number of generated patterns. To this end, we compare ABSTAT-HD and ABSTAT over selected tasks and datasets.

The main hypothesis behind the introduction of minimalization is that, when the input KG includes several assertions inferred from deep hierarchies of types and properties defined in the ontology, minimalization significantly reduces the number of patterns. We test this hypothesis by executing full-profiling with and without type and property minimalization on the dbp-2014\(_{566M}\) KG; in the first case, the number of generated patterns is 1,636,629 and the time required for the process to complete is 71.3 min, while in the second case we get 2,919,869 patterns calculated in 94.4 min. Thus, minimalization has a significant impact: it nearly halves the number of generated patterns (a reduction of about 44%) and shortens the execution time by about 24%, i.e., a speed-up of roughly \(1.3{\times }\).
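The effect of minimalization can be quantified directly from the figures reported above. The short sketch below only recomputes ratios from those four numbers; the function and variable names are illustrative.

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is smaller than `baseline`."""
    return (baseline - improved) / baseline


# Values reported for full-profiling on dbp-2014 (566M triples).
patterns_without_min = 2_919_869
patterns_with_min = 1_636_629
minutes_without_min = 94.4
minutes_with_min = 71.3

# Share of patterns pruned by minimalization.
pattern_reduction = relative_reduction(patterns_without_min, patterns_with_min)
# Share of execution time saved by minimalization.
time_reduction = relative_reduction(minutes_without_min, minutes_with_min)
# Speed-up factor: time without minimalization over time with it.
speedup = minutes_without_min / minutes_with_min

print(f"patterns pruned: {pattern_reduction:.1%}")
print(f"time saved:      {time_reduction:.1%}")
print(f"speed-up:        {speedup:.2f}x")
```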

Table 3 reports the results of another experiment in which we have run ABSTAT-HD and ABSTAT on a single-node cluster. The results show that, under the same conditions, ABSTAT-HD performs significantly better than ABSTAT.

The average ratio between the time elapsed for full-profiling and for core-profiling is roughly \(3{\times }\) in ABSTAT-HD, while for ABSTAT it is about \(42{\times }\). Furthermore, we can see that ABSTAT-HD can be up to \(\sim 9{\times }\) faster than ABSTAT for core-profiling and up to \(\sim 35{\times }\) faster for full-profiling.

We have further compared ABSTAT-HD with state-of-the-art approaches for which similar settings were used in the original papers. In fact, the experimental settings of ABSTAT-HD are in line with those used by the only other comparable approach, namely DistLODStat [43]. Even though DistLODStat and ABSTAT-HD have different scopes (their final outputs do not provide the same information) and cannot be directly compared, we provide a comparison in terms of the size of the datasets processed in the experiments. The largest dataset handled by ABSTAT-HD (makg, 1,183 GB) is much larger than the datasets considered by the approaches proposed in the literature. In particular, according to [43], it is approximately 6 times larger than the largest dataset processed by DistLODStat (i.e., 200 GB). Other profiling approaches, such as [20], experimented with real and synthetic graphs of up to 36.5 GB (approx. 32 times smaller than makg), while [17] is evaluated on 6 datasets, the biggest of which has a size of 56 MB (approx. 21,125 times smaller than makg).

Table 3 Performance comparison between ABSTAT and ABSTAT-HD (1 node). Time is expressed in minutes

5.4 Discussion on the results

Despite the lack of some data points for the heaviest computations, it is still clearly visible in Figs. 10, 11 and 14 that the processing time is a linear function of the number of triples in the dataset (dataset dimension). Nonetheless, there are some interesting highlights. First, the rapidly increasing slope of the curves beyond 8 billion triples in core-profiling (Fig. 10) and full-profiling (Fig. 11) indicates that we have reached the limit of the cluster capabilities for any number of nodes. In particular, during the summarization, ABSTAT-HD performs large joins. When such joins are performed on tens of billions of triples, Spark workers write intermediate data to disk during shuffles; if the disk space is not sufficient, the job fails with an error. This is reflected by the dbp-2016\(_{2.75B}\) dataset in Fig. 9, for which it was not possible to compute core-profiling with only one worker node. For the makg\(_{6.36B}\) dataset, instead, this is possible: even though the number of triples more than doubles, its terminology graph is simpler. Also in this case, the join cost makes the difference: in the DBpedia datasets, joins are more expensive since an ontology with many types generally leads to more minimal types per entity. The effect of the dataset complexity can also be noticed in the gap between the slopes in Fig. 13. DBpedia takes more time to be processed, and this is more evident as the processing load increases. In fact, for all datasets, the slope gap between pairs of curves is higher for full-profiling. In conclusion, as far as performance is concerned, dataset complexity is not a concern if the disk space is large enough to support a large number of joins.

We can obtain useful observations by using the results plots in a more practical way. Let’s consider two use cases:

  (i) A user has already deployed ABSTAT-HD on an n-node cluster and wants to know how its performance changes if the input dataset grows. Figures 10 and 11 show how the time needed to profile datasets changes with respect to the dataset size for a given configuration and processing load.

  (ii) A user has already deployed ABSTAT-HD on n-node clusters and wants to profile datasets of similar size. She/he wants to know how the computation performance changes with respect to the number of nodes. Figures 10 and 11 show that, regardless of the size of the dataset, if the cluster takes time t with one node (\(n=1\)), it takes approximately t/n with n nodes. Therefore, \(t_{(s,n)} = t_{(s,1)}/n\), where s is the dataset size and n is the number of nodes.
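As a rough planning aid, the relation \(t_{(s,n)} = t_{(s,1)}/n\) from use case (ii) can be turned into a small estimator. This is a minimal sketch of the ideal model only: it ignores scheduling and shuffle overheads, and the function names and the sample timings used below are hypothetical rather than measured values.

```python
def ideal_time(t_one_node, nodes):
    """Estimate t(s, n) = t(s, 1) / n for a dataset of fixed size s."""
    if nodes < 1:
        raise ValueError("the cluster needs at least one worker node")
    return t_one_node / nodes


def parallel_efficiency(t_one_node, t_n_nodes, nodes):
    """Measured speed-up divided by the ideal speed-up n (1.0 means ideal)."""
    return (t_one_node / t_n_nodes) / nodes


# Hypothetical example: a job that takes 100 minutes on one worker node.
estimate_4_nodes = ideal_time(100.0, 4)                 # 25.0 minutes
efficiency_4_nodes = parallel_efficiency(100.0, 27.5, 4)  # below 1.0
```

Comparing a measured time against `ideal_time` through `parallel_efficiency` gives a quick indication of how far a given configuration is from the ideal linear behavior.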

Concerning the impact of minimalization, experiments on dbp-2014\(_{566M}\) demonstrate that including minimalization within the overall profiling process leads to a better MPB compression and a reduction in computation time. Minimalization is very effective in pruning the pattern space when the terminology graph \(G^{{\mathcal {T}}}\) is rich in types, properties, and subclass/subproperty relations, and when entities have multiple types, many of which are redundant. Furthermore, reducing the number of minimal patterns for which frequency and cardinality descriptors are computed also reduces the execution time and memory usage for computing these statistics. Queries (2) and (4) show the relation tables for frequency and cardinality calculation, whose dimensions depend on the number of patterns, types, and types per entity. Thus, in cases where a cluster has reached its maximum memory capacity by executing full-profiling, minimalization can help to reduce the number of patterns and make the whole computation more feasible. Therefore, we can conclude that minimalization reaches its maximum effect on KGs that include the transitive closure of types on typing assertions and the transitive closure of properties on relational assertions (thus having multiple redundant relational assertions), and that use rich ontologies.

As reported in Table 3, the large gap in execution time between core-profiling and full-profiling for ABSTAT is caused by intense I/O, sorting, and bucketing operations for instance count and cardinality calculation (which are present only in the full-profiling workload). It is also evident that ABSTAT-HD is much faster than ABSTAT in both workloads, arguably due to the in-memory distributed computation and query optimization offered by the Spark Framework.

5.5 Potential errors detected in the MAKG

This section summarizes some of the potential errors detected in the makg by exploring the profile produced by ABSTAT. As shown in Table 1, this KG has 57 properties and 8 types defined within the ontology of makg.Footnote 16 Moreover, it uses 5 external types from the fabio ontologyFootnote 17 (Book, BookChapter, ConferencePaper, JournalArticle, and PatentDocument) and 25 external properties (from ontologies fabio, purl,Footnote 18 cito,Footnote 19 dbpedia, etc.). The Microsoft Academic Knowledge Graph maintainers have also published the schemaFootnote 20 as an easier way to visualize relations among types and datatypes. From this schema, a user can easily notice that the KG makes use of two owl:equivalentClass relations: one between makg:FieldOfStudy and fabio:SubjectDiscipline and the other between makg:Paper and fabio:Work. However, both relations are present only in the schema depicted on their website and are missing in the OWL ontology. All the external types from the fabio ontology used in the dataset (Book, BookChapter, ConferencePaper, JournalArticle, and PatentDocument) are subtypes of the class fabio:Expressions. Intuitively, such types refer to subtypes of Paper, which in the fabio ontology are under fabio:Expressions and not under fabio:Work. Thus, we can deduce that the equivalence relation between makg:Paper and fabio:Work is wrong. Instead, the equivalence relation should be between makg:Paper and fabio:Expressions.

A second problem that clearly emerged thanks to the patterns produced by ABSTAT is related to the domain and range restrictions. In the ontology, the predicate makg:citationCount has the type Author as its defined domain and an integer as its range. However, its usage in the dataset does not respect this definition. Indeed, there are 12 patterns in the data that have makg:citationCount as predicate. Such patterns have in the subject position types such as makg:Author, makg:Affiliation, makg:ConferenceInstance, makg:ConferenceSeries, makg:FieldOfStudy, makg:Journal, makg:Paper, makg:Book, makg:BookChapter, fabio:ConferencePaper, fabio:JournalArticle and fabio:PatentDocument. Even though this predicate should be used in the data only with the type Author, as stated in the ontology, it is also used with other types that are not in a subtype relation with the type Author, e.g., Affiliation. There is no subtype relationship between affiliation, author, conference instances, conference series, field of study, journal and paper. So here, we can deduce that either a concept that is a superconcept of all the above ones is missing, or the domain of this property should be better defined in the ontology. A similar potential error is also identified for the predicates makg:bookTitle, makg:paperCount and makg:rank.
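The domain check described above can be sketched as a simple scan over the patterns of a profile. This is only an illustration under stated assumptions: the data structures, the function names, and the `xsd:integer` object type in the sample patterns are hypothetical and do not reflect the ABSTAT output format; the subclass map is assumed to record at most one parent per type.

```python
# Declared domain of the predicate, as stated in the text.
DECLARED_DOMAIN = {"makg:citationCount": "makg:Author"}
# child -> parent map; empty because makg declares no subtype relations.
SUBCLASS_OF = {}


def is_subtype(candidate, ancestor, subclass_of):
    """True if `candidate` equals `ancestor` or descends from it."""
    seen = set()
    current = candidate
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)  # guards against cycles in the hierarchy
        current = subclass_of.get(current)
    return False


def domain_violations(patterns, declared_domain, subclass_of):
    """Return the patterns whose subject type violates the declared domain."""
    violations = []
    for subject_type, predicate, object_type in patterns:
        domain = declared_domain.get(predicate)
        if domain is not None and not is_subtype(subject_type, domain, subclass_of):
            violations.append((subject_type, predicate, object_type))
    return violations


# Three of the subject types mentioned in the text (object type is illustrative).
sample_patterns = [
    ("makg:Author", "makg:citationCount", "xsd:integer"),
    ("makg:Affiliation", "makg:citationCount", "xsd:integer"),
    ("makg:Paper", "makg:citationCount", "xsd:integer"),
]
violations = domain_violations(sample_patterns, DECLARED_DOMAIN, SUBCLASS_OF)
```

On this sample, the patterns with makg:Affiliation and makg:Paper as subject type are reported, while the pattern with makg:Author is accepted.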

The third problem is related to the cardinality values for some predicates. With ABSTAT, we were able to identify several patterns for which the cardinality values seem to indicate possible errors in the data. For instance, the pattern makg:Paper purl:creator makg:Author occurs 549,142,397 times in the data and has a maximum subject-objects cardinality equal to 6760. This means that at least one paper has 6760 different authors as creators. This number exceeds the usual number of authors per paper; thus, it might indicate a potential error in the data. Similarly, the pattern makg:Paper cito:cites makg:Paper has a maximum subject-objects cardinality equal to 27,036. This means that a given paper cites up to 27,036 different papers. Although we cannot say that this is an error, in practice papers cite up to about 100 other papers. Moreover, as we can see from the statistics that ABSTAT produces, the average number of cited papers for this pattern is equal to 20; thus, a cardinality greater than 27 thousand may indicate an error in the data. Other patterns that might indicate quality errors in the data are: papers that have 21 different languages (makg:Paper purl:language xmls:language), journal papers that have 17 different disciplines (makg:JournalPaper fabio:hasDiscipline makg:FieldOfStudy), journal papers that have up to 329 different URLs (makg:JournalPaper purl:hasURL owl:Thing), 5 different affiliations that have the same homepage (of type owl:Thing), etc.
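The kind of inspection described above can be partly automated by comparing the maximum and the average cardinality of each pattern. The sketch below is a heuristic, not an ABSTAT feature: the threshold of 100 is an arbitrary assumption, and the second pattern in the sample (with the made-up predicate `ex:appearsIn`) is hypothetical and included only for contrast. The first entry uses the maximum (27,036) and average (20) reported in the text.

```python
def suspicious_cardinalities(stats, ratio_threshold=100.0):
    """Flag patterns whose maximum cardinality is far above the average.

    `stats` maps a pattern to a (max_cardinality, avg_cardinality) pair.
    The threshold is an assumption to be tuned per dataset.
    """
    flagged = []
    for pattern, (max_card, avg_card) in stats.items():
        if avg_card > 0 and max_card / avg_card >= ratio_threshold:
            flagged.append(pattern)
    return flagged


sample_stats = {
    # Values reported in the text for the citation pattern.
    ("makg:Paper", "cito:cites", "makg:Paper"): (27_036, 20),
    # Hypothetical pattern with unremarkable cardinalities.
    ("makg:Paper", "ex:appearsIn", "makg:Journal"): (3, 1),
}
flagged = suspicious_cardinalities(sample_stats)
```

A flagged pattern is only a candidate for manual inspection: a very high maximum may be legitimate, as noted above for citations.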

6 Related work

This section gives an overview of state-of-the-art approaches that summarize Knowledge Graphs (Sect. 6.1) and approaches that have adapted distributed technologies to improve the scalability of processing graphs (Sect. 6.2).

6.1 Knowledge graph profiling

RDF graph profiling has been intensively studied, and various approaches and techniques have been proposed to provide a concise and meaningful representation of RDF KGs. Several recent surveys discuss approaches to profile knowledge graphs, such as [9, 44, 54]. Most of the work on KG profiling has been done in the field of KG summarization, which has been extensively surveyed in [9]. However, the related work discussed in this section is broader, as we focus not only on summarization approaches but also on profiling ones.

Loupe [28] is the approach most similar to ABSTAT. It extracts patterns that describe relations among types, along with a rich set of statistics about their use within the dataset. The triple inspection functionality provides information about triple patterns (of the form \(\langle subjectType, property, objectType\rangle \)) that appear in the dataset and their frequency. Loupe also extracts other information, such as the namespaces used in the dataset. Differently from ABSTAT, Loupe does not adopt a minimalization technique; thus, Loupe's profiles contain many more patterns and consequently are not as compact as ABSTAT profiles.
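The difference between pattern extraction with and without type minimalization can be illustrated on a toy graph. The sketch below is a simplification and not the ABSTAT or Loupe implementation: it assumes a type hierarchy with at most one parent per type, covers type minimalization only (not property minimalization), and uses made-up entity identifiers.

```python
from collections import Counter

# Toy subclass hierarchy (child -> parent); an assumption for illustration.
SUBCLASS_OF = {"dbo:City": "dbo:Place", "dbo:Actor": "dbo:Person"}


def ancestors(type_name, subclass_of):
    """All proper ancestors of a type in a single-parent hierarchy."""
    result = set()
    parent = subclass_of.get(type_name)
    while parent is not None and parent not in result:
        result.add(parent)
        parent = subclass_of.get(parent)
    return result


def minimal_types(types, subclass_of):
    """Keep only the types that are not an ancestor of another asserted type."""
    types = set(types)
    redundant = set()
    for type_name in types:
        redundant |= ancestors(type_name, subclass_of)
    return types - redundant


def extract_patterns(triples, entity_types, subclass_of, minimalize=True):
    """Count (subjectType, property, objectType) patterns over the triples."""
    counts = Counter()
    for subject, predicate, obj in triples:
        s_types, o_types = entity_types[subject], entity_types[obj]
        if minimalize:
            s_types = minimal_types(s_types, subclass_of)
            o_types = minimal_types(o_types, subclass_of)
        for s_type in sorted(s_types):
            for o_type in sorted(o_types):
                counts[(s_type, predicate, o_type)] += 1
    return counts


entity_types = {
    "ex:e1": {"dbo:Actor", "dbo:Person"},
    "ex:e2": {"dbo:City", "dbo:Place"},
}
triples = [("ex:e1", "dbo:birthPlace", "ex:e2")]
with_min = extract_patterns(triples, entity_types, SUBCLASS_OF, minimalize=True)
without_min = extract_patterns(triples, entity_types, SUBCLASS_OF, minimalize=False)
```

On this toy input a single relational assertion yields one pattern with minimalization and four without, which is the effect that makes non-minimalized profiles less compact.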

A data graph summary that assists users in formulating queries across multiple data sources by considering vocabulary usage is proposed in [8]. This approach extracts clusters called “node collections” to group a set of similar concepts and properties. The final aim of the paper is to help users formulate complex SPARQL queries efficiently. For this reason, a component called Assisted SPARQL Editor is developed. Similarly, ABSTAT patterns also help users formulate SPARQL queries, as they encode useful information to understand the structure of the data [46]. Differently, ABSTAT does not group nodes with similar characteristics and does not have an interface to help users formulate SPARQL queries (for this task, users can use the endpoint of the dataset itself, e.g., http://dbpedia.org/sparql).

In [13], structural summaries are constructed by using bisimilarity as the notion of equivalence to group nodes of a dataset, with the aim of providing users with a summary-based exploration. This is the backbone of S+EPPS, where summaries are made of blocks and each block represents a non-overlapping subset of the original dataset. Blocks are connected by edges that summarize the relationships between dataset nodes across blocks (e.g., :person, :city, :location, etc.). ABSTAT does not use bisimilarity and does not extract summary blocks but instead uses minimal type patterns to construct its summaries.

Structural equivalence is considered in [41] to provide users with a summary that has a reduced size with respect to the KG itself. This approach summarizes structurally similar subgraphs by considering them to be bisimilar if they cannot be distinguished by their outgoing paths. Additionally, ASSG, proposed in [52], uses structural similarity for summarizing knowledge graphs. The ASSG summary is constructed by considering equivalence classes induced by bisimulation relations, and it can adapt its structure to different query graphs. Similarly to the above approaches, ABSTAT profiles are compact with respect to the size of the KGs, but, differently from them, ABSTAT does not consider the structural similarity of graphs.

The semi-structured data summarization approach proposed in [10] is query-oriented and it has a very high computational complexity. The summary enables static analysis and helps formulate and optimize queries. The scope is to reflect whether the query has some answers on this graph, or to find a simpler way to formulate the query. Similar to ABSTAT, information that can be easily inferred is excluded from the summary.

Other approaches consider pattern mining to summarize KGs [3, 10, 38, 44]. Summarizing entities considering their neighborhood similarity up to a distance d is the aim of [44]. Users may specify a bound k as the maximum number of desired patterns to be included in the summary. The k d-summaries/patterns are chosen to maximize informativeness (the total amount of information, i.e., entities and their relationships in a KG) and diversity (covering diverse concepts with informative summaries).

A weighted summary composed of supernodes connected by superedges, resulting from the partitioning of the original set of vertices in the graph, is proposed in [38]. Edge densities between vertices in the corresponding supernodes are used as weights. A reconstruction error is introduced to measure the dissimilarity between the graph and the summary. The ABSTAT approach does not consider edge densities.

RDF graphs might be more comprehensible by reducing their size as proposed by [3]. Size reduction is a result of bisimulation and agglomerative clustering (one of the most common types of hierarchical clustering) which discovers subgraphs that are similar with respect to their structure. ABSTAT does not use clustering but instead reduces the number of patterns to be explored by adopting a minimalization technique.

Several works summarize KGs quantitatively to represent the content of the RDF graph, such as [5, 7, 16, 18, 19, 26].

SPADE allows exploring summaries through the prism of interesting aggregate statistics [16]. It uses OLAP-style aggregation to provide users with meaningful content of an RDF graph. Users may refine a given aggregate, by selecting and exploring its subclasses. The aggregation is centered around a set of facts, which are nodes of the RDF graph. LODSight [18] is a web-based tool that displays a summary based on type-property and datatype-property paths. The tool visualizes classes, datatypes and predicates used in the dataset with the aim to help users to quickly and easily find out what kind of data the dataset contains. It also shows how vocabularies are used in the dataset. This tool is similar to ABSTAT but it does not extract minimal types and is not maintained any more.

LODOP is a framework for executing, optimizing, and benchmarking profiling tasks in Linked Data [19]. These tasks include the calculation of: number of triples, average number of triples per resource/per object URI, number of properties, average number of property values, inverse properties, etc.

Thirty-two different statistical criteria for RDF datasets can be obtained with the LODStats profiling tool [5]. These statistics describe the dataset and its schema and include the number of triples, triples with blank nodes, labeled subjects, number of owl:sameAs links, class and property usage, class hierarchy depth, cardinality descriptors, etc. These statistics are then represented using the Vocabulary of Interlinked Datasets (VOID) and the Data Cube Vocabulary.Footnote 21

Several algorithms to compute different profiling, mining, or cleansing tasks [1] are implemented in a web browser tool called ProLOD++. The profiling task includes the calculation of: frequencies and distribution of distinct subjects, predicates and objects, range of predicates, etc. ProLOD++ can also identify predicate combinations that contain only unique values as key candidates to identify entities distinctly.

RDFStats generates statistics for datasets behind a SPARQL endpoint [26]. These statistics include the number of anonymous subjects and different types of histograms: a URIHistogram for URI subjects, and histograms for each property and the associated range(s). It also provides the total number of instances for a given class or a set of classes, and methods to obtain the URIs.

Differently from the above approaches, ABSTAT does not use aggregation methods for different summary resolution. Instead, it uses a terminology graph to extract only those patterns that describe relationships between instances of the most specific types.

6.2 Scalable graph processing

Graph processing approaches can be divided into two major categories: (1) centralized (storing the KG on a single node) and (2) distributed (distributing the KG among multiple cluster nodes). In this section, we focus only on the second category. Scalable graph processing has been reviewed recently by [2, 32, 33, 50]. Most of the approaches can be categorized by their main purpose, such as data storage, indexing, query languages, and query execution. These purposes are orthogonal; thus, a work may be classified in multiple categories.

SANSA is a graph processing tool that has adopted distributed technologies to enhance scalability [25]. It provides a unified framework for several applications such as link prediction, knowledge base completion, querying, and reasoning. It is built upon general-purpose processing engines such as Apache Spark and Apache Flink. Similar to ABSTAT, the architecture of SANSA is modular, with each component having its own functionality. Among the main functionalities of SANSA are: reading and writing native RDF or OWL data from HDFS; supporting different RDF and OWL serializations; providing different partitioning strategies (semantic-based, vertical, and graph-based partitioning); computing several RDF statistics (such as the number of triples, RDF terms, properties per entity, and usage of vocabularies across datasets); and applying quality assessment in a distributed manner.

The Entity Aware Graph compression technique (EAGRE) [53] proposes a new representation of RDF data on Cloud platforms with the aim of evaluating SPARQL queries with sequence modifiers such as projection, order by, etc., as quickly as possible. This approach stores RDF data in HDFS in a (key, value) form. The entity graph is partitioned among worker machines using an indexing structure that adopts a space-filling curve, a technique used to index high-dimensional data. The input and output costs for SPARQL query processing are minimized by efficient distributed scheduling. The goal of this approach is to reduce the number of data blocks that must be read for query evaluation.

Trinity.RDF [51] is a distributed in-memory RDF system that stores RDF data in its native graph form (i.e., representing entities as graph nodes, and relationships as graph edges). Each entity is stored with a unique id as key and, as value, an adjacency list with its incoming and outgoing edges. Such values contain the predicate and the node id of the connected nodes. Representing the graph in this way enables optimizations for SPARQL query processing and also supports more advanced graph analytics on RDF data. Trinity.RDF uses efficient in-memory graph exploration instead of join operations for SPARQL processing. A SPARQL query is decomposed into a set of triple patterns; for each pattern, matches are found first, and the graph is then explored starting from these matches. The exploration-based approach allows the exploration to be performed in parallel, thus saving time.

Similarly to [51], Triple Asynchronous and Distributed (TriAD) [23] uses graph-exploration strategies based on message passing. It adds a multi-threading layer for the paths of a query plan, which allows them to be executed in parallel. TriAD produces a summary graph using bisimulation (where only the predicates of the query triple patterns are labeled with constants) and locality-based summaries (where nodes that share some neighbors are spread across the partitions), with the aim of indexing compact synopses of the data graph. Since SPARQL queries typically involve finding connected components of the data graph, locality-based summaries are particularly effective for pruning.

SparkRDF is an RDF graph processing engine that implements SPARQL query processing on Spark with the aim of reducing the high I/O and communication costs [11]. The graph is divided into multi-layer elastic subgraphs based on classes and relations. Spark APIs and iterative join operations in distributed memory are employed to minimize the cost of the intermediate results needed to perform subgraph matching by triple patterns.

SemStore uses a Rooted Sub-Graph as the partition unit to partition and store the data, with the aim of efficiently localizing the four common types of SPARQL queries (SELECT, ASK, DESCRIBE, and CONSTRUCT) [49]. A k-means partitioning algorithm is used to avoid redundancy and to better localize the query types to cluster nodes. The architecture of SemStore is master-slave: queries are submitted to the master, while the slaves contain local indexes and statistics that are used during join processing.

S2RDF [42] partitions RDF data using ExtVP (Extended Vertical Partitioning), which relies on semi-join-based preprocessing, similar to join indices in relational databases, to efficiently minimize the query input size regardless of the query's triple patterns. Such partitioning considers the position of a join variable that occurs in both triple patterns to determine the columns on which tables must be joined. Regarding data modifications, updates and insertions are performed quickly by appending new triples to the ExtVP tables, while deletions are more complicated and slower.
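The semi-join reduction behind this kind of partitioning can be sketched in a few lines. The example below is an in-memory approximation under assumed names (`vertical_partition`, `semi_join`, the toy triples); S2RDF itself materializes such tables on Spark, which is not reproduced here.

```python
from collections import defaultdict

# Toy triples; predicates and values are hypothetical.
TRIPLES = [
    ("a", "follows", "b"),
    ("b", "follows", "c"),
    ("c", "follows", "d"),
    ("a", "likes", "i1"),
    ("c", "likes", "i2"),
]

S, O = 0, 1  # column positions in a (subject, object) table

def vertical_partition(triples):
    """One (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return tables

def semi_join(left, right, left_col, right_col):
    """Keep the rows of `left` whose `left_col` value occurs in `right_col` of `right`."""
    keys = {row[right_col] for row in right}
    return [row for row in left if row[left_col] in keys]

vp = vertical_partition(TRIPLES)

# Pattern pair: ?x follows ?y . ?y likes ?z
# The join variable ?y is the object of 'follows' and the subject of 'likes',
# so the reduced table joins column O of 'follows' with column S of 'likes'.
follows_reduced = semi_join(vp["follows"], vp["likes"], O, S)
reduction_ratio = len(follows_reduced) / len(vp["follows"])
```

Precomputing `follows_reduced` means a query containing this pattern pair reads one row instead of three, which is the sense in which the query input size is minimized.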

PRoST [14] (Partitioned RDF on Spark Tables) is a system that stores RDF data in a graph form using hash partitioning. It combines the Vertical Partitioning (VP) approach with the Property Table (PT) to translate SPARQL queries into Spark execution plans. Vertical Partitioning creates a table for each distinct predicate of the input graph, containing all (subject, object) tuples that are connected by that predicate. The Property Table is a single table in which each row contains a distinct subject and all object values for that subject, stored in columns identified by the property to which they belong. For query optimization, PRoST translates SPARQL queries into Join Trees guided by simple statistics. Triple patterns that share the same subject are grouped together into a single node with a special label (evaluated using the Property Table), while every remaining group, which consists of a single triple pattern, is translated into its own node (evaluated using Vertical Partitioning).
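The grouping rule for Join Tree nodes can be sketched as follows. This is a simplified illustration: the function name, the `"PT"`/`"VP"` labels, and the example query are assumptions, and the statistics-based ordering of nodes is left out.

```python
from collections import defaultdict

def build_join_tree_nodes(patterns):
    """Group triple patterns by subject. Groups with several patterns are
    labeled for the Property Table; single-pattern groups for Vertical Partitioning."""
    by_subject = defaultdict(list)
    for s, p, o in patterns:
        by_subject[s].append((s, p, o))
    nodes = []
    for group in by_subject.values():
        label = "PT" if len(group) > 1 else "VP"
        nodes.append((label, group))
    return nodes

# Hypothetical query with two patterns on ?film and one on ?actor.
QUERY = [
    ("?film", "title", "?t"),
    ("?film", "starring", "?actor"),
    ("?actor", "birthPlace", "?city"),
]

nodes = build_join_tree_nodes(QUERY)
labels = [label for label, _ in nodes]
```

The two `?film` patterns end up in one Property Table node, so they can be answered by reading a single row per subject, while the lone `?actor` pattern is answered from the `birthPlace` predicate table.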

Leon [22] is a distributed RDF system that mitigates the multi-query problem. It uses a partitioning scheme based on characteristic sets, which aims to capture the structure of the dataset and to detect common sub-structures efficiently and effectively in a batch of SPARQL queries. The initial cost of such partitioning is very low. RDF strings are encoded into numerical IDs, and a bi-directional dictionary is built that stores the IDs of characteristic sets and subjects. This dictionary is later used as an index for optimizing queries.
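A minimal sketch of the two ingredients mentioned above, characteristic sets and a bi-directional dictionary, is given below. It assumes the usual definition of a characteristic set (the set of predicates emitted by a subject); the helper names and toy triples are hypothetical, and Leon's actual encoding and partitioning are not reproduced.

```python
from collections import defaultdict

# Toy triples; all values are hypothetical.
TRIPLES = [
    ("book1", "title", "T1"),
    ("book1", "author", "p1"),
    ("book2", "title", "T2"),
    ("book2", "author", "p2"),
    ("p1", "name", "Ann"),
]

def build_dictionary(terms):
    """Bi-directional dictionary: term -> id (dict) and id -> term (list index)."""
    term_to_id, id_to_term = {}, []
    for term in terms:
        if term not in term_to_id:
            term_to_id[term] = len(id_to_term)
            id_to_term.append(term)
    return term_to_id, id_to_term

def characteristic_sets(triples):
    """Map each characteristic set (set of predicates) to the subjects sharing it."""
    predicates_of = defaultdict(set)
    for s, p, _ in triples:
        predicates_of[s].add(p)
    subjects_by_set = defaultdict(set)
    for s, preds in predicates_of.items():
        subjects_by_set[frozenset(preds)].add(s)
    return dict(subjects_by_set)

term_to_id, id_to_term = build_dictionary(t for triple in TRIPLES for t in triple)
cs = characteristic_sets(TRIPLES)
book_like = cs[frozenset({"title", "author"})]
```

Subjects with the same characteristic set, such as `book1` and `book2` here, have the same structure and can be placed together, so a star-shaped query over `title` and `author` only needs to visit that group.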

7 Conclusions

Processing and profiling big knowledge graphs can be a complex and challenging task, but it is becoming increasingly important when KGs are used for machine learning activities. In this paper, we present ABSTAT-HD, a minimalization-based profiling tool able to provide a profile for very large knowledge graphs. The modular architecture of ABSTAT makes it possible to benefit from the advantages of distributed computing. To overcome the limitations of the previous version, ABSTAT-HD scales horizontally by adopting technologies such as Apache Hadoop and Spark, which distribute the processing load of large datasets across clusters of computers using simple programming models. Thanks to its ability to detect and handle failures at the application layer, Apache Hadoop delivers a highly available service on top of a cluster of computers. Moreover, Apache Spark is a distributed computing framework that, unlike Hadoop's default compute engine MapReduce, runs in memory.

To evaluate the scalability of ABSTAT-HD, we profiled several datasets of different complexity, such as DBpedia and the Microsoft Academic Knowledge Graph (MAKG). Three orthogonal dimensions were considered during the profiling process: the size of the dataset, its complexity with respect to the number of types and predicates and ontological features, and the profiling type, which determines the overall workload of the profiling process. Experiments show that, given a fixed number of worker nodes, there is a linear relationship between the size of the dataset (in terms of number of triples) and the time needed to profile it. Moreover, experiments show that the complexity of the dataset does not impact the performance: ABSTAT-HD is able to compute both the core and the full profile even for very large and complex KGs such as DBpedia and MAKG. Clearly, full-profiling is slower than core-profiling, as for the former a greater set of statistics is computed. Finally, regardless of the size and the complexity of the dataset, full-profiling needs up to 3 times the time of core-profiling.

Moreover, we have shown that minimalization also has an impact on pruning the pattern space and on the execution time. In fact, minimalization halves the number of generated patterns (dbp-2014\(_{566M}\)) and speeds up the workflow execution by 31%. Furthermore, we showed that ABSTAT-HD can be up to \(\sim 9{\times }\) faster for core-profiling and up to \(\sim 35{\times }\) faster for full-profiling than the previous ABSTAT implementation.

Future work includes the enrichment of ABSTAT profiles with other statistics about the data. Moreover, we plan to represent profiles using existing vocabularies in order to foster the automatic analysis of profiles in the exploratory data analysis phase of any machine learning task based on a KG. Furthermore, we want to use ABSTAT-HD to profile sets of KGs in specific fields, such as biology and geography, with the aim of offering the community complete, precise and ready-to-use data based on the FAIR principles.