Comparison of Knowledge Graph Representations for Consumer Scenarios

Knowledge graphs have been widely adopted across organizations and research domains, fueling applications that span interactive browsing to large-scale analysis and data science. One design decision in knowledge graph deployment is choosing a representation that optimally supports the application’s consumers. Currently, however, there is no consensus on which representations best support each consumer scenario. In this work, we analyze the fitness of popular knowledge graph representations for three consumer scenarios: knowledge exploration, systematic querying, and graph completion. We compare the accessibility for knowledge exploration through a user study with dedicated browsing interfaces and query endpoints. We assess systematic querying with SPARQL in terms of time and query complexity on both synthetic and real-world datasets. We measure the impact of various representations on the popular graph completion task by training graph embedding models per representation. We experiment with four representations: Standard Reification, N-Ary Relationships, Wikidata qualifiers, and RDF-star. We find that Qualifiers and RDF-star are better suited to support use cases of knowledge exploration and systematic querying, while Standard Reification models perform most consistently for embedding model inference tasks but may become cumbersome for users. With this study, we aim to provide novel insights into the relevance of the representation choice and its impact on common knowledge graph consumption scenarios.


Introduction
The growth of the knowledge graph (KG) user base has triggered the emergence of new representational requirements. While RDF is the traditional and standard model for KG representation, alternative models such as property graphs [25], the Wikidata model [34], and RDF-star [12] have also recently become popular. The promise of these alternative and complementary representation models is that they can provide more flexibility to address certain use cases, such as statement annotation, for which RDF-based representations are not straightforward [17]. While the plurality of knowledge representation (KR) models provides the means to address a wider range of possibilities in consumer scenarios, there is currently no consensus nor sufficient empirical evidence on which representations are most suitable for different KG consumer tasks [16].
Previous studies comparing knowledge representations have focused primarily on query performance [2,6,14,26,28] and graph interoperability [3,4]. For this scenario, the representations need to ensure efficiency to minimize execution time. However, applications relating to exploration by end users and machine learning over KGs have not been taken into account [22,16]. Knowledge exploration scenarios, e.g., browsing, are impacted by representational choices, and therefore the selected representations should reduce the cognitive load and user expertise needed to explore, access, and acquire knowledge. Similarly, many embedding-model-based tasks such as knowledge graph completion [1,29,37] require adequate representations to maximize the performance of the models on downstream predictive tasks.
In this paper, we address the research question: How do different knowledge representation models impact common KG consumer scenarios? Based on three complementary KG consumer tasks (knowledge exploration, systematic querying, and knowledge graph completion), we define four concrete questions: (RQ1) Which representations facilitate faster knowledge exploration and acquisition? (RQ2) Are certain representations more intuitive for building accurate queries efficiently? (RQ3) How do representational query patterns impact query evaluation time? (RQ4) How do different representations affect the performance of KG embedding models for a KG completion task?
We investigate these research questions by assessing the fitness of four popular KR approaches: Standard Reification, N-Ary Relationships, Wikidata qualifiers, and RDF-star, for the needs of the abovementioned scenarios. First, to understand user preferences in knowledge exploration tasks, we run a user study where participants interact with a web browser interface and a query endpoint to determine the representation that improves knowledge acquisition for real-world questions. Then, to assess the differential performance of the representations, we test several queries using synthetic and real-world data. Lastly, to estimate the impact on KG embedding model performance, we train and evaluate a selection of these models for the KG completion task with different representations.
The rest of the paper is structured as follows: Section 2 introduces the four representation models. Section 3 describes the datasets used in the evaluation. The experimental setup and evaluation are organized by scenario, for knowledge exploration in Section 4, systematic querying in Section 5, and knowledge graph embedding models in Section 6. Section 7 reviews related work. The conclusions and limitations are discussed in Section 8.

Knowledge Representation Models
In this section, we describe different representation models that can be used for statement annotation, i.e., making statements about statements, a challenge that has motivated the development of several different KG representation approaches [24,26,27,12,7]. Figure 1 illustrates instances of these models for the main statement Jodie Foster received the Academy Award for Best Actress annotated with the additional statement for the work The Silence of the Lambs. Standard Reification [24] (Figure 1a) explicitly declares a resource to denote an rdf:Statement. This statement has rdf:subject, rdf:predicate, and rdf:object attached to it and can be further annotated with additional statements. The resource is typically a blank node; we simplify the encoding using a Wikidata Item identifier (shown as wd:X in the figure for brevity). N-Ary Relationships [27] (Figure 1b) converts a relationship into an instance that describes the relation, which can have attached both the main object and additional statements. This representation is widely used in ontology engineering as an ontology design pattern [10]. The Wikidata Model [7] (Figure 1c) is organized around the notion of Items. An Item is the equivalent of either a Class or an Instance in RDF and is described with labels, descriptions, aliases, statements, and site links. Statements represent triples (comprised of item-property-value) and can be further enriched with qualifiers and references. Qualifiers are property-value pairs that attach additional information to the triple. From this point onward, we refer to this representation as "Qualifiers". RDF-star [12] (Figure 1d) extends RDF to introduce triple reification in a compact way, allowing a triple to be quoted and used as the subject or object of other triples.
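As a lightweight illustration of the structural cost of these patterns, the example statement can be sketched as plain triple sets. Entity labels such as wd:JodieFoster, wd:SilenceOfTheLambs, and the N-Ary linking property are illustrative placeholders rather than actual Wikidata identifiers; only Q103618 (the award) and P166/P1686 follow the identifiers used later in the paper.

```python
# Encoding one annotated statement under two representations,
# using plain tuples as stand-ins for RDF triples.
# Main statement: Jodie Foster received (P166) the Academy Award
# for Best Actress (Q103618); annotation: for work (P1686)
# "The Silence of the Lambs". Identifiers mirror Figure 1.

# Standard Reification: the statement becomes a resource (wd:X)
# described by rdf:subject / rdf:predicate / rdf:object triples.
standard_reification = {
    ("wd:X", "rdf:type", "rdf:Statement"),
    ("wd:X", "rdf:subject", "wd:JodieFoster"),
    ("wd:X", "rdf:predicate", "wdt:P166"),
    ("wd:X", "rdf:object", "wd:Q103618"),
    ("wd:X", "wdt:P1686", "wd:SilenceOfTheLambs"),
}

# N-Ary Relationships: the relation itself becomes an instance
# (wd:Y) linking the subject, the main object, and the annotation.
n_ary_relationships = {
    ("wd:JodieFoster", "wdt:P166_relation", "wd:Y"),
    ("wd:Y", "wdt:P166", "wd:Q103618"),
    ("wd:Y", "wdt:P1686", "wd:SilenceOfTheLambs"),
}

# One annotated statement costs 5 triples under Standard
# Reification but only 3 under the N-Ary pattern.
assert len(standard_reification) == 5
assert len(n_ary_relationships) == 3
```

The size difference per annotated statement compounds at graph scale, which is visible later in the training-set sizes reported for the embedding scenario.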

Datasets
WD-AMC (Wikidata - Actors, Movies, and Characters) is a novel subset of Wikidata introduced in this paper to simulate a real-world scenario of manageable size. To this end, we first manually extract a list of actors, characters, and movies present in the questions provided by the WebQuestionsSP [35] and GoogleAI benchmarks. Then, we sample WD-AMC by taking all properties and values of the main Wikidata statements for the entities in this list, along with their qualifier properties and values. The WD-AMC subset and all its variants are created using the Knowledge Graph Toolkit (KGTK) [21].
For the query evaluation performance scenario, we use the REF benchmark [28], which compares different reification approaches by providing the Biomedical Knowledge Repository (BKR) dataset [30] in three representations: Standard Reification, Singleton Property, and RDF-star. We extend the available representations in REF to include the Qualifiers and N-Ary Relationships approaches. Table 1 presents the number of triples of both datasets in each representation. We make all datasets and their corresponding queries available online [19,20].

Knowledge Exploration Scenario
We define knowledge exploration as the process of interactively discovering and obtaining information available in knowledge graphs. We distinguish and study two knowledge exploration scenarios by asking users to (i) interact with a user-friendly interface, and (ii) build queries to systematically access the KG. We measure the time and accuracy of the participant responses for both scenarios.

Experimental Setup
User Study Setup. We conduct a user study composed of two tasks. The browser interaction task consists of answering 5 natural language questions by looking for the information in Wikidata via a KGTK browser interface. The answer should be provided as a Wikidata identifier (QXXXX). The endpoint interaction task consists of building SPARQL queries for the same natural language questions answered in the previous task, providing a machine-executable query as a response. In this task, participants can build and test the query in a triplestore provided for them. For both tasks, answers are submitted in free-text boxes, with no predefined options to choose from. Participants are provided with representative example responses in order to ensure that they submit the queries in a useful format for the posterior evaluation. We measure the time spent solving each query and the accuracy of the responses.

Data. For both tasks, we use the WD-AMC dataset described in Section 3. For the browser interaction task, we adapt the dataset to create three instances (one per representation) of the KGTK Browser, an adaptive user interface similar to Wikidata. For the endpoint interaction task, we generate the corresponding RDF graphs and upload them to a triplestore. We do not include RDF-star explicitly since its visualization in the first task is equivalent to the Qualifier model representation. All resources are accessible online for the participants.

Queries. We prepare 4 sets of 5 queries, extracting them from the QA benchmarks WebQuestionsSP [35] and GoogleAI, which contain real questions from users. One set of queries is presented in Table 2 in natural language. Each of the 4 sets contains variants of the same 5 queries to minimize participants' risk of copying, by altering specific elements in the questions (e.g., years, movies) while maintaining the structure. These 4 versions are applied to the three representations, resulting in 12 different sets of queries. We map these query sets to 12 groups of participants. Participants are provided with the same set of natural language questions for completing both tasks.
Participants. The user study is carried out with 45 students of a Master's level course on Knowledge Graphs. All students have similar backgrounds, are enrolled in a University Computer Science or Data Science program, and have learned about the KG representations in the course. The students first sign up for a voluntary assignment and are randomly divided into 12 groups, 4 groups per representation, each of them with a different query set. Then, they are sent a Google questionnaire with the representation description, instructions, questions for both tasks, and text boxes for the answers. These groups contain: 16 participants for Qualifiers, 16 for N-Ary Relationships, and 13 for Standard Reification.

Fig. 2: Measured time that participants spent retrieving the answers to the questions proposed in the browser interaction task.
Metrics. We analyze the results using ANOVA and t-tests on the response time measurements to look for significant differences among representation approaches per query. Both ANOVA and the t-test are used under the assumptions of (i) normality, (ii) sameness of variance, (iii) data independence, and (iv) variable continuity. ANOVA is first used for the three variables (Qualifiers, N-Ary Relationships, and Standard Reification). Then, the t-test is used to test pairs of representations per query. To test the significance of accuracy, we use the chi-squared test of independence and look into whether the accuracy and representation variables are independent. It is used under the following assumptions: (i) the variables are categorical, (ii) all observations are independent, (iii) cells in the contingency table are mutually exclusive, and (iv) the expected value of cells in the contingency table should be 5 or greater in at least 80% of the cells.
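A minimal sketch of this testing pipeline, assuming SciPy is available and using synthetic data (random normal samples and a made-up contingency table) in place of the measured participant responses:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic response times (seconds) for one query under the three
# representations; the real study uses measured participant times.
qualifiers = rng.normal(60, 15, 16)
n_ary = rng.normal(65, 15, 16)
std_reif = rng.normal(90, 15, 13)

# One-way ANOVA across the three representations.
f_stat, p_anova = stats.f_oneway(qualifiers, n_ary, std_reif)

# Pairwise t-test, e.g., Qualifiers vs. Standard Reification.
t_stat, p_pair = stats.ttest_ind(qualifiers, std_reif)

# Chi-squared test of independence between representation and accuracy,
# from a contingency table of correct/incorrect answer counts.
table = np.array([[12, 4],    # Qualifiers: correct, incorrect
                  [11, 5],    # N-Ary Relationships
                  [6, 7]])    # Standard Reification
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Assumption (iv): at least 80% of expected cell counts >= 5.
assert (expected >= 5).mean() >= 0.8
```

The assumption check on expected counts is what limits the chi-squared test to contingency tables with sufficiently large cell totals.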

Results
Browser interaction results. Figure 2 shows participants' time spent finding the answer in the browser interface for each query. We observe that the results for the three representations are overall similar, with significant differences in individual queries. The ANOVA test shows significant differences between the representation models (p-value < 0.05) for queries Q3-5 (Table 3). Our t-tests for pairs of representation models confirm the results of ANOVA for Q1-2, showing no significant differences. For Q3 and Q5, the measured time is significantly higher for Standard Reification, while for Q4, Qualifiers take significantly less time compared to the other representations. Thus, we observe that the time spent answering questions with Standard Reification is significantly higher for Q3-5, and the average time in Q1 and Q2 is slightly (but not significantly) higher. Answering RQ1, we conclude that, while participants can find the correct answers with any of the three representations, answering questions via knowledge exploration takes longer with the Standard Reification representational model. This finding is intuitive, as Standard Reification divides a triple into three triples, where only the object is the relevant element. The information is thus scattered and does not follow the "natural" direction of relationships between the elements. For instance, for answering Q3 "Who won the academy award for best actor in 2020?", in Qualifiers and N-Ary Relationships the information stems from the Academy Award for Best Actor (Q103916) node using the winner property (P1346); while for Standard Reification, the statement holds this information in separate relations: <wd:Statement rdf:subject wd:Q103916; rdf:predicate wdt:P1346>. Thus, the answer can only be accessed by referencing from the statement node, rather than directly as in the other representations.

Endpoint interaction results. Figure 3 shows the distribution of time spent on building the SPARQL queries and Table 4 shows
the accuracy of the results obtained when running the SPARQL queries submitted by the participants. In terms of time, we note small variations among the three representations, which are not significant according to the ANOVA test. In terms of accuracy, the results for this task show a higher portion of errors compared to the browser interaction task, and they vary highly among approaches and queries. The largest differences are observed for queries Q1 and Q5, where the Qualifiers model performs best (accuracy of 1 and 0.71) and the Standard Reification model has an accuracy of 0.36. In such queries, Standard Reification requires a UNION clause to retrieve the complete set of results, which is not needed for the other representations and thus increases the relative complexity of the correct query. The results of the other three queries are relatively close between the three representation models, with Qualifiers performing the worst on all of them. However, on average, Standard Reification produces the lowest accuracy, while N-Ary Relationships the highest. To test the significance of the accuracy results, we apply the chi-squared test (Table 4), showing significant values for Q1 and on average. Thus, the accuracy of results is in general dependent on the representation.

Fig. 3: Measured time that participants spent building the SPARQL queries in the endpoint interaction task.
Addressing RQ2, we observe that all three representations perform similarly in terms of the time it takes to interact with a SPARQL endpoint. However, for queries with a higher complexity, Standard Reification is more error-prone, as it requires additional clauses (e.g., UNION) to retrieve the complete set of results. Curiously, these results are similar to those for the browser interaction task in RQ1, in the sense that Standard Reification fares the worst among the three models, while Qualifiers and N-Ary Relationships alternately perform best depending on the query. Yet, the granular performance on individual queries and metrics overlaps only partially: in the browser interaction task, the gap is observed in terms of time and affects queries Q3-Q5; while in the endpoint interaction task, it is manifested in terms of accuracy for queries Q1 and Q5. These findings provide initial insights into the suitability of different representation models for knowledge exploration. We leave a more systematic comparison between queries in terms of their properties for future work.
Table 5: Characteristics of the WD-AMC queries in terms of the number of triple patterns and SPARQL clauses used per query. The RDF-star queries with quoted triples are marked with "*", while "SR" only applies to Standard Reification. The number of checkmarks (✓) indicates the number of times the clause appears.

Systematic Querying Scenario

The systematic querying scenario refers to the assessment of information retrieval efficiency with diverse query loads and structures. We analyze the differential behaviour of diverse series of queries over realistic and synthetic data for each representation, measuring its execution time.

Experimental Setup
Data. We use both datasets presented in Section 3. The WD-AMC dataset is used for analysing the behaviour with real-world queries on real-world data. We reuse the REF benchmark to validate our results against their previously reported analysis [28], and extend the resources to test and analyse two additional representations, Qualifiers and N-Ary Relationships.

Queries. For WD-AMC, we use 10 series of queries. Each series is comprised of 20 variants of one query extracted from the QA benchmarks WebQuestionsSP [35] and GoogleAI, which contain real questions from users posed to Google. We use the queries shown in Table 2, and we introduce five additional queries, extracted from the same QA benchmarks, to study a wider variety of query structures. The query variants are created by altering specific elements in the query (e.g., years, actors, movies, characters) while keeping the triple patterns intact. Their characteristics are shown in Table 5. They are selected to cover the patterns that differentiate the representation variants of WD-AMC. We note that, being extracted from real questions and not designed by us, not all queries require the use of the reification solution in all representations. This is especially notable for RDF-star, which is highlighted in the table, and directly affects the results described below. Still, all queries are different across the four representations.
The REF benchmark contains 12 queries divided into three series per approach (A, B, and F). These series contain queries with variable length, complexity, and computationally expensive clauses such as COUNT or FILTER, or operators such as greater than (Table 6). For more details on the queries, we refer to [30] for series A, [26] for series B, and [28] for series F.

Implementation. Previous studies [14] show no significant differences between the proposed representations across different triplestores. For that reason, we only use Jena Fuseki, an open-source implementation that can process all four representations. Both datasets were uploaded to a Jena Fuseki v4.6.1 triplestore running on a single-node VM with a 4-core CPU and 16GB of main memory. To measure the efficiency of each representation, we measure the evaluation time while assessing the query complexity. Each set of queries for both datasets is run in warm mode three times, and the average is shown as a result.
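The warm-mode averaging protocol can be sketched as a small, generic helper. In the actual setup, the run_query callable would wrap an HTTP request to the Fuseki SPARQL endpoint; here it is any zero-argument function:

```python
import time

def warm_avg_runtime(run_query, warmup=1, repeats=3):
    """Average evaluation time over `repeats` runs, after `warmup`
    discarded runs that load caches (the "warm mode" protocol)."""
    for _ in range(warmup):
        run_query()           # warm-up runs are timed-out and discarded
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        times.append(time.perf_counter() - start)
    return sum(times) / repeats

# Example with a stand-in workload instead of a SPARQL call.
avg = warm_avg_runtime(lambda: sum(range(100_000)))
assert avg > 0
```

Discarding cold runs keeps one-off costs such as cache population and JIT warm-up out of the reported averages.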

Results
Results on WD-AMC. Figure 4 shows the average query evaluation time on the WD-AMC benchmark. As all queries are naturally asked by users, they are typically not computationally expensive, returning results in less than 100 ms. Q8 is the only exception, requiring more time as it contains additional clauses (2 functions and 2 FILTER; cf. Table 5). In this case, N-Ary Relationships take longer (around 1 minute) as it contains more triple patterns than the other representations. Most of the remaining queries show similar times, with Qualifiers providing the quickest response on average. In some queries, Standard Reification needs a higher number of triple patterns compared to the other representations, resulting in higher response times. For Q5, in addition, this representation requires the UNION clause to provide the complete set of results, a clause that is not needed in the other representations. RDF-star performs the best for nearly all queries, which may be due to having the smallest size and lowest number of triple patterns per query. The queries for which this representation shows a significantly increased response time involve expensive clauses or joins.

Results on the REF Benchmark. Figure 5 depicts the results obtained for the evaluation using the REF benchmark. RDF-star and Standard Reification show similar results to those reported by Orlandi et al. [28]. We observe that for Qualifiers, N-Ary Relationships, and Standard Reification, the measured times follow a similar pattern in query evaluation time for all queries, in contrast to RDF-star. Qualifiers obtain the best results for almost all queries, presenting a total execution time reduced by half compared to the other representations.
The measured response times for RDF-star show a completely different behavior. Most of the queries for which this representation presents an increased response time involve joins. Meanwhile, RDF-star is not affected as much as the other representations by the use of operators such as greater than (Table 6).
Addressing RQ3, query performance on both datasets is sensitive to the choice of representation. The Qualifiers representation shows the quickest results for demanding queries on average. Together with N-Ary Relationships and Standard Reification, these representations are usually less differentially affected by a particular clause or pattern compared to RDF-star, as these three show similar behaviour per query. RDF-star stands out for good performance for simple queries (i.e., with one quoted triple) and logical operators such as greater than, but it becomes highly inefficient when joins and clauses like FILTER or COUNT are involved. RDF-star introduces quoted triples, a new element in the syntax that implies different processing by triplestores compared to the other representations, making it the most susceptible to variation between queries.

Knowledge Graph Embedding Scenario
Machine learning applications over graphs have relied on the idea of knowledge graph embedding (KGE). KGE provides a mechanism to map the nodes and edges in a KG into a high-dimensional vector space. The resulting vector representations are then used for a variety of tasks, including link prediction, node classification, graph classification, and entity resolution. In this section, we evaluate popular KGE methods on the task of knowledge graph completion (KGC), which addresses the sparsity of a KG by predicting an object node given a subject node and a relation. We study the impact of the different KG representations on KGC performance in terms of mean reciprocal rank (MRR) and hits@K.
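Both evaluation metrics can be computed from the 1-based rank that the model assigns to the true object among all scored candidates for each test triple; a minimal sketch:

```python
def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """Mean reciprocal rank and hits@K from the 1-based rank of the
    true object among all scored candidates, per test triple."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hits = {k: sum(r <= k for r in ranks) / n for k in ks}
    return mrr, hits

# Example: true objects ranked 1st, 2nd, and 10th by the model.
mrr, hits = mrr_and_hits([1, 2, 10])
assert round(mrr, 2) == 0.53          # (1 + 1/2 + 1/10) / 3
assert hits[1] == 1 / 3 and hits[10] == 1.0
```

MRR rewards placing the true object near the top of the ranking, while hits@K only checks whether it falls within the first K candidates.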

Experimental Setup
Data. We use the WD-AMC dataset described in Section 3. RDF-star is not used in this evaluation because the models that use this representation as their input are not fit for large-scale data. To create the evaluation data, we first extract all the triples containing a statement node and an object node from each representation. In the example shown in Fig. 1, this would yield the triples <wd:X, rdf:object, wd:Q103618>, <wd:Y, wdt:P166, wd:Q103618>, and <wds:Z, ps:P166, wd:Q103618>. Next, for each representation, we randomly sample 10% of the respective triples into a test and validation set, and combine the rest with the remaining triples of each representation to create the training sets. After this procedure, we end up with the following number of triples for our experiments: 1) validation set: 189,839, 2) test set: 189,839, 3) Qualifiers train set: 5,128,864, 4) N-Ary train set: 3,610,155, and 5) Standard Reification train set: 5,440,482.

Methodology. KGE models learn a mathematical relationship between entities and relations that allows them to produce a likelihood score for an arbitrary triple. The most common architecture for these models is the encoder-decoder framework [11]. Traditionally, the encoder consists of learnable shallow embeddings (E, R) that map each entity or relation to a high-dimensional vector, and the decoder is a function that takes in the high-dimensional representations and produces the likelihood score. Formally, these models can be defined as L(s, r, o) = ψ(E(s), R(r), E(o)), where (s, r, o) is the given arbitrary triple, E is the shallow embedding for entities, R is the shallow embedding for relations, ψ is the decoder function, and L is the produced likelihood score. For our experiments, we use three of the most common KGE models with publicly accessible large-scale implementations [23,36], namely RotatE [31], ComplEx [32], and TransE [5]. We exclude graph neural network models from this study, as none of them have publicly accessible large-scale implementations.

Table 7: KGC results for three KGE models. We bold the best result for a model.
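A minimal sketch of this encoder-decoder formulation, using TransE's translational decoder as ψ; random vectors stand in for trained embeddings, and the entity/relation counts are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_entities, n_relations = 8, 5, 2

# Shallow encoders: lookup tables mapping integer IDs to vectors.
E = rng.normal(size=(n_entities, dim))   # entity embeddings
R = rng.normal(size=(n_relations, dim))  # relation embeddings

def score(s, r, o):
    """TransE decoder: L(s, r, o) = -||E(s) + R(r) - E(o)||.
    Higher (less negative) scores mean more plausible triples."""
    return -np.linalg.norm(E[s] + R[r] - E[o])

# KG completion ranks every candidate object for a given
# (subject, relation) pair by its likelihood score.
scores = np.array([score(0, 1, o) for o in range(n_entities)])
predicted = int(scores.argmax())
assert scores.shape == (n_entities,)
```

ComplEx and RotatE keep the same L(s, r, o) = ψ(E(s), R(r), E(o)) shape but swap in complex-valued decoders, which is where their additional expressive power comes from.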

Results
Table 7 presents the experimental results of the models above on the KGC task. We observe that each model has a specific best-performing representation, with significant performance differences compared to the other representations. However, the performance gap decreases as the expressive power of the model increases, indicating that the sheer expressive power of the models could potentially overcome representational disadvantages. From the perspective of representations, we observe that although two of the top three best-performing models use N-Ary Relationships, Standard Reification is the best representation on average. This phenomenon showcases the potential pitfall of arbitrarily mixing and matching models with representations. Regarding RQ4, our experimental results showcase the importance of associating models with a suitable representation. Failing to do so may lead to degraded performance, as evident from the results obtained on N-Ary Relationships with ComplEx. Throughout our experiments, no single representation consistently achieves the highest performance; however, we observe that a model with superior expressive power, i.e., RotatE, can overcome potential representational shortcomings and perform consistently well over all representations.

Related Work
Multiple studies have assessed the efficiency of different representations in terms of data management efficiency with computationally expensive queries, measuring metrics such as execution time, storage size, or number of triples. The Singleton Properties [26] model is compared in its inception with Standard Reification [24] using two series of queries of increasing complexity. Hernández et al. [14] test Wikidata represented in N-Ary Relationships, Standard Reification, Singleton Properties, and Named Graphs over multiple triplestores. They remark on the processing difficulties for Singleton Properties due to the high number of created properties. In a follow-up work, Hernández et al. [15] investigate how Wikidata in the same representations as their previous study performed on two SPARQL triplestores (Virtuoso and Blazegraph), a relational database (PostgreSQL), and a graph database (Neo4J), reporting that Virtuoso performed best among the data stores. Frey et al. [9] extend this work's evaluation to the DBpedia dataset and its representations to include their proposal Companion Property and Blazegraph's Reification Done Right (currently known as RDF-star). This work employs several data stores, highlighting which representation performs best in each store. More recently, the REF benchmark [28] is proposed to evaluate different reification approaches with the inclusion of RDF-star, providing a version of the Biomedical Knowledge Repository (BKR) dataset [30] and three series of queries for each representation that can be applied to different triplestores. Additionally, there are studies that compare Property Graphs and RDF with regard to query performance [6,2] and model interoperability [3,4], highlighting the advantages of Property Graphs.
Hence, studies so far focus on the behavior of multiple representations over triplestores, and occasionally relational databases or other graph databases. To the best of our knowledge, there is no comprehensive evaluation of KG representations across consumer scenarios beyond triplestores and studies of query efficiency, a gap filled by the extensive evaluation of different representations and scenarios in this paper. Yet, we include the systematic querying scenario in our work to (i) validate our experimental setup with the REF benchmark, obtaining similar results to those previously reported, (ii) enrich it with two new representations (N-Ary Relationships and Wikidata qualifiers), and (iii) mimic a real-world scenario with real data and queries.

Summary of results
In this paper, we assessed the fitness of different knowledge graph representations in three consumer scenarios: knowledge exploration, systematic querying, and KG embedding. While no single representation model was optimal for all scenarios, we found significant differences for particular consumer scenarios, summarized in Table 8. We can extract the following conclusions from our study: (i) Standard Reification is the least suitable for users. Its counter-intuitive structure is time-consuming to navigate, and it introduces additional complexity to retrieve correct and complete information. (ii) RDF-star still needs improved support in all studied scenarios, as it is on its way to becoming part of the RDF 1.2 specification [13]. At the moment, it is risky to use it in high-demanding scenarios. (iii) Qualifiers steadily obtain better results for retrieving results in high-demanding querying scenarios. Despite being restricted to Wikidata at the moment, its adoption in more knowledge graphs could be considered. (iv) Analysing and understanding how each embedding model works is key to selecting a representation for graph completion (and hence, additional embedding-based tasks). While for the other scenarios all representations showed acceptable behaviour, here the decision is critical. (v) Promoting the use of interfaces such as browsers greatly improves the user experience in knowledge exploration. These interfaces help mask the representation complexity and differences, which directly influences the adoption and usability of semantic resources, an aspect usually overlooked. (vi) Despite the lack of a good-for-all solution, promoting interoperability among representations when knowledge graphs are consumed for very different purposes is potentially useful.

Table 8: Summary of the fitness for each representation evaluated in the studied scenarios (columns: simple graphs and queries; large graphs and demanding queries), where ✓ means suitable, ∽ is acceptable, X is avoidable, and * indicates that the value is not tested but equivalent to Qualifiers.

Limitations
Knowledge graphs are built for reuse across different applications, and their fitness for these applications depends on representational choices. Our paper provided insights into the fitness of four representations for three diverse consumption scenarios. We reflect on three key design decisions and point to future extensions of this study that can increase the significance of its findings.
1. Representational choices - Our study considered a subset of representations, prioritized for their popularity in prior work. Namely, we selected two RDF-based representations (the Standard Reification proposed in the first RDF Recommendation and the widely used N-Ary Relationships), the Wikidata model, and RDF-star. Other RDF-based representations (e.g., Singleton Properties) and property graphs were out of the scope of this paper, but their inclusion in follow-up work is valuable. The choice of representations directly influences the selection of particular techniques and tools, as a representation may be limited to a single technology and cannot be processed by others. This prevents evaluating all existing representations under the exact same conditions. To ensure fair evaluation conditions, property graphs were not included in this work. Hence, further work is needed to include additional relevant representations while ensuring that differences in performance are due to the representation and not the underlying implementation.
2. Task choices - The tasks studied in this paper were derived by surveying popular KG tasks from recent research and associating them with three consumer scenarios. For knowledge exploration, we selected two representative tasks; yet, it remains to be seen whether our findings will generalize to other exploratory tasks, e.g., graph navigation visualization. For systematic querying, we extend previous studies to address a real-world scenario. The generality of this scenario can be increased with a complete real-world dataset instead of a subset with increasingly complex queries, and with additional query languages (e.g., Cypher [8] or Linked Data Fragments [33]). For KG embedding, a vibrant research community has explored a vast space of learning methods, applications, and benchmarks. We limit the task to KG completion and investigate simpler, widely used models to focus on a common real-world scenario. We believe these choices balance the likely trade-offs in many deployed settings, where institutional knowledge and dataset size preclude the adoption of more advanced techniques. Alternatives in the task space (e.g., node classification or entity resolution) and the model space, where graph neural network-based architectures can more directly represent complex graph structures, are promising extensions we hope to investigate in future work.
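The impact of representation on query complexity can be sketched with a toy conjunctive triple-pattern matcher (our illustration under simplifying assumptions, not the paper's benchmark code; variables start with `?`, and an RDF-star quoted triple is modelled as a nested tuple):

```python
# Toy triple-pattern matcher: retrieving an annotated statement needs four
# patterns under Standard Reification but only one under an RDF-star-like
# encoding. Data and IRIs are hypothetical.

def match(graph, patterns):
    """Return variable bindings for a conjunction of triple patterns."""
    results = [{}]
    for pattern in patterns:
        next_results = []
        for binding in results:
            for triple in graph:
                b = dict(binding)
                ok = True
                for term, value in zip(pattern, triple):
                    if isinstance(term, str) and term.startswith("?"):
                        if b.get(term, value) != value:  # conflicting binding
                            ok = False
                            break
                        b[term] = value
                    elif term != value:  # constant term mismatch
                        ok = False
                        break
                if ok:
                    next_results.append(b)
        results = next_results
    return results

# Standard Reification: four patterns to reach the annotation.
reified = [
    ("ex:stmt1", "rdf:subject", "ex:Alice"),
    ("ex:stmt1", "rdf:predicate", "ex:employer"),
    ("ex:stmt1", "rdf:object", "ex:AcmeCorp"),
    ("ex:stmt1", "ex:startDate", "2020"),
]
q_reified = [
    ("?s", "rdf:subject", "ex:Alice"),
    ("?s", "rdf:predicate", "ex:employer"),
    ("?s", "rdf:object", "ex:AcmeCorp"),
    ("?s", "ex:startDate", "?date"),
]

# RDF-star-like encoding: a single pattern suffices.
starred = [(("ex:Alice", "ex:employer", "ex:AcmeCorp"), "ex:startDate", "2020")]
q_star = [(("ex:Alice", "ex:employer", "ex:AcmeCorp"), "ex:startDate", "?date")]

print(match(reified, q_reified))
print(match(starred, q_star))
```

Both queries bind `?date` to the same value, but the reified form requires a join over four patterns, which is the kind of structural overhead the REF benchmark queries measure.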
3. Participant sample - The remaining limitation refers to the sample of participants in the knowledge exploration user study. For this first study, we chose Master's students because of their short and comparable experience with knowledge graph representations and tasks. We considered them a better sample than fellow practitioners, who may already be biased toward the representations and technologies they use. As a next step of this work, we plan to include a more varied sample of participants, e.g., industry experts, junior students, academics, and software developers.

Fig. 5: Measured query evaluation time for the REF benchmark.

Table 1: Number of triples (×10⁶) for the WD-AMC dataset and the REF benchmark in the analysed representations.

Table 2: Set 1 of questions used in the user study with their corresponding identifier (ID) and the QA benchmark from which they were extracted (Source).

Table 3: P-values of ANOVA and t-test for the time that participants spent retrieving the answers to the questions proposed in the browser interaction task. Significant values (p-value < 0.05) are highlighted in bold.

…representation less fit than the other two for this task. To obtain the accuracy of the responses, we measure the proportion of correct responses retrieved. Nearly all of the received answers are correct. Only in one query with N-Ary Relationships and Standard Reification is a wrong answer submitted. We run a chi-squared test per query, and the results show no significant differences among the approaches in terms of accuracy.

Table 4: Proportion of correct responses returned by the SPARQL queries built in the endpoint interaction task. The highest accuracy per query is highlighted in bold, and the lowest is underlined. The p-values of the chi-squared tests are also shown. Significant values (p-value < 0.05) are marked with *.

Table 6: Characteristics of the queries in the REF benchmark regarding the number of triple patterns and SPARQL clauses used per query. The number of checkmarks (✓) indicates the number of times the clause appears. GT stands for the greater-than operator.