1 Introduction

With the popularization of information systems and digital devices, enterprises and organizations accumulate a large amount of valuable or sensitive data locally or in the cloud [14, 23, 33]. Once these data are leaked or used maliciously, the result can be significant economic loss or a serious threat to users’ privacy [7, 25, 29]. Securing sensitive information is therefore important both for protecting customers and for attracting users [8, 43]. Access control is recognised as the first line of defence, guaranteeing that only authorized users can gain access to sensitive data and thus preventing data leakage [22, 25, 30, 32].

The two most widely used categories of access control strategies are role-based access control (RBAC) and attribute-based access control (ABAC) [25, 28, 31]. The former assigns permissions based only on users’ roles, which makes it simple to implement and explains its wide use in the past [4, 24]. However, with the expansion of information system scale and the proliferation of users, RBAC strategies are too coarse-grained to meet the needs of sensitive data protection [27, 36]. By contrast, ABAC strategies adopt carefully crafted policies based on multiple attributes of users, environment and resources to assign data access permissions. ABAC strategies have become more popular because they are more fine-grained and flexible than RBAC strategies [17, 34]. For example, the work in [9] proposed an ABAC mining algorithm named Rhapsody to mine ABAC rules from sparse logs.

However, evolving information technologies and changes in users’ behaviours bring new challenges [13, 15]. One of the biggest problems is policy explosion, in which the scale of the policy set increases dramatically [18, 35]. The main reason for policy explosion is that users’ roles in organizations are becoming more diverse and people are using more devices to access data from more places. Policy explosion has two direct consequences: decreased system efficiency and increased misconfiguration [12, 16, 20].

To overcome these problems, a growing number of researchers are exploring machine learning based access control strategies, which treat access control decision-making as a binary classification problem. Sample features for classifier training come from the available user, environment and resource attributes. The corresponding sample labels come from verified access control log files. Some works have successfully classified historical access control records using machine learning methods with high accuracy. However, to our knowledge, no related work discusses machine learning based access control from the perspective of a data stream. In reality, access control requests form a data stream that feeds into the decision-making models. Therefore, our previous paper [42] proposed a consecutive batch learning framework to tackle possible concept drifts by periodically updating the machine learning classifier with new samples.

Furthermore, dynamic class imbalance problems exist in real-world access control applications [39, 41]. In other words, most requests are legitimate and valid, and only a very small number of requests are denied because of mishandling or malicious attacks. For access control, a rejected access request usually means a malicious access request, which belongs to the minority class. Misclassification of the minority class can cause severe data leakage. Therefore, improving the classification performance on the minority class (access denial) is vital for an access control problem.

To boost the performance on the minority class for access control, our previous work [42] proposed a Boosting Window (BW) algorithm within an adaptive incremental batch learning framework. Although experimental results demonstrated that this work can enhance the performance on the minority class, the overall performance is still unsatisfactory because of the limited available attributes and the poor encoding and feature representation methods for high cardinality categorical data. For example, the manager ID is an essential user attribute related to the possible access permission to a specific system resource. However, in a large organization, such as Amazon, there will be millions of different manager IDs. In this case, the values of the manager ID are high cardinality nominal categorical data.

In general practice, one-hot encoding, binary encoding and label encoding are the most popular methods for categorical data encoding [38]. When encoding high cardinality nominal categorical data, all of them have serious disadvantages. Label encoding can mislead the classifier because of the large differences between the assigned numerical values. For example, the classifier can falsely give more weight to a manager with an ID of 100,000 than to a manager with an ID of 1. One-hot encoding addresses this problem but results in another serious problem, the curse of dimensionality. Binary encoding adopts binary code to represent ordinal values, working as a compromise between label encoding and one-hot encoding. However, binary encoding fails to represent relationships between different samples with the same attribute value.
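To make the dimensionality trade-off concrete, the following minimal sketch counts how many feature columns each encoding would need for a single categorical attribute. The function name `encoded_width` and the cardinality of one million are illustrative assumptions, and the binary width assumes zero-based ordinal codes; specific encoder libraries may differ by a column.

```python
def encoded_width(cardinality: int) -> dict:
    """Feature columns needed to encode one categorical attribute (illustrative)."""
    return {
        # Label encoding: a single integer column, which imposes a spurious order.
        "label": 1,
        # One-hot encoding: one indicator column per distinct value.
        "one_hot": cardinality,
        # Binary encoding: bits needed for zero-based codes 0 .. cardinality - 1.
        "binary": max(1, (cardinality - 1).bit_length()),
    }


# Hypothetical attribute with one million distinct manager IDs.
widths = encoded_width(1_000_000)
print(widths)
```

Under these assumptions, one-hot encoding needs one column per manager, whereas label and binary encoding stay compact but carry the ordering and relationship problems described above.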

Instead of encoding the values of different attributes as non-topological features for access control decision making, we extract relationships between entities from attributes and further extract topological features from these relationships to train access control classifiers. To better represent users and resources and illustrate the relationships between them, this paper constructs an access control domain-specific knowledge graph to assist decision making. As an extension of our previous work, we leverage the knowledge graph to handle user and resource attributes with high cardinality values and thereby further boost the performance on the minority class. Compared with the work in [42], our main contributions are as follows.

  1. We propose a knowledge graph empowered online learning framework for access control decision making. To the best of our knowledge, this paper is the first attempt to leverage a knowledge graph to extract graph topological features to improve the performance of the access control model.

  2. We propose an algorithm to construct a knowledge graph from the existing user and resource attributes. We further demonstrate how to extract features from the established knowledge graph to represent users and resources. The extracted features are fed to a machine learning classifier to make access control decisions based on records in log files.

  3. We evaluate and verify the proposed knowledge graph empowered online learning framework on a much larger open-source real-world dataset and discuss the performance under different imbalance degrees in both online and offline scenarios.

The rest of the paper is organised as follows. Section 2 briefly introduces knowledge graph basics, typical graph topological features and link prediction solutions. Section 3 presents the workflow of the proposed framework, followed by an access control domain-specific knowledge graph construction algorithm and feature extraction details in Section 4. Section 5 displays the experimental results and concludes the paper with a discussion on future work.

2 Related work

A knowledge graph (KG), denoted as \(\mathcal {G}\), is a multi-relational graph composed of entities as nodes and relations as different types of edges [37]. An instance of an edge is a triplet of fact (head entity, relation, tail entity), denoted as (h, r, t). Apart from the graph-structured data model, both entities and relationships can have multidimensional properties to further describe complex data. KG is often used to represent interlinked facts, allowing both humans and computers to extract useful knowledge and further to do reasoning and prediction based on its contents. Typical ways to analyse a knowledge graph include but are not limited to (1) node classification to predict the type of a given node; (2) link prediction to predict whether two entities are linked or not; (3) community detection to identify densely linked entity clusters and (4) network similarity measurement to evaluate the similarity between two nodes or two networks.

The access control problem can be formulated as a link prediction problem between user entities and resource entities, which is essentially a binary classification problem. Specifically, if an access approval link exists between a user entity u and a resource entity r, the access request (\(u\rightarrow r\)) will be approved. Otherwise, it will be refused. Once a knowledge graph has been constructed, a variety of graph topological features can be extracted to describe the local or global connections between entities based on homogeneous or heterogeneous subgraphs within the knowledge graph.

A basic solution for link prediction is structural similarity-based unsupervised learning, which determines the likelihood of a link between two nodes from similarity or closeness indices deduced from the graph structure. When an index between two nodes exceeds a predefined threshold, the nodes are considered to be linked. Common Neighbours (CN) [10], which counts the nodes shared by two nodes, is the most intuitive index of how likely they are to be linked. Similar indices, to name a few, include Adamic Adar (AA) [2], Preferential Attachment (PA) [3] and Resource Allocation (RA) [44]. Their definitions are listed as (1)-(4) for reference.

$$C N(u, v)=|\mathcal{N}(u) \cap \mathcal{N}(v)|,$$
(1)
$$A A(u, v)=\sum\limits_{w \in \mathcal{N}(u) \cap \mathcal{N}(v)} \frac{1}{\log |\mathcal{N}(w)|},$$
(2)
$$P A(u, v)=|\mathcal{N}(u)| *|\mathcal{N}(v)|,$$
(3)
$$R A(u, v)=\sum\limits_{w \in \mathcal{N}(u) \cap \mathcal{N}(v)} \frac{1}{|\mathcal{N}(w)|},$$
(4)

where u, v, w are nodes in the target graph, \(\mathcal {N}(\cdot )\) denotes the set of nodes adjacent to the specified node in the brackets, and |⋅| denotes the number of distinct nodes in the specified set. These indices are widely used in various domains because of their simplicity and reasonable performance. However, they only consider the node pair’s local connectivity and ignore the global structure of a graph.
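The four local indices in (1)-(4) can be sketched directly from their definitions. The snippet below is a minimal illustration on a hypothetical five-node undirected graph stored as an adjacency dictionary; the function names and the toy graph are assumptions made for the example and are not taken from the paper.

```python
import math

# Hypothetical undirected toy graph: node -> set of adjacent nodes.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}


def common_neighbours(adj, u, v):
    """Equation (1): number of nodes adjacent to both u and v."""
    return len(adj[u] & adj[v])


def adamic_adar(adj, u, v):
    """Equation (2): shared neighbours weighted by 1 / log(degree)."""
    return sum(1.0 / math.log(len(adj[w])) for w in adj[u] & adj[v])


def preferential_attachment(adj, u, v):
    """Equation (3): product of the two node degrees."""
    return len(adj[u]) * len(adj[v])


def resource_allocation(adj, u, v):
    """Equation (4): shared neighbours weighted by 1 / degree."""
    return sum(1.0 / len(adj[w]) for w in adj[u] & adj[v])


cn = common_neighbours(adj, "a", "b")
aa = adamic_adar(adj, "a", "b")
pa = preferential_attachment(adj, "a", "b")
ra = resource_allocation(adj, "a", "b")
print(cn, aa, pa, ra)
```

Nodes a and b share the neighbours c and d; AA and RA discount the higher-degree neighbour d more strongly than c, which is the intended effect of the degree weighting.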

By contrast, global connectivity indices can provide more information about the overall graph topology. A well-known index that takes global connectivity into account is the Katz Index (KI), which leverages the lengths of the paths between a pair of nodes to measure their similarity. KI can be calculated as (5) [11].

$$K I(u, v) ={\sum}_{l=1}^{l_{\max }=\infty} \beta^{l} \cdot\left|\text{path}_{u, v}^{l}\right|,$$
(5)

where l is the length of a path between nodes u and v, \(\left |\text {path}_{u, v}^{l}\right |\) is the total number of distinct paths between node u and v with length l, β is a coefficient between 0 and 1 used to adjust the contribution of paths to KI.
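As a rough illustration of (5), the sketch below evaluates a truncated Katz index by dynamic programming over path lengths. It is written under two stated assumptions: the infinite sum is cut off at a hypothetical `max_len`, and paths are counted as walks (nodes may be revisited), which is the usual convention when the index is computed from powers of the adjacency matrix. The function name and the three-node path graph are illustrative only.

```python
def katz_index(adj, u, v, beta=0.5, max_len=4):
    """Truncated Katz index: sum over l = 1..max_len of beta**l * (#walks of length l)."""
    # counts[node] = number of walks of the current length from u to node.
    counts = {u: 1}
    score = 0.0
    for length in range(1, max_len + 1):
        extended = {}
        for node, n_walks in counts.items():
            for neighbour in adj[node]:
                extended[neighbour] = extended.get(neighbour, 0) + n_walks
        counts = extended
        score += beta ** length * counts.get(v, 0)
    return score


# Hypothetical path graph a - b - c.
path_graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
ki = katz_index(path_graph, "a", "c", beta=0.5, max_len=4)
print(ki)
```

On this toy graph there is one walk of length 2 and there are two walks of length 4 from a to c, so the truncated sum is 0.5² · 1 + 0.5⁴ · 2. Smaller values of β make long walks contribute less, which is also what keeps the untruncated sum finite.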

Another popular global connectivity index is Average Commute Time (ACT), which calculates the average number of steps required by a random walker starting from node u to reach v and vice versa [1]. The ACT between nodes u and v can be calculated as (6) [26].

$$S_{A C T}(u, v)=\frac{1}{l_{u u}^{+}+l_{v v}^{+}-2 l_{u v}^{+}}$$
(6)

where \(l_{u u}^{+}\), \(l_{v v}^{+}\) and \(l_{u v}^{+}\) are the corresponding entries of \(L^{+}\), the pseudoinverse of the Laplacian matrix of the graph.
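As a small sanity check of (6), consider a hypothetical graph (not taken from the paper) that contains only two nodes u and v joined by a single edge. Its Laplacian matrix and the corresponding pseudoinverse are

$$L=\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}, \qquad L^{+}=\frac{1}{4}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix},$$

so that \(l_{u u}^{+}=l_{v v}^{+}=\frac{1}{4}\), \(l_{u v}^{+}=-\frac{1}{4}\) and

$$S_{A C T}(u, v)=\frac{1}{\frac{1}{4}+\frac{1}{4}-2 \cdot\left(-\frac{1}{4}\right)}=1.$$

Adding more nodes between u and v enlarges the denominator and therefore lowers the similarity, which matches the intuition that a random walker then needs more steps to commute between the two nodes.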

Obviously, a common drawback for global connectivity indices is relatively higher computation cost compared with local connectivity indices. Decentralized approaches or parallel computing are also incapable of dealing with global graph computation, because the structural connectivity would be damaged by splitting the graph for decentralized or parallel computing. Therefore, these measures are not suitable for large-scale connected graphs.

Generally speaking, the common advantages of structural similarity-based unsupervised learning methods are that (1) they do not need labelled data to train a classifier; (2) the link prediction result is explainable based on the definition of the corresponding indices; and (3) they avoid the computational effort of costly feature engineering and classifier training procedures.

However, there is still no universally feasible method to determine the appropriate threshold for different indices and application domains. Besides, these methods are also criticised for poor performance because they take only topological features into account and neglect the attributes of nodes and relationships, which contain rich domain knowledge and play critical roles in most domain-specific link prediction tasks. Therefore, in most cases, when labelled data is available, supervised learning methods are preferable because of their superior performance and the flexibility of feature extraction.

When applying supervised learning methods, both non-topological features and topological features can be fed into a machine learning classifier to support link prediction. Non-topological features refer to the attributes of entities and relationships, which contain rich multi-modality domain knowledge. For example, in an access control knowledge graph, the non-topological features of a user entity include sector, department name, job title, job description, etc. By contrast, topological features refer to graph structural features for node representation. In addition to the aforementioned local and global connectivity indices, common traditional node topological feature extraction methods in graph theory include Page Rank [6], Article Rank [19], Betweenness Centrality [5], Harmonic Centrality [21], etc. The performance of supervised machine learning methods for link prediction is determined by the capability of the extracted non-topological and graph topological features as well as the capability of the applied classifier.

3 Methodology

We propose a general knowledge graph empowered online learning framework for access control in this section. Firstly, we introduce the workflow of the framework. Then we detail the construction algorithm of an access control domain-specific knowledge graph and the KG-based topological feature extraction method.

3.1 Workflow of the proposed framework

The supporting information for the access control decision-making problem studied in this paper includes user attributes, resource attributes and a verified access control log file in chronological order. According to the cardinality of the categorical user attributes and resource attributes, an access control knowledge graph is constructed. The specific knowledge graph construction and refactoring algorithm is given in Section 3.2 and a real-world use case is demonstrated in Section 4.

Similar to our previous work [42], the proposed framework is essentially a classifier-agnostic consecutive incremental batch learning process for access control decision-making. Within this framework, a randomly initialized binary machine learning classifier works as the access control decision-maker, denoted as \(f_{{\varTheta }}^{(0)} (\cdot )\), where Θ is the trainable parameter set of f(⋅) and the superscript (0) denotes the initial time step. The classifier is updated at each time step as new samples become available for training. We demonstrate the main process of a typical time step t (t > 0) in Figure 1, which consists of two stages, namely, the predicting stage and the adaptation stage.

Fig. 1 Workflow of the proposed consecutive incremental batch learning framework

At the predicting stage of the t-th time step, when user u requests a resource r, denoted as \((u \to r)^{(t)}\), the classifier \(f_{{\varTheta }}^{(t)}(\cdot )\), which was updated at time step t − 1, makes a decision on the access control request \((u \to r)^{(t)}\). Firstly, six feature sets related to this access control request are extracted from the constructed access control knowledge graph, i.e., \(x^{(t)}_{u_{N}}\), \(x^{(t)}_{u_{T}}\), \(x^{(t)}_{r_{N}}\), \(x^{(t)}_{r_{T}}\), \(x^{(t)}_{(u \to r)_{N}}\) and \(x^{(t)}_{(u \to r)_{T}}\). Among them, \(x^{(t)}_{u_{N}}\), \(x^{(t)}_{r_{N}}\) and \(x^{(t)}_{(u \to r)_{N}}\) are the non-topological feature sets extracted from the user entity, the resource entity and the existing relationships between them. Similarly, \(x^{(t)}_{u_{T}}\), \(x^{(t)}_{r_{T}}\) and \(x^{(t)}_{(u \to r)_{T}}\) are the corresponding topological feature sets. The details of the feature extraction process are described in Section 3.3. These six feature sets are then preprocessed (outlier replacement and normalization) and integrated into one feature set \(x^{(t)}\). Finally, the classifier \(f_{{\varTheta }}^{(t)} (\cdot )\) makes a decision on the request \((u \to r)^{(t)}\) according to the result of equation (7).

$$\hat{y}^{(t)}= f_{{\varTheta}}^{(t)}(x^{(t)}), (t>0).$$
(7)

At the adaptation stage of the t-th time step, the verified ground truth \(y^{(t)}\) corresponding to the request \((u \to r)^{(t)}\) is available and can be extracted from the verified access control log file. Then, the labelled samples \(\{x^{(t)}, y^{(t)}\}\) can be used to fine-tune the classifier \(f_{{\varTheta }}^{(t)} (\cdot )\). The fine-tuned classifier, denoted as \(f_{{\varTheta }}^{(t+1)}(\cdot )\), will be used at the predicting stage of the (t + 1)-th time step. Since the classifier keeps updating with the latest verified samples \(\{x^{(t)}, y^{(t)}\}\) at each time step t, it can learn possible new concepts emerging at time step t.
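The two-stage loop can be sketched as a predict-then-update pass over the request stream. The snippet below is only an illustration under several assumptions: a tiny logistic model trained by stochastic gradient descent stands in for the classifier-agnostic \(f_{\varTheta}\), it starts from zero weights rather than a random initialization, each time step carries a single sample instead of a batch, and the one-dimensional toy stream is hypothetical.

```python
import math


class OnlineLogisticRegression:
    """Stand-in for the classifier f_Theta: one SGD step per labelled sample."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features  # zero initialization (assumption)
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else 0

    def update(self, x, y):
        # Gradient of the log-loss with respect to the linear score.
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def run_stream(model, stream):
    """Apply the predicting stage, then the adaptation stage, at every time step."""
    predictions = []
    for x_t, y_t in stream:
        predictions.append(model.predict(x_t))  # predicting stage, as in (7)
        model.update(x_t, y_t)                  # adaptation stage with verified y_t
    return predictions


# Hypothetical stream: feature +1 means approve (1), feature -1 means refuse (0).
stream = [([1.0], 1), ([-1.0], 0)] * 50
model = OnlineLogisticRegression(n_features=1)
preds = run_stream(model, stream)
print(preds[-4:])
```

Because every prediction is made before the corresponding label is used for training, the recorded predictions give an honest view of how the model would have behaved online.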

3.2 Access control knowledge graph construction

A knowledge graph consists of a set of entities (with multiple entity labels) and relationships between entities (with multiple relationship types). Each entity or relationship has its identification number and some of them have one or more properties. To construct an access control knowledge graph is to identify all entities including their labels and properties, and all relationships including their relationship types.

3.2.1 Attribute type

We construct an access control knowledge graph \(\mathcal {G}\) from the existing user attribute and resource attribute information. Let \(Att_u\) and \(Att_r\) denote the lists of user and resource attribute names, respectively. Apart from the ID attribute, from the perspective of constructing KGs, there are three kinds of attributes in \(Att_u\) and \(Att_r\): Type 1, attributes showing the relationships between users and resources; Type 2, high cardinality categorical attributes; Type 3, the remaining attributes. Let 𝜃 be a preset cardinality threshold. If the cardinality of a categorical attribute is larger than 𝜃, it is a Type 2 attribute; otherwise, it is a Type 3 attribute. The attribute types work as a guideline for step-by-step knowledge graph construction; see Section 3.2.2 for details.
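The attribute typing rule can be sketched as a small helper. Whether an attribute describes a user-resource relationship (Type 1) cannot be derived from its cardinality, so the sketch assumes that this is supplied as a flag; the function name, the flag and the sample columns are hypothetical.

```python
def attribute_type(values, theta, links_users_to_resources=False):
    """Return 1, 2 or 3 following the attribute typing rule described above."""
    if links_users_to_resources:
        return 1  # Type 1: relationship between users and resources
    cardinality = len(set(values))
    # Type 2 if the cardinality exceeds the threshold, otherwise Type 3.
    return 2 if cardinality > theta else 3


THETA = 300  # example threshold value

manager_ids = [f"m{i}" for i in range(500)]  # hypothetical: 500 distinct values
flags = ["Y", "N"] * 250                     # hypothetical: 2 distinct values

type_manager = attribute_type(manager_ids, THETA)
type_flag = attribute_type(flags, THETA)
type_access = attribute_type([["r1", "r2"]], THETA, links_users_to_resources=True)
print(type_manager, type_flag, type_access)
```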

3.2.2 Algorithm pseudocode

Let \(Att_u\) be the list of users’ attribute names, which includes the attribute name ‘userID’. \(X_u\) denotes the attribute values corresponding to \(Att_u\), and \(x_{uid} \in X_u\) is a vector containing all users’ IDs. Similarly, \(Att_r\) is the list of resources’ attribute names, which includes ‘resourceID’; \(X_r\) denotes the attribute values corresponding to \(Att_r\), and \(x_{rid} \in X_r\) is a vector containing all resources’ IDs. We elaborate on the construction process of an access control knowledge graph in Algorithm 1.

figure a

Some executive statements of the pseudocode in Algorithm 1 are written in Cypher query language, which is the graph query language for the Neo4j graph database. The naming convention thus follows the Cypher coding standards, where entity labels are in CamelCase; property keys are in camelCase and relationship types are in upper-case, such as FOLLOWS in a social media knowledge graph. As listed below, the process of access control knowledge graph construction can be divided into four main steps:

Step 1: create User and Resource entities as shown in lines 2-7. According to the userID and resourceID attributes, we create two entity types with a User label and a Resource label respectively. For each unique userID uid in \(x_{uid}\), we create a User entity with a userID property as shown in lines 2-4. Similarly, we create Resource entities with a resourceID property based on the rid in \(x_{rid}\) as shown in lines 5-7.

Step 2: create properties or relationships for User entities from user attributes as shown in lines 9-32. In line 9, attn refers to the attribute name traversing \(Att_u\); \(x_u\) is the corresponding attribute values and \(X_u - x_{uid}\) means the relative complement of \(x_{uid}\) in \(X_u\). In other words, \(X_u - x_{uid}\) means all attributes’ values except \(x_{uid}\). Steps 2.1-2.3 give details on how to create properties or relationships for User entities based on the three attribute types defined in Section 3.2.1.

Step 2.1: create relationships from User entities to Resource entities based on Type 1 attributes as shown in lines 11-19. When a user attribute attn indicates a relationship between users and resources, we search for the particular User entity and Resource entity based on the attribute value attv and create a HASattn relationship between them, where HASattn is the relationship type in upper-case format. We further record the important attribute information in attv as a property of the created relationship, named attn Property. In line 12, attv means the attribute value of attn corresponding to the user with userID=uid. In line 13, Ridt means a temporary resourceID set, named so as to distinguish it from \(x_{rid}\) in line 5.

Step 2.2: create new types of entities for high cardinality categorical user attributes as shown in lines 21-25. To better represent high cardinality categorical features, we create new entities with a label named attn and a property named attn Property to record the value of the high cardinality categorical user attribute. Then, we create a relationship with a type of HASattn to indicate that the user with userID=uid has a relation with the newly created entity.

Step 2.3: create new properties for User entities from the Type 3 attributes. The rest of user attributes are all added as the properties of the User entities as shown in lines 28-31.

Step 3: create properties or relationships for Resource entities from resource attributes as shown in lines 34-57. The process of Step 3 mirrors that of Step 2, so we do not repeat the details here.

Step 4: refactor the above-established access control knowledge graph as shown in line 61. We add a SHAREref relationship between User entities that share the same attn entities created in line 23. The subgraph consisting of SHAREref relationships and User entities can be used to extract topological features to represent the original user attribute \(attn \in Att_u\). Similarly, we also add a SHAREref relationship between Resource entities that share the same attn entities created in line 49 to facilitate the topological feature extraction from information provided by the original resource attribute \(attn \in Att_r\).

After the aforementioned four steps, an access control knowledge graph \(\mathcal {G}\) is established for topological feature extraction.
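The four steps can be approximated with a small in-memory sketch. It is not Algorithm 1 itself: it replaces the Cypher statements with plain Python containers, covers only the user side (Step 3 would mirror Step 2 for resources), omits the relationship property recorded in Step 2.1, and uses hypothetical attribute names, relationship names and sample rows.

```python
from collections import defaultdict
from itertools import combinations


def build_user_side_kg(user_rows, resource_ids, type1_attrs, type2_attrs):
    """Return (nodes, edges): nodes maps (label, id) -> properties, edges holds triples."""
    nodes, edges = {}, set()

    # Step 1: one User entity per userID and one Resource entity per resourceID.
    for row in user_rows:
        nodes[("User", row["userID"])] = {}
    for rid in resource_ids:
        nodes[("Resource", rid)] = {}

    members = defaultdict(set)  # Type 2 entity -> users linked to it
    for row in user_rows:
        user = ("User", row["userID"])
        for attn, attv in row.items():
            if attn == "userID":
                continue
            rel = "HAS_" + attn.upper()
            if attn in type1_attrs:
                # Step 2.1: attv lists resource IDs; link the user to known resources.
                for rid in attv:
                    if ("Resource", rid) in nodes:
                        edges.add((user, rel, ("Resource", rid)))
            elif attn in type2_attrs:
                # Step 2.2: promote the high cardinality value to its own entity.
                entity = (attn, attv)
                nodes.setdefault(entity, {attn + "Property": attv})
                edges.add((user, rel, entity))
                members[entity].add(user)
            else:
                # Step 2.3: keep the remaining attributes as User properties.
                nodes[user][attn] = attv

    # Step 4: connect users that share the same Type 2 entity.
    for (attn, _), group in members.items():
        for a, b in combinations(sorted(group), 2):
            edges.add((a, "SHARE_" + attn.upper(), b))
    return nodes, edges


# Hypothetical sample rows.
users = [
    {"userID": "u1", "resourceList": ["r1", "r2"], "manager": "m1", "region": "EU"},
    {"userID": "u2", "resourceList": ["r2"], "manager": "m1", "region": "US"},
    {"userID": "u3", "resourceList": [], "manager": "m2", "region": "EU"},
]
nodes, edges = build_user_side_kg(users, ["r1", "r2"], {"resourceList"}, {"manager"})
print(len(nodes), len(edges))
```

In this toy run, users u1 and u2 share manager m1 and are therefore joined by a SHARE_MANAGER edge, which is the kind of user-to-user subgraph that topological features are later extracted from.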

3.3 Feature extraction for access control

To train the classifier \(f_{{\varTheta }}(\cdot )\) for access control, we use the log file containing access control requests and their corresponding verified decisions (approval or refusal) to form labelled samples. Specifically, for each request from user u to resource r at time step t, denoted as \((u \to r)^{(t)}\), six sets of features can be extracted from the access control knowledge graph \(\mathcal {G}\) constructed with Algorithm 1, namely, \(x^{(t)}_{u_N}\), \(x^{(t)}_{u_T}\), \(x^{(t)}_{r_N}\), \(x^{(t)}_{r_T}\), \(x^{(t)}_{(u \to r)_N}\) and \(x^{(t)}_{(u \to r)_T}\), as shown in Figure 1.

Among them, \(x^{(t)}_{u_N}\), \(x^{(t)}_{r_N}\) and \(x^{(t)}_{(u \to r)_N}\) are the non-topological feature sets extracted from the User entity u, the Resource entity r and the existing relationships between them, \(u \to r\). These three non-topological feature sets can be exported from the properties of the entities u and r and of the relationships \(u \to r\).

By contrast, \(x^{(t)}_{u_T}\), \(x^{(t)}_{r_T}\) and \(x^{(t)}_{(u \to r)_T}\) are the corresponding topological feature sets. \(x^{(t)}_{u_T}\) is extracted from a subgraph of the constructed access control knowledge graph \(\mathcal {G}\) which consists of User entities and relationships between them. Similarly, \(x^{(t)}_{r_T}\) is extracted from a subgraph containing Resource entities and relationships between them. Both \(x^{(t)}_{u_T}\) and \(x^{(t)}_{r_T}\) are topological features extracted to represent the entities. The extracted topological features include but are not limited to (1) centrality scores which determine the importance of distinct nodes in a graph, such as page rank scores and betweenness scores; (2) community detection scores which indicate how groups of nodes are clustered or partitioned, as well as their tendency to strengthen or break apart, such as the weakly connected component id and triangle count of an entity. \(x^{(t)}_{(u \to r)_T}\) is extracted from a subgraph containing User entities, Resource entities and relationships between User and Resource entities. \(x^{(t)}_{(u \to r)_T}\) is used to represent the closeness of entities u and r based on the graph with relationships between u and r. The possible features of \(x^{(t)}_{(u \to r)_T}\) include but are not limited to Adamic Adar scores and common neighbours.

4 Experiment results

This section first introduces a real-world access control dataset and then provides a use case of the access control knowledge graph construction algorithm (Algorithm 1) on this dataset. Finally, we compare the access control performance achieved with topological features extracted from the established knowledge graph against that achieved with non-topological features. Results show that the proposed knowledge graph empowered method outperforms non-topological methods in both offline and online scenarios.

4.1 Dataset

The experiments of this work are conducted on an open-source real-world Amazon employee access dataset (Footnote 1). The dataset contains a file listing all user and resource attributes and a time-series log file containing 684,374 user-to-resource access control requests and the corresponding permission records. The dataset is extremely imbalanced, with 10,911 (1.59%) access rejections and 673,463 (98.41%) access approvals. The dynamic data imbalance status is shown in Figure 2. Subplot (a) shows the overall imbalance factor of the refused requests and approved requests. The overall imbalance factor of the refused requests gradually converges to 1.59% after fluctuating in the early stages, and that of the approved requests converges to 98.41%. Subplot (b) shows the sliding window imbalance factor [40] of the two classes when the sliding window size is set to 100.
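The two curves described for Figure 2 can be approximated with the following sketch. It assumes that the imbalance factor of a class is simply the fraction of samples belonging to that class, either over all samples seen so far or over the most recent window; the exact definition in [40] may differ. The function name and the short toy label sequence are hypothetical.

```python
from collections import deque


def imbalance_factors(labels, minority=0, window=100):
    """Return (overall, sliding) minority-class fractions after each sample."""
    overall, sliding = [], []
    recent = deque(maxlen=window)  # keeps only the latest `window` indicators
    seen_minority = 0
    for t, y in enumerate(labels, start=1):
        is_minority = int(y == minority)
        seen_minority += is_minority
        recent.append(is_minority)
        overall.append(seen_minority / t)
        sliding.append(sum(recent) / len(recent))
    return overall, sliding


# Hypothetical stream: 1 = approved, 0 = refused.
labels = [1, 1, 0, 1, 1, 1, 1, 1]
overall, sliding = imbalance_factors(labels, minority=0, window=4)
print(overall[-1], sliding[-1])
```

The overall fraction changes slowly once many samples have been seen, whereas the windowed fraction reacts quickly to local bursts of refused requests, which is why the two views are reported separately.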

Fig. 2 Dynamic data imbalance statuses over time

Table 1 lists the basic information of the attribute file. As shown in Table 1, there are 36,063 unique users and 33,252 unique resources. We set the cardinality threshold 𝜃 = 300. Based on the three attribute types defined in Section 3.2.1, the corresponding attribute type is listed in the Type column. The type information is used to guide the knowledge graph construction, which is described in Section 4.2. For the Type 1 attribute, the cardinality is not applicable (NP).

Table 1 Dataset information

4.2 Access control knowledge graph construction

According to the dataset introduced in Section 4.1, we construct an access control knowledge graph following the steps in Algorithm 1. The main process and intermediate knowledge graph construction results are summarised in Table 2.

Table 2 Usecase of Algorithm 1

In Step 1, based on the user attribute PERSON_ID, 36,063 User entities with a userID property are created, and 33,252 Resource entities with a resourceID property are created using the RESOURCE_ID attribute.

In Step 2, more entities, relationships and properties are created based on the three types of user attributes. Specifically, in Step 2.1, a HAS_P_ACCESS relationship is created between User and Resource entities to show the possibility of access requests based on the Type 1 RESOURCE_LIST attribute. In Step 2.2, three Type 2 attributes, namely, MGR_ID, DEPTNAME and BUSINESS_TITLE, are used to create three types of entities. The newly created entity labels are Manager, Department and Title respectively. The attribute values are added as the entity properties as shown in Table 2. The notation m.managerID means that we created a managerID property for Manager entities. Similarly, d.deptID and t.titleID are properties added to Department and Title entities respectively. Furthermore, a HAS_MANAGER relationship is created between User and Manager entities. Similarly, HAS_DEPT and HAS_TITLE relationships are created between User and Department entities and between User and Title entities respectively. Finally, in Step 2.3, seven Type 3 attributes are added as the properties of User entities, as shown in Table 2.

In Step 3, only a Type 3 attribute, RESOURCE_TYPE, is available and we add a resourceType property to Resource entities.

In Step 4, a SHARE_P_USER relationship is created between two Resource entities that have a HAS_P_ACCESS relationship with the same User entity. Similarly, three relationships, namely, SHARE_MANAGER, SHARE_DEPT and SHARE_TITLE, are created respectively between two User entities that have HAS_MANAGER/HAS_DEPT/HAS_TITLE relationships with the same Manager/Department/Title entity.

Finally, an access control knowledge graph \(\mathcal {G}\) is constructed based on the Amazon access control dataset. The data model (schema) of \(\mathcal {G}\) is illustrated in Figure 3. A circle represents a type of entity with a bold label inside. Below the label is the total number of entities with that label. The properties of the corresponding entities are also listed inside the circle. An arrow represents a directed relationship. We also specify the relationship type and the total number of relationships along the arrow.

Fig. 3 The data model of the constructed access control knowledge graph \(\mathcal {G}\)

4.3 Feature extraction

We implement the access control knowledge graph \(\mathcal {G}\) on the Neo4j (Footnote 2) graph data platform, which provides a convenient way to extract both topological and non-topological features from the existing entities, relationships and subgraphs of a knowledge graph. The topological features adopted in this work are implemented with the Neo4j Graph Data Science Library (Footnote 3).

For an access control request from a user u to a resource r, six feature sets, i.e., \(x_{u_N}\), \(x_{u_T}\), \(x_{r_N}\), \(x_{r_T}\), \(x_{(u \to r)_N}\), \(x_{(u \to r)_T}\) are extracted from the User entities, Resource entities and their relationships. Table 3 presents our feature extraction strategies in detail. As shown in the first row, \(x_{u_N}\) is extracted from the properties of the User entity u. \(x_{u_T}\) represents the topological features extracted from subgraphs containing User entities and the relationships between them, as shown in the second row. The listed features are extracted to represent the importance or connectivity characteristics in the subgraphs. Similarly, \(x_{r_N}\) and \(x_{r_T}\) are the non-topological and topological features of Resource entity r. In this use case, no properties are added to the relationship SHARE_P_USER; therefore, \(x_{(u \to r)_N}\) is empty. We select two link prediction topological features, i.e., preferentialAttachment (Footnote 4) and totalNeighbor (Footnote 5), to represent \(x_{(u \to r)_T}\) in this work.

Table 3 Feature extraction details

4.4 Evaluation metrics

For classification problems, based on the true labels and the predicted results, samples in the test set can be classified as:

  • True Positives (TP): the correctly predicted positive samples, which means both the true label and the predicted result of these samples are a positive class.

  • True Negatives (TN): the correctly predicted negative samples, which means both the true label and the predicted result of these samples are a negative class.

  • False Positives (FP): the incorrectly predicted positive samples, which means the true label is negative, but the predictive result is positive.

  • False Negatives (FN): the incorrectly predicted negative samples, which means the true label is positive, but the predictive result is negative.

We evaluate the classification performance of the proposed approach with four well-known metrics, namely, accuracy, precision, recall and F1 score. Their definitions are given in (8)-(11).

$$Accuracy = \frac{TP+TN}{TP+FP+FN+TN}.$$
(8)
$$Precision = \frac{TP}{TP+FP}.$$
(9)
$$Recall = \frac{TP}{TP+FN}.$$
(10)
$$F1\ Score = 2\times\frac{Recall\times Precision}{Recall+Precision}.$$
(11)
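Equations (8)-(11) can be checked numerically with a short sketch such as the one below. The function name and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the text does not specify.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 score following Eqs. (8)-(11).

    Assumption: a metric with a zero denominator is reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical confusion counts, for illustration only.
m = classification_metrics(tp=40, tn=30, fp=10, fn=20)
print(m)
```

With these counts, accuracy is 70/100, precision is 40/50, recall is 40/60, and the F1 score is 8/11. To report a metric for class 0, as done later in this section, one would presumably treat class 0 as the positive class when counting TP, FP and FN.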

4.5 Offline learning performance comparison

To verify the effectiveness of the proposed knowledge graph empowered framework, we compare the access control decision-making performance obtained with topological features extracted from the established knowledge graph against that obtained with non-topological features from the original user and resource attributes, in both online and offline scenarios.

First, we verify the performance improvement brought by topological features on five different classifiers, i.e., Gaussian naive Bayes (GNB), logistic regression (LR), neural network (NN), random forest (RF), and support vector machine (SVM). We use the scikit-learn library (Footnote 6) to implement these classifiers. Considering the importance of class 0 (request rejection) in access control problems, Table 4 reports results for both class 0 and the macro average over class 1 and class 0. The results in Table 4 are obtained on a balanced dataset consisting of all negative samples of the original Amazon dataset introduced in Section 4.1 and the same number of positive samples randomly selected from the original dataset. Both the negative and positive samples keep the same order as in the original dataset.
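The construction of the balanced dataset can be sketched as follows: keep every negative sample, draw the same number of positive samples at random, and sort the selected indices so that the original order is preserved. The function name, the toy label list and the fixed random seed are assumptions made for illustration; the text does not state how the random selection was seeded.

```python
import random


def balanced_indices(labels, negative=0, seed=0):
    """Indices of all negatives plus an equal number of random positives.

    The returned indices are sorted, so the selected samples keep the
    order they had in the original dataset.
    """
    neg = [i for i, y in enumerate(labels) if y == negative]
    pos = [i for i, y in enumerate(labels) if y != negative]
    rng = random.Random(seed)  # assumed seed, for reproducibility only
    sampled_pos = rng.sample(pos, k=min(len(neg), len(pos)))
    return sorted(neg + sampled_pos)


# Hypothetical toy labels: 1 = request granted, 0 = request rejected.
labels = [1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
idx = balanced_indices(labels)
balanced = [labels[i] for i in idx]
print(idx, balanced)
```

Sampling indices rather than rows keeps the sketch independent of how the features are stored; the same index list can be applied to both the topological and the non-topological feature matrices so that the two settings are compared on identical samples.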

Table 4 Performance comparison of different classifiers on offline scenario

Although accuracy (Acc) is the most widely used metric for evaluating classification models, it is only reliable on balanced datasets. For severely imbalanced datasets, accuracy can be misleading. Since the F1 score is an evaluation metric that combines two competing metrics, i.e., precision (Pre) and recall (Rec), we mainly discuss the F1 score when comparing the performance of topological and non-topological features. Δ F1 is the relative change in F1 score from non-topological to topological features, calculated by Equation (12).

$${\Delta} \text{F1}= \frac{F1_{\text{Topo}}- F1_{\text{Nontopo}}}{F1_{\text{Nontopo}}} \times 100\%, (t>0).$$
(12)

As shown in Table 4, the RF classifier achieves the best performance on all metrics with topological features. Using topological features extracted from the access control knowledge graph increases the F1 score on class 0 from 70.08% to 73.51%, a relative increase of 4.89%. The macro average F1 score also improves by 4.21% with topological features. The performance of the NN and LR classifiers is likewise boosted on both the macro average and class 0. However, for the GNB classifier, the macro average F1 score improves from 45.43% to 60.43% at the cost of a decrease in the class 0 F1 score from 66.49% to 58.59%. This means that, with the GNB classifier, topological features improve the performance on class 1 but degrade it on class 0. By contrast, the SVM classifier increases the F1 score of class 0 from 59.92% to 66.69%, but its macro average F1 score decreases from 61.04% to 35.21%. Overall, topological features improve access control performance in the offline learning scenario, although the gains are not uniform across classifiers.
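As a quick check of Equation (12), the class 0 F1 scores of 70.08% and 73.51% quoted above give

$${\Delta} \text{F1} = \frac{73.51 - 70.08}{70.08} \times 100\% = \frac{3.43}{70.08} \times 100\% \approx 4.89\%.$$

Note that Δ F1 is a relative change, not a difference in percentage points: the absolute gain here is 3.43 points. Applying the same formula to the GNB class 0 scores yields a negative value,

$${\Delta} \text{F1} = \frac{58.59 - 66.49}{66.49} \times 100\% \approx -11.88\%.$$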

To further verify the effectiveness of topological features under different degrees of data imbalance, we use the RF classifier, which performs best in Table 4, to conduct offline experiments with different class proportions, as shown in Table 5. Consistent with Table 4, topological features improve the access control performance on both the macro average and the minority class (class 0) at different degrees of imbalance. As the data imbalance increases, the performance of the algorithm gradually deteriorates, but the results remain much better than a random decision. Specifically, topological features improve the macro average F1 score by 4.60%, 1.30% and 1.29% when class 0 accounts for 30%, 10% and 1.59% (the original dataset) of the dataset, respectively. Furthermore, topological features are especially effective on class 0, where the corresponding increases are 10.30%, 5.28% and 33.83%.

Table 5 Offline learning performance comparison on different data imbalance statuses

4.6 Online learning performance comparison

We also conduct online learning experiments under different degrees of data imbalance to verify the effectiveness of topological features in improving access control performance. Table 6 shows the overall performance comparison results. The time step size is set to 1/1000 of the dataset size. Topological features improve the macro average F1 score by 2.37%, 2.7% and 1.45% when class 0 accounts for 30%, 10% and 1.59% (the original dataset) of the dataset, respectively. In particular, topological features are especially effective on class 0, where the corresponding increases are 7.28%, 17.10% and 24.31%.
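The role of the time step can be illustrated with the sketch below, which splits a data stream into steps of about 1/1000 of its length and scores each step before learning from it. This test-then-train protocol is an assumption (a common choice for online evaluation), as this section states only the step size. The `MajorityClassifier` is a hypothetical stand-in learner, not the model used in the experiments.

```python
def time_steps(n_samples, n_steps=1000):
    """Yield (start, end) index pairs, each about 1/n_steps of the data."""
    step = max(1, n_samples // n_steps)
    for start in range(0, n_samples, step):
        yield start, min(start + step, n_samples)


class MajorityClassifier:
    """Hypothetical stand-in: predicts the most frequent label seen so far."""

    def __init__(self, default=1):
        self.counts = {}
        self.default = default

    def predict(self, xs):
        if self.counts:
            label = max(self.counts, key=self.counts.get)
        else:
            label = self.default
        return [label for _ in xs]

    def partial_fit(self, xs, ys):
        for y in ys:
            self.counts[y] = self.counts.get(y, 0) + 1


def prequential_accuracy(model, X, y, n_steps=1000):
    """Assumed test-then-train loop: score a step, then learn from it."""
    scores = []
    for start, end in time_steps(len(y), n_steps):
        preds = model.predict(X[start:end])
        correct = sum(p == t for p, t in zip(preds, y[start:end]))
        scores.append(correct / (end - start))
        model.partial_fit(X[start:end], y[start:end])
    return scores


# Hypothetical stream: 1500 granted requests followed by 500 rejections.
X = list(range(2000))
y = [1] * 1500 + [0] * 500
scores = prequential_accuracy(MajorityClassifier(), X, y)
print(len(scores), scores[0], scores[-1])
```

The per-step scores form the kind of real-time curve that an online comparison would plot; in practice the accuracy computation would be replaced by the per-class and macro average metrics of Section 4.4.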

Table 6 Overall performance comparison of online learning

Figure 4 shows the real-time macro average performance comparison of online learning. The red lines compare the access control performance with topological and non-topological features on a dataset with class 0 : class 1 = 3:7. Similarly, the brown lines and blue lines present the results on the original dataset and on a dataset with class 0 : class 1 = 1:9, respectively. Figure 4 demonstrates that topological features can improve the overall F1 score without decreasing the accuracy.

Fig. 4 The real-time macro average performance comparison of online learning

Similarly, Figure 5 shows the real-time performance comparison of online learning on class 0 (the minority class). Although the trends are the same as in Figure 4, the improvements are larger in Figure 5.

Fig. 5 The real-time performance comparison of online learning on class 0

4.7 Discussion

The results in Tables 5 and 6 demonstrate the effectiveness of topological features in improving access control performance in both offline and online scenarios. However, for privacy and security reasons, the Amazon access control dataset only provides 12 categorical user attributes and 2 resource attributes. These attributes use ID numbers to distinguish different values in order to prevent sensitive data leakage. It is very challenging to achieve high predictive performance without text attributes that provide rich semantic information for mining. Therefore, both the overall performance and the minority class performance are still unsatisfactory.

In fact, the problem of data insufficiency, especially the lack of attribute information, is common in access control. ABAC rule mining algorithms also suffer from poor overall performance caused by the low quality of available real-world access control datasets. For example, the work in [9] proposed an iterative rule mining algorithm, named Rhapsody, to automatically mine ABAC rules from sparse logs and prevent over-permissiveness. They reported the F1 scores of five ABAC rule-based algorithms, including Rhapsody, on the same Amazon dataset as ours. The reported F1 scores range from 0.01 to 0.35, which is comparable to the performance of our method. However, they only chose the eight most requested resources and their corresponding requests to form eight instances for algorithm evaluation, instead of evaluating the algorithm on the whole log file as we do. Therefore, the generalization performance of their algorithm is not guaranteed.

5 Conclusion

To better encode high cardinality categorical user and resource attributes and improve machine learning based access control performance, we proposed a knowledge graph empowered online learning framework for access control decision-making. By transforming tabular user and resource attributes into a comprehensive knowledge graph, we extracted topological features from the established knowledge graph to represent users and resources. Experimental results show that topological features outperform non-topological features encoded by the binary encoding method in both online and offline settings.

In the future, we will seek datasets with rich user and resource attributes to further verify the proposed knowledge graph based online learning framework. Furthermore, more deep learning based embedding algorithms will be explored to extract high-level entity and relationship features and further improve the access control performance.