Introduction

The term “ontology” originated in philosophy, where it denotes “the essence of things in themselves”, i.e., the abstract nature of real things. Others describe an ontology as “a definition of the relevant terms in a field, together with their associations and the rules governing the extension of the domain vocabulary”. Ontology theory was first introduced into computer science through artificial intelligence, and was later adopted widely across the computing field, with more and more domain experts proposing their own definitions. At present, the most widespread and popular definition is “an explicit formal specification of a shared conceptualization”, whose meaning comprises four aspects:

  • sharing, meaning that people agree on the expression of the ontology;

  • conceptualization, meaning that an ontology is an abstract expression of the real world;

  • explicit, meaning that concepts and conceptual relationships are accurately and clearly defined;

  • formal, meaning that concepts and relationships in the ontology are described in machine-recognizable form.

Domain ontology is one of the ontology types obtained when ontologies are classified according to their degree of domain dependence. Specifically, ontologies can be divided into four categories: top-level ontology, domain ontology, task ontology, and application ontology. Since an ontology can be regarded as a structured collection of concepts, the inter-relationships between concepts and the structural features of the ontology are the essential problems in its various applications, and thus semantic similarity calculation between ontology concepts becomes the core of different ontology algorithms.

As a hot topic in computer science and information technology, ontology algorithms have long been central to data retrieval and image analysis, and are applied in frontier fields such as big data, the Internet of Things, and deep learning. Subramaniyaswamy et al. [1] provided a personalized food recommender system for an IoT-based healthcare system, in which ontologies bridge the gap between descriptions and heterogeneous user profiles. Mohammadi et al. [2] examined ontology alignment systems using statistical inference, recommending techniques such as the Wilcoxon signed-rank and asymptotic tests on the basis of their statistical safety and robustness in different settings. Morente-Molinera et al. [3] suggested an approach that uses sentiment analysis procedures to automatically obtain fuzzy ontologies, employing multi-granular fuzzy linguistic modelling to choose the optimal representation for storing information in the fuzzy ontology. Sacha et al. [4] proposed the ontology VIS4ML to describe and understand existing VA workflows used in machine learning. Kamsu-Foguem et al. [5] pointed out that different flows and their combinations can be handled by means of semantic Web concepts and conceptual graph theories, which permit rules to be embedded to improve reasoning. Janowicz et al. [6] focused on how to make full use of SOSA, including integration with the new release of the SSN ontology. Sulthana and Ramasamy [7] proposed a neuro-fuzzy classification approach based on fuzzy rules, in which an ontology provides a systematic and hierarchical methodology for managing context. To deal with syntactic evolution in the sources, Nadal et al. [8] introduced a technique by which the ontology is adapted semi-automatically upon new releases. Scarpato et al. [9] presented the reachability matrix ontology to describe networks and the cybersecurity domain, and then to compute the reachability matrix. Karimi and Kamandi [10] proposed an Inductive Logic Programming (ILP) based ontology learning algorithm and used it to solve the ontology mapping problem.

Various ontology algorithms are widely employed in different engineering fields. Koehler et al. [11] introduced the expansion of the Human Phenotype Ontology (HPO). Chhim et al. [12] presented an effective product design and manufacturing process based ontology for manufacturing knowledge reuse. Ali et al. [13] presented a consensus-based Additive Manufacturing Ontology (AMO) and showed how to use it to promote re-usability in dentistry product manufacturing. Neveu et al. [14] proposed the open-source Phenotyping Hybrid Information System (PHIS), whose ontology-driven architecture builds relationships between objects and enriches datasets with knowledge and metadata. Kiefer and Hall [15] updated gene ontology analysis to stimulate further research and possible treatment. Jaervenpaeae et al. [16] described the systematic development process of an OWL-based manufacturing resource capability ontology and the manufacturing resource capabilities it describes. Serra et al. [17] demonstrated a proof of concept for leveraging the built-in logical axioms of an ontology to classify patient surface marker data into appropriate diagnostic classes. Di Noia et al. [18] proposed structuring the knowledge associated with NFRs as a fuzzy ontology for tool-supported decision making in architectural design. Ledvinka et al. [19] described the implementation of an ontology-based information system for aviation safety data integration. Aameri et al. [20] proposed an ontology for specifying shapes, parthood and connection in mechanical assemblies, such that the constraints of feasible configurations can be logically expressed and used during generative design.

With the proliferation of ontology processing concepts, machine learning algorithms have been applied to ontology similarity calculation (some ontology learning approaches can be found in Gao et al. [21–23] and [24]). Among them, the ontology learning algorithm in the multi-dividing setting has proved to be more efficient for similarity calculation under tree-shaped ontology structures (see Gao et al. [25], Gao and Farahani [26], Wu et al. [27], Sangaiah et al. [28] and Gao and Xu [29] for more details). Since the engineering accuracy of the multi-dividing ontology learning algorithm has been confirmed by different ontology applications, this paper does not focus on experimental results of the algorithm on particular ontology data, but takes a statistical point of view: the approximation property of the multi-dividing ontology learning algorithm in a special expression setting is established.

In recent years, cloud computing has received widespread attention, and the number and types of cloud services it provides have been increasing year by year (see Bryce [30] and Song et al. [31]). Scholars are considering how to quickly find the cloud services that users need and provide them to users effectively (see Dimitri [32]). Traditional cloud service discovery is based on keyword search, so query results contain irrelevant information. At the same time, owing to the defects of keyword matching, related services are easily missed, mainly because traditional cloud query services lack a keyword concept expansion function. To solve these problems in existing cloud services, cloud ontology-based semantic networks can provide users with more accurate cloud services for different information needs. Applying ontologies to cloud computing and cloud services is therefore a promising direction, which encourages us to design specific ontology algorithms for cloud ontologies according to specific cloud service requirements (see Sangaiah et al. [33] and [34]).

The rest of the paper is organised as follows: first we introduce the setting of ontology learning, and in particular multi-dividing ontology learning; then the main theoretical result and its detailed proof are presented; finally, we report two experiments on university and mathematical ontology data to demonstrate the efficiency of the algorithm.

Ontology learning problem

Throughout our paper, we use a graph to represent the structure of an ontology. The vertices of the graph represent the concepts of the ontology, while the edges between vertices express a direct subordinate relationship (or inclusion relationship, affiliation relationship) between two concepts. To facilitate the mathematical representation of the ontology learning setting, we first need to process and standardize the ontology data so that it meets the requirements of the later mathematical formulation.

First of all, we numerically encode the semantic information, knowledge background, structure, instances, attributes and classification information corresponding to a concept, and then encapsulate them in a fixed-dimensional vector. Through suitable technical means, we can unify the dimensions of the vectors corresponding to all ontology concept vertices, and specify that the same type of information occupies the same component positions in every vector. In this way, the ontology information is represented in the corresponding vector space, and thus the processing and calculation of ontology data can be transformed into the processing and calculation of multi-dimensional vectors. In what follows, where no confusion arises, we use \(v=(v_{1},\cdots,v_{p})\in \Bbb R^{p}\) to simultaneously denote an ontology concept, the vertex in the ontology graph corresponding to that concept, and the vector corresponding to this vertex.

As a conceptual model, the main tasks of an ontology are concept management and information mining. Therefore, the similarity calculation between concepts is the core of ontology applications in various engineering fields. Specifically, for ontology vertices v1 and v2, it is necessary to characterize the measure sim(v1,v2). Since the vertices are denoted by vectors, the similarity between vertices can be regarded as the similarity between two vectors in high-dimensional space.

A learning technique based on the dimension descent method maps each ontology vector to a real number, thereby mapping the entire ontology graph to the real axis; the similarity between ontology vertices is then obtained from their one-dimensional distance on the real axis. Specifically, let \(f:\Bbb R^{p}\to \Bbb R\) be an ontology function that maps ontology concept vectors to real numbers. The similarity between two ontology vertices v1 and v2 is measured by |f(v1)−f(v2)|: the larger this value, the smaller the similarity between v1 and v2; conversely, the smaller the value of |f(v1)−f(v2)|, the larger the similarity between the two vertices.
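
To make this dimension-descent idea concrete, the following minimal Python sketch computes |f(v1)−f(v2)| and turns it into a similarity score; the linear score function and the monotone conversion to a value in (0,1] are our own illustrative assumptions, not part of the formal setting.

```python
import numpy as np

def score(v, beta):
    # Hypothetical linear ontology function f(v) = <v, beta>.
    return float(np.dot(v, beta))

def similarity(v1, v2, beta):
    # One-dimensional distance on the real axis: the smaller
    # |f(v1) - f(v2)| is, the more similar the two vertices are.
    d = abs(score(v1, beta) - score(v2, beta))
    return 1.0 / (1.0 + d)  # an assumed monotone conversion to (0, 1]

# toy ontology concept vectors with p = 4
v1 = np.array([0.2, 0.1, 0.7, 0.0])
v2 = np.array([0.3, 0.1, 0.6, 0.1])
beta = np.array([1.0, -0.5, 2.0, 0.0])
print(similarity(v1, v2, beta))
```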

Therefore, in the standard ontology learning setting, the ontology procedure can be described as follows. Let \(V\subseteq \Bbb R^{p}\) (p≥1) be an instance space (also called the input space) for the ontology graph, where the vertices in V are drawn independently at random from a certain unknown distribution \(\mathcal {D}\). The aim of an ontology learning algorithm is to infer an optimal ontology function \(f:V\to \Bbb R\) from a given ontology training set S={v1,⋯,vn} of size n.

Multi-dividing ontology setting

The framework of the multi-dividing ontology learning algorithm is based on the fact that most ontology graph structures are trees (acyclic graphs). The tree forms several branches below the topmost vertex, and if we classify all ontology concepts using a classification algorithm, we find that the vertices in each branch correspond exactly to one class of the vertex classification. This means that the similarity between vertices of the same branch is higher than the similarity between vertices from different branches. After mapping all ontology vertices to the real axis, it can be observed that the real numbers corresponding to the vertices of the same branch exhibit an aggregation effect on the one-dimensional axis (the vertices of each branch form a one-dimensional cluster on the axis). In light of this observation, we have reason to assume that the real numbers corresponding to the vertices of the same branch lie in the same interval of the real axis. We can imagine cutting the entire real axis into k intervals (here k is the number of branches under the top vertex of the ontology graph), with all the vertices of each branch falling within one interval. In the following, we always assume k, a, b are positive integers.

Now, we formally describe the multi-dividing ontology algorithm. All the ontology vertices are divided into k parts corresponding to the k branches in the ontology graph, and we assign rate numbers 1,2,⋯,k to these k parts of vertices (the rate values of all parts are determined by domain experts with deep domain knowledge of the ontology in the given engineering application). We require that f(va)>f(vb), where f is an ontology function, va belongs to the rate-a vertices, vb belongs to the rate-b vertices, and 1≤a<b≤k. That is, under the target ontology function, the value of a high-rate vertex is larger than the value of a low-rate vertex.

Correspondingly, the ontology training sample in the multi-dividing ontology setting is denoted by \(S=(S_{1},S_{2},\cdots,S_{k})\in V^{n_{1}}\times V^{n_{2}}\times \cdots \times V^{n_{k}}\), which consists of a sequence of training samples \(S_{a}=(v_{1}^{a},\cdots,v_{n_{a}}^{a})\) belonging to rate a (here 1≤a≤k). The ontology learner is given such an ontology sample S and aims to learn a real-valued ontology score function \(f:V\to \Bbb R\) (or \(f:\Bbb R^{p}\to \Bbb R\)) such that the values of the Sa vertices are larger than the values of the Sb vertices whenever 1≤a<b≤k. Suppose that the vertices in each Sa (here 1≤a≤k) are drawn independently at random from a certain unknown distribution \(\mathcal {D}_{a}\) on the instance space V. On the other hand, since each vertex \(v_{i}^{a}\) or \(v_{j}^{b}\) is a p-dimensional vector, we set \(v_{i}^{a}=\left (\left (v_{i}^{a}\right)_{1},\cdots,\left (v_{i}^{a}\right)_{p}\right)\) and \(v_{j}^{b}=\left (\left (v_{j}^{b}\right)_{1},\cdots,\left (v_{j}^{b}\right)_{p}\right)\) with i∈{1,⋯,na} and j∈{1,⋯,nb}.

Let I(·) be the binary truth function (also called the 0–1 function or 0–1 loss). Then the ontology learning algorithm under the area under the receiver operating characteristic curve (AUC) criterion can be formulated as

$$ \widehat{A}(f,S)=\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\frac{1}{n_{a}n_{b}}\sum_{i=1}^{n_{a}}\sum_{j=1}^{n_{b}}I(f(v_{i}^{a})>f(v_{j}^{b})). $$
(1)
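
As an illustration, the following Python sketch evaluates criterion (1) on a toy multi-dividing sample; the sample layout (a list of per-rate arrays) and the score function are assumptions made for this example only.

```python
import numpy as np

def empirical_A(f, S):
    # Criterion (1): for every rate pair a < b, the fraction of cross-rate
    # vertex pairs ranked in the correct order, summed over all pairs (a, b).
    total = 0.0
    k = len(S)
    for a in range(k - 1):
        for b in range(a + 1, k):
            fa = np.array([f(v) for v in S[a]])  # scores of rate-(a+1) sample
            fb = np.array([f(v) for v in S[b]])  # scores of rate-(b+1) sample
            total += np.mean(fa[:, None] > fb[None, :])  # I(f(v_i^a) > f(v_j^b))
    return total

# toy sample: k = 3 rates, p = 2, five vertices per rate
rng = np.random.default_rng(0)
S = [rng.normal(loc=mu, size=(5, 2)) for mu in (2.0, 1.0, 0.0)]
f = lambda v: v.sum()      # a hypothetical ontology score function
print(empirical_A(f, S))   # bounded above by C(3, 2) = 3
```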

Here we need to explain and state the following points: (1) The optimal ontology function is obtained by maximizing \(\widehat {A}(f,S)\). (2) The standard multi-dividing learning algorithm can be stated as

$$\begin{aligned} \widehat{A}(f,S)=\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\frac{1}{n_{a}n_{b}}\sum_{i=1}^{n_{a}}\sum_{j=1}^{n_{b}}\Big\{I\left(f\left(v_{i}^{a}\right)>f\left(v_{j}^{b}\right)\right)\\ +\frac{1}{2}I\left(f\left(v_{i}^{a}\right)=f\left(v_{j}^{b}\right)\right)\Big\}. \end{aligned}$$

Clearly, our ontology framework omits the \(\frac {1}{2}I(f(v_{i}^{a})=f(v_{j}^{b}))\) term in each of the accumulated items. (3) The expected ontology model of (1) is denoted by

$$ A(f)=\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\Bbb E_{V_{a}\sim \mathcal{D}_{a},V_{b}\sim \mathcal{D}_{b}}I(f(V_{a})>f(V_{b})). $$
(2)

Recall that in the ontology sparse vector setting, the ontology function can be concretely represented as

$$ f(v)=\sum_{t=1}^{p}v_{t}\beta_{t}+\beta_{0}=v\beta^{T}+\beta_{0}, $$
(3)

where β=(β1,⋯,βp) is an ontology sparse vector, most of whose components are supposed to be zero, and β0 is an offset term. In many circumstances, we ignore β0 and consider \(f(v)=\sum _{t=1}^{p}v_{t}\beta _{t}\). Its general expanded expression can be stated as \(f(v)=\sum _{t=1}^{p}g_{t}(v_{t})\), where gt is some function (obviously, in the very special case of the ontology sparse vector setting, gt(vt)=vtβt).
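
A minimal sketch of the sparse model (3), with an arbitrarily chosen sparse β for illustration:

```python
import numpy as np

p = 10
beta = np.zeros(p)
beta[1], beta[4] = 0.8, -1.2   # sparse: only two nonzero components
beta0 = 0.5                    # offset term, often ignored in practice

def f_sparse(v):
    # Ontology function (3): f(v) = v . beta^T + beta_0.
    return float(np.dot(v, beta) + beta0)

v = np.arange(p, dtype=float)
print(f_sparse(v))  # only components 1 and 4 of v contribute
```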

Returning to the standard framework with the offset term, the general version of the ontology model (3) can be written as

$$ f(v)=\sum_{t=1}^{p}g_{t}(v_{t})+\beta_{0}. $$
(4)

In this paper, we consider the linear combination setting, where the function gt in (4) can be formulated as \(g_{t}(\cdot)=\sum _{q=1}^{d}\beta _{tq}\phi _{q}(\cdot)\) with basis functions ϕq(·). Set

$$\begin{array}{@{}rcl@{}} &\quad&\Delta(v_{i}^{a},v_{j}^{b})=f(v_{i}^{a})-f(v_{j}^{b})\\ &=&\sum_{t=1}^{p}(g_{t}((v_{i}^{a})_{t})-g_{t}((v_{j}^{b})_{t}))\\ &=&\sum_{t=1}^{p}\sum_{q=1}^{d}\beta_{tq}(\phi_{q}((v_{i}^{a})_{t})-\phi_{q}((v_{j}^{b})_{t})) \end{array} $$
(5)

as the difference between the ontology function values at \(v_{i}^{a}\) and \(v_{j}^{b}\). Thus, the expected version can be re-stated as

$$ A(f)=\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\Bbb P(\Delta(V_{a},V_{b})>0), $$
(6)

where \(V_{a}\sim \mathcal {D}_{a}\) belongs to rate a and \(V_{b}\sim \mathcal {D}_{b}\) belongs to rate b. Once a particular combination (a,b) is fixed, we denote \(A^{a,b}(f)=\Bbb P(\Delta(V_{a},V_{b})>0)\).
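
Once the basis is fixed, the quantity Δ in (5) is linear in the coefficients βtq. The sketch below computes it with a hypothetical monomial basis ϕq(x)=x^q (an assumption of ours; any other basis choice works the same way):

```python
import numpy as np

d = 3  # number of basis functions

def phi(q, x):
    # A hypothetical basis choice: monomials phi_q(x) = x**q.
    return x ** q

def delta(vi, vj, beta):
    # Difference (5): sum over components t and basis indices q of
    # beta[t, q-1] * (phi_q((v_i)_t) - phi_q((v_j)_t)); beta has shape (p, d).
    p = len(vi)
    return sum(beta[t, q - 1] * (phi(q, vi[t]) - phi(q, vj[t]))
               for t in range(p) for q in range(1, d + 1))

rng = np.random.default_rng(0)
vi, vj = rng.normal(size=2), rng.normal(size=2)  # two vertices with p = 2
beta = rng.normal(size=(2, d))
print(delta(vi, vj, beta))
```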

Accordingly, the ontology empirical framework with ontology sample set \(S=(S_{1},S_{2},\cdots,S_{k})\in V^{n_{1}}\times V^{n_{2}}\times \cdots \times V^{n_{k}}\) is re-formulated as

$$ \widehat{A}(f,S)=\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\frac{1}{n_{a}n_{b}}\sum_{i=1}^{n_{a}}\sum_{j=1}^{n_{b}}I(\Delta(v_{i}^{a},v_{j}^{b})>0). $$
(7)

It is not hard to verify that \(\widehat {A}(f,S)\) is an unbiased estimator of A(f), that is to say, \(\Bbb E[\widehat {A}(f,S)]=A(f)\). Therefore, the ontology risk and ontology empirical risk are defined as \({k\choose 2}-A(f)\) and \({k\choose 2}-\widehat {A}(f,S)\), respectively.
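
The unbiasedness claim can also be checked numerically. The following sketch (a toy scalar instance space with k=2 rates and Gaussian distributions, all chosen only for this check) compares the Monte Carlo average of \(\widehat{A}(f,S)\) with the closed-form value of A(f):

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo check that E[A_hat(f, S)] matches A(f) for a fixed f.
# Toy setting: scalar vertices, k = 2 rates, D_1 = N(1,1), D_2 = N(0,1),
# f the identity, so A(f) = P(V_1 > V_2) = Phi(1/sqrt(2)) since V_1 - V_2 ~ N(1,2).
rng = np.random.default_rng(2)
A_hats = []
for _ in range(2000):
    Sa = rng.normal(1.0, 1.0, size=6)  # rate-1 sample ~ D_1
    Sb = rng.normal(0.0, 1.0, size=6)  # rate-2 sample ~ D_2
    A_hats.append(np.mean(Sa[:, None] > Sb[None, :]))
print(np.mean(A_hats), norm.cdf(1.0 / np.sqrt(2.0)))  # both approx 0.76
```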

Our main result, presented in the next section, characterizes the convergence property in the multi-dividing ontology setting under the condition that each ni is large (here i∈{1,⋯,k}).

Main result and proof

In this section, we present our result; the detailed proof is based on statistical techniques.

The hypothesis space is an important factor in statistical learning theory. The ontology algorithm can't converge if the space is too large, while the resulting optimal ontology function doesn't have good statistical properties if the hypothesis space is too small. A crucial point in the proof technique is to control the measure of the hypothesis space so as to achieve a certain balance. Here, we set

$$ \mathcal{F}_{d}=\{f:f(v)=\sum_{t=1}^{p}g_{t}(v_{t})=\sum_{t=1}^{p}\sum_{q=1}^{d}\beta_{tq}\phi_{q}(v_{t})\} $$
(8)

as hypothesis space in our setting where ϕq(·) with q∈{1,⋯,d} are basis functions.

For each pair (a,b) with 1≤a<b≤k, suppose \(\frac {n_{a}}{n_{b}}\to c^{a,b}\). Our main result is stated as follows.

Theorem 1

Assume ca,b>0 for each pair (a,b) with 1≤a<b≤k and \(\sum _{n_{a}=1}^{\infty }n_{a}^{2dp}\exp \{-\frac {n_{a}\varepsilon ^{2}}{8}\}<\infty \) for any a∈{1,⋯,k−1} and ε>0. Then

$$ |\sup_{f\in\mathcal{F}_{d}}\widehat{A}(f,S)-\sup_{f\in\mathcal{F}_{d}}A(f)|\to 0 $$
(9)

holds almost everywhere.

Proof of Theorem 1. Our proof techniques depend heavily on the Hoeffding inequality, the Borel-Cantelli lemma and the statistical properties of the shatter coefficient. For any combination (a,b) with 1≤a<b≤k, set \(A_{i}^{a,b}=\frac {1}{n_{b}}\sum _{j=1}^{n_{b}}I(\Delta (v_{i}^{a},v_{j}^{b})>0)\), where i∈{1,⋯,na}. Hence, the ontology empirical version can be re-written as

$$\widehat{A}(f,S)=\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\frac{1}{n_{a}}\sum_{i=1}^{n_{a}}A_{i}^{a,b}.$$

Notice that the \(A_{i}^{a,b}\) are all independent with \(A_{i}^{a,b}\in [0,1]\) for any combination (a,b) with fixed \(v_{1}^{b},\cdots,v_{n_{b}}^{b}\), where 1≤a<b≤k. Thus, by the Hoeffding theorem, we infer

$$\begin{aligned} &\quad\Bbb P(|\widehat{A}(f,S)-A(f)|>\varepsilon)\\ &=\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\Bbb P\left(\left|\frac{1}{n_{a}}\sum_{i=1}^{n_{a}}A_{i}^{a,b}-A^{a,b}(f)\right|>\varepsilon\right)\\ &=\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\Bbb P\left(\left|\sum_{i=1}^{n_{a}}\left(A_{i}^{a,b}-A^{a,b}(f)\right)\right|>n_{a}\varepsilon\right)\\ &=\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\Bbb P\left(\left|\sum_{i=1}^{n_{a}}\left(A_{i}^{a,b}-\Bbb E\left(A_{i}^{a,b}|v_{1}^{b},\cdots,v_{n_{b}}^{b}\right)\right)\right.\right.\\ &\quad\left.\left.+\sum_{i=1}^{n_{a}}\left(\Bbb E\left(A_{i}^{a,b}|v_{1}^{b},\cdots,v_{n_{b}}^{b}\right)-A^{a,b}(f)\right)\right|>n_{a}\varepsilon\right)\\ &\le\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\Bbb P\left(\left|\sum_{i=1}^{n_{a}}\left(A_{i}^{a,b}-\Bbb E\left(A_{i}^{a,b}|v_{1}^{b},\cdots,v_{n_{b}}^{b}\right)\right)\right|\ge\frac{n_{a}\varepsilon}{2}\right)\\ &\quad+\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\Bbb P\left(\left|\sum_{i=1}^{n_{a}}\left(\Bbb E\left(A_{i}^{a,b}|v_{1}^{b},\cdots,v_{n_{b}}^{b}\right)-A^{a,b}(f)\right)\right|>\frac{n_{a}\varepsilon}{2}\right)\\ &=(I)+(II). \end{aligned}$$

For the first part (I), in view of the Hoeffding inequality, we deduce

$$\begin{aligned} (I)&=\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\Bbb P\left(\left|\sum_{i=1}^{n_{a}}\left(A_{i}^{a,b}-\Bbb E\left(A_{i}^{a,b}|v_{1}^{b},\cdots,v_{n_{b}}^{b}\right)\right)\right|\ge\frac{n_{a}\varepsilon}{2}\right)\\ &=\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\Bbb E\left\{\Bbb P\left(\left|\sum_{i=1}^{n_{a}}\left(A_{i}^{a,b}-\Bbb E\left(A_{i}^{a,b}|v_{1}^{b},\cdots,v_{n_{b}}^{b}\right)\right)\right|\ge\frac{n_{a}\varepsilon}{2}\Big|v_{1}^{b},\cdots,v_{n_{b}}^{b}\right)\right\}\\ &\le\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}2\Bbb E\left\{\exp\left\{-\frac{n_{a}\varepsilon^{2}}{8}\right\}\Big|v_{1}^{b},\cdots,v_{n_{b}}^{b}\right\}\\ &=\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}2\exp\left\{-\frac{n_{a}\varepsilon^{2}}{8}\right\}. \end{aligned}$$

In the same fashion, the second part (II) can be similarly bounded:

$$\begin{array}{@{}rcl@{}} (II)&=&\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\Bbb P\left(|\sum_{i=1}^{n_{a}}\left(\Bbb E\left(A_{i}^{a,b}|v_{1}^{b},\cdots,v_{n_{b}}^{b}\right)\right.\right. \\&&\quad \left.\left.-A^{a,b}(f)\right)|>\frac{n_{a}\varepsilon}{2}\right)\\ &\le&\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}2\exp\left\{-\frac{n_{a}\varepsilon^{2}}{8}\right\}. \end{array} $$

Combining the above two parts, we obtain

$$\begin{aligned} &\Bbb P\left(|\widehat{A}(f,S)-A(f)|>\varepsilon\right)\le(I)+(II)\\ &\quad\le 4\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\exp\left\{-\frac{n_{a}\varepsilon^{2}}{8}\right\}. \end{aligned}$$
(10)

By \(\sum _{n_{a}=1}^{\infty }\Bbb P(|\widehat {A}(f,S)-A(f)|>\varepsilon)<\infty \) and the Borel-Cantelli lemma, we know that \(|\widehat {A}(f,S)-A(f)|\to 0\) holds almost everywhere. Since the VC dimension of \(\mathcal {F}_{d}=\{f:f=\beta _{0}+\sum _{t=1}^{p}\sum _{q=1}^{d}\beta _{tq}\phi _{q}\}\) is dp+1 and there are \(\sum _{a=1}^{k-1}\sum _{b=a+1}^{k}n_{a}n_{b}\) U-statistic type observations \(\Delta (v_{i}^{a},v_{j}^{b})\) for 1≤a<b≤k, i∈{1,⋯,na} and j∈{1,⋯,nb}, the shatter coefficient of the linear ontology function space \(\mathcal {F}_{d}\) (the standard definition of the shatter coefficient in the multi-dividing ontology setting is the same as in the k-partite ranking setting; for more details see Gao and Wang [35]) can be bounded by \(\sum _{a=1}^{k-1}\sum _{b=a+1}^{k}\{2+2(n_{a}n_{b}-1)^{dp}\}\le \sum _{a=1}^{k-1}\sum _{b=a+1}^{k}3(n_{a}n_{b})^{dp}\). In light of (10) and this upper bound on the shatter coefficient, we verify that

$$\begin{aligned} &\Bbb P\left(\sup_{f\in\mathcal{F}_{d}}|\widehat{A}(f,S)-A(f)|>\varepsilon\right)\\ &\quad\le 12\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}(n_{a}n_{b})^{pd}\exp\left\{-\frac{n_{a}\varepsilon^{2}}{8}\right\}. \end{aligned}$$
(11)

According to the assumption in the theorem that \(\frac {n_{a}}{n_{b}}\to c^{a,b}>0\) with 1≤a<b≤k, we see that for each pair (a,b), na increases at the same rate as nb. This indicates that when the number of ontology vertices increases, the number of vertices in each branch grows relatively evenly (in graph theory, such a structure is called a nearly balanced tree). Thus, if \(\frac {n_{a}}{n_{b}}\to c^{a,b}>0\), then \(\sum _{a=1}^{k-1}\sum _{b=a+1}^{k}\sum _{n_{a},n_{b}}(n_{a}n_{b})^{pd}\exp \left \{-\frac {n_{a}\varepsilon ^{2}}{8}\right \}<\infty \) is equivalent to \(\sum _{a=1}^{k-1}\sum _{b=a+1}^{k}\sum _{n_{a}}n_{a}^{2dp}\exp \left \{-\frac {n_{a}\varepsilon ^{2}}{8}\right \}<\infty \), which is the assumed condition. Therefore, by means of \(\frac {n_{a}}{n_{b}}\to c^{a,b}>0\) for each pair (a,b) with 1≤a<b≤k and the Borel-Cantelli lemma, we confirm that

$$ \sum\Bbb P(\sup_{f\in\mathcal{F}_{d}}|\widehat{A}(f,S)-A(f)|>\varepsilon)<\infty $$
(12)

and also derive

$$ \sup_{f\in\mathcal{F}_{d}}|\widehat{A}(f,S)-A(f)|\to 0 $$
(13)

holds almost everywhere.

Finally, we need to show that \(\sup _{f\in \mathcal {F}_{d}}\widehat {A}(f,S)-\sup _{f\in \mathcal {F}_{d}}A(f)\to 0\) almost surely. Set

$$\widehat{f}^{*}=\mathop{\arg\max}_{f\in\mathcal{F}_{d}}\widehat{A}(f,S)$$

and

$$f^{*}=\mathop{\arg\max}_{f\in\mathcal{F}_{d}}A(f).$$

Hence, the corresponding values of the AUC criterion are stated as

$$\sup_{f\in\mathcal{F}_{d}}\widehat{A}(f,S)=\widehat{A}(\widehat{f}^{*},S)$$

and

$$\sup_{f\in\mathcal{F}_{d}}A(f)=A(f^{*}).$$

Combining all these facts, we get

$$\begin{array}{@{}rcl@{}} &\quad&|\sup_{f\in\mathcal{F}_{d}}\widehat{A}(f,S)-\sup_{f\in\mathcal{F}_{d}}A(f)|\\ &=&|\widehat{A}(\widehat{f}^{*},S)-A(f^{*})|\\ &\le&|\widehat{A}(\widehat{f}^{*},S)-A(\widehat{f}^{*})|+|A(\widehat{f}^{*})-A(f^{*})|. \end{array} $$
(14)

By means of \(A(\widehat {f}^{*})\le A(f^{*})\), the second term in (14) can be decomposed and then bounded as follows:

$$\begin{array}{@{}rcl@{}} &\quad&|A(\widehat{f}^{*})-A(f^{*})|\\ &=&A(f^{*})-A(\widehat{f}^{*})\\ &=&A(f^{*})-\widehat{A}(\widehat{f}^{*},S)+\widehat{A}(\widehat{f}^{*},S)-A(\widehat{f}^{*})\\ &\le&A(f^{*})-\widehat{A}(f^{*},S)+\widehat{A}(\widehat{f}^{*},S)-A(\widehat{f}^{*})\\ &\le&2\sup_{f\in\mathcal{F}_{d}}|\widehat{A}(f,S)-A(f)|. \end{array} $$
(15)

From (13), (14) and (15), we obtain

$$\begin{array}{@{}rcl@{}} &\quad&|\sup_{f\in\mathcal{F}_{d}}\widehat{A}(f,S)-\sup_{f\in\mathcal{F}_{d}}A(f)|\\ &\le&|\widehat{A}(\widehat{f}^{*},S)-A(\widehat{f}^{*})|+|A(\widehat{f}^{*})-A(f^{*})|\\ &\le&|\widehat{A}(\widehat{f}^{*},S)-A(\widehat{f}^{*})|+2\sup_{f\in\mathcal{F}_{d}}|\widehat{A}(f,S)-A(f)|\\ &\le&3\sup_{f\in\mathcal{F}_{d}}|\widehat{A}(f,S)-A(f)|\to 0 \end{array} $$

holds almost everywhere. □

Theorem 1 still holds even if d and p are allowed to increase with the ontology sample capacity. In this case, the hypothesis space \(\mathcal {F}_{d}\) grows as d and p grow. However, the rates of d and p depend on the combination (n1,⋯,nk), which implies that the dimension p must increase more slowly than the ontology sample capacities. Theorem 1 reveals that if p is much larger than the ontology sample capacities, then the ontology empirical risk minimization framework may not achieve the desired performance. Fortunately, under certain structural assumptions (for instance, sparsity), we have good reason to believe that an optimal ontology rule can be constructed, with some offset terms eliminated, even when p is much larger than the ontology sample capacities.

Since the 0-1 ontology loss is non-differentiable, it is difficult to minimize \({k\choose 2}-\widehat {A}(f,S)\) in practice, and it is natural to apply approximation techniques based on a smooth ontology function, such as the logistic ontology function \(\Lambda _{\tau }(x)=\frac {\exp \{-x\tau \}}{1+\exp \{-x\tau \}}\), where τ is a positive number controlling how steep the logistic ontology function is around zero. In light of Λτ(x), we consider the approximation Φ(f) to \({k\choose 2}-\widehat {A}(f,S)\) in the multi-dividing ontology setting, i.e.,

$$\begin{array}{@{}rcl@{}} &\quad&\Phi(f)=\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\frac{1}{n_{a}n_{b}}\sum_{i=1}^{n_{a}}\sum_{j=1}^{n_{b}}\Lambda_{\tau}\left(\Delta(v_{i}^{a},v_{j}^{b})\right)\\ &=&\sum_{a=1}^{k-1}\sum_{b=a+1}^{k}\frac{1}{n_{a}n_{b}}\sum_{i=1}^{n_{a}}\sum_{j=1}^{n_{b}}\frac{\exp\left\{-\Delta(v_{i}^{a},v_{j}^{b})\tau\right\}}{1+\exp\left\{-\Delta(v_{i}^{a},v_{j}^{b})\tau\right\}} \end{array} $$

for some positive number τ. Minimizing Φ(f) over \(\mathcal {F}_{d}\) is equivalent to minimizing over \(\gamma \in \Bbb R^{p\times d}\), and thus it can be re-formulated as Φ(f)=Φ(γ), where γ=(γ1,⋯,γp) and γt=(γt1,⋯,γtd) for t∈{1,⋯,p}. Therefore, the minimizer of Φ(γ) can be identified numerically, since Φ(γ) is a smooth function of γ. Theorem 1 indicates that the optimization of the 0-1 ontology loss is guaranteed to yield the best behaviour under certain restrictions on the ontology sample dimensions and capacities. Generally speaking, however, it is not practical to compute the minimizer of the 0-1 ontology loss, particularly in high-dimensional settings. In all, it is a sensible and popular approach to approximate the 0-1 ontology loss by a smooth ontology function; this is considered an intelligent approximation, although the optimality of the approximation is still unknown.
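
A minimal numerical sketch of this surrogate minimization, assuming the monomial basis ϕq(x)=x^q from the earlier example and toy Gaussian rate samples (both assumptions of ours), using a generic quasi-Newton optimizer:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # expit(x) = 1 / (1 + exp(-x))

tau, d = 5.0, 2  # steepness of Lambda_tau; number of basis functions

def features(v):
    # Stack the assumed monomial basis values phi_q(v_t) = v_t**q so that
    # f(v) = sum_t sum_q beta_tq phi_q(v_t) = <gamma, features(v)>.
    return np.concatenate([v ** q for q in range(1, d + 1)])

def Phi_obj(gamma, X):
    # Surrogate Phi: Lambda_tau(Delta) averaged over cross-rate pairs,
    # with Lambda_tau(x) = exp(-x*tau) / (1 + exp(-x*tau)) = expit(-tau*x).
    total, k = 0.0, len(X)
    scores = [Xa @ gamma for Xa in X]
    for a in range(k - 1):
        for b in range(a + 1, k):
            D = scores[a][:, None] - scores[b][None, :]  # Delta(v_i^a, v_j^b)
            total += np.mean(expit(-tau * D))
    return total

# toy data: k = 3 rates, p = 2; higher rates get larger coordinates
rng = np.random.default_rng(1)
S = [rng.normal(loc=mu, size=(8, 2)) for mu in (2.0, 1.0, 0.0)]
X = [np.array([features(v) for v in Sa]) for Sa in S]
res = minimize(Phi_obj, x0=np.zeros(2 * d), args=(X,), method="L-BFGS-B")
print(res.fun)  # small values indicate a good learned ranking
```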

Now, we explain in detail the weak points of the linear combination multi-dividing ontology learning algorithm, i.e., under what circumstances it is not suitable.

  • It can be seen from the ontology learning models (1), (2), (6) and (7) that dimensionality reduction is achieved through pairwise comparison of the ontology sample vertices. The weakness is that only two ontology vertices can be compared at a time, so the number of vertex pairs to be compared in the optimization model grows very quickly as the sample capacity increases.

  • The depth of a vertex in the ontology graph is defined as the distance between this vertex and the top vertex, and the depth of the ontology graph is generally defined as the depth of the deepest vertex. As the depth increases, the concepts of the ontology become more and more detailed, and the similarity between adjacent vertices of the upper and lower layers becomes larger. Conversely, the smaller the number of layers, the larger the span of the ontology concepts and the smaller the similarity between upper and lower concepts. This is what is commonly called the structural distribution of the ontology graph. Looking again at our multi-dividing ontology algorithm in the linear combination setting, it cannot reflect this structural characteristic of the ontology graph, since each compared pair of ontology vertices comes from branches of different rates.

Experiment

In this section, we mainly focus on the effectiveness of the algorithm in some specific fields from an experimental point of view. The ontology data used here all have tree structures (or nearly tree structures) in order to fit the multi-dividing setting, and we aim to investigate similarity-based ontology mapping between two different ontology trees in the same application field. The entire execution process can be described as follows: first, for two university ontology graphs or two mathematical ontology graphs, domain experts determine, for each vertex, the N most similar vertices in the other ontology (here N=1,3,5), which are marked as the target similarity vertex set of that vertex; then, by means of our linear combination multi-dividing ontology algorithm, we calculate the real number corresponding to each vertex and record the N most similar vertices in the corresponding ontology; for each vertex, we compare the similarity set given by the experts with the one obtained by the algorithm and compute the matching rate; finally, the average matching rate over the entire ontology graph is calculated from the matching rates of all vertices in the two ontology graphs.
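
The following sketch implements this evaluation under assumed data layouts (arrays of computed real numbers and expert sets indexed by source vertex); it is not the code used for the tables below, merely an illustration of the matching-rate computation:

```python
import numpy as np

def top_n(scores_dst, s, N):
    # The N vertices of the other ontology whose real numbers are
    # closest to s on the real axis (similarity = closeness of f-values).
    return set(np.argsort(np.abs(scores_dst - s))[:N].tolist())

def average_matching_rate(scores_src, scores_dst, expert_sets, N):
    # For each source vertex: |algorithm top-N intersect expert set| / N,
    # then averaged over all source vertices.
    rates = [len(top_n(scores_dst, s, N) & expert_sets[i]) / N
             for i, s in enumerate(scores_src)]
    return float(np.mean(rates))

# toy example: 4 source vertices, 5 destination vertices, N = 1
scores_src = np.array([0.10, 0.50, 0.90, 1.30])
scores_dst = np.array([0.12, 0.48, 0.95, 1.25, 2.00])
expert = [{0}, {1}, {2}, {3}]
print(average_matching_rate(scores_src, scores_dst, expert, N=1))  # 1.0
```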

Experiment on university data

University ontologies are very well-known ontologies, which often appear as examples in introductory textbooks on ontologies; the structures of the two university ontologies O1 and O2 are depicted in Fig. 1.

Fig. 1. University ontologies O1 and O2

In our multi-dividing linear combination setting, k=3, and the three branches correspond to “course”, “student” and “staff”. It is clear that |V(G)|=28, where G is the union of the two subgraphs and the same concept in different subgraphs is regarded as different vertices. We take 14 vertices from the whole vertex set as the ontology sample set. In order to compare with other ontology learning algorithms, we compare the experimental data (some of these data have already been presented in previous articles), and part of the results are as follows.

From the comparison of the data in Table 1, we can see that, for the average accuracy on the university ontology, our linear combination multi-dividing ontology learning algorithm is significantly better than the previous three algorithms.

Table 1 Comparison results on the university ontology for N=1,3,5

Experiment on mathematical data

Mathematical ontologies are constructed for mathematical education and provide mathematical knowledge for graduate students in the field of discrete mathematics. As in the first experiment, our aim is to build a bridge between the following two mathematical ontology graphs based on similarity computation between ontology vertices. The structures of the two mathematical ontologies O3 and O4 are depicted in Fig. 2.

Fig. 2. Mathematical ontologies O3 and O4

Although the graph structures of O3 and O4 are not trees, they are very close to tree-shaped acyclic graph structures, and thus can be treated as tree structures after simple processing. After analysis we take k=4, and it is clear that |V(G)|=26. We take half of the vertices as the ontology sample, i.e., |S|=13. Similarly, to compare with other ontology learning algorithms, we directly use the experimental data presented in [24, 25] and [27]. Furthermore, we test the accuracy of the “confidence weighted ontology algorithm” presented in [37] and the “weak function based ontology learning algorithm” presented in [38], and compare them with our ontology learning algorithm. Part of the results are as follows.

From the comparison results in Table 2, we find that, for the average accuracy on the mathematical ontology, the linear combination multi-dividing ontology learning algorithm proposed in this paper is more efficient than the previous three algorithms.

Table 2 Comparison results on the mathematical ontology for N=1,3,5

In the above two comparative experiments, we believe that the reason the ontology learning algorithm in this paper outperforms the other algorithms is that our algorithm is designed for tree structures, whereas the other ontology learning algorithms were designed with different focuses and goals in their respective engineering fields. The “university” ontology is a pure tree structure, and although the “mathematical” ontology is not strictly an acyclic graph, it can also be processed and divided according to a tree structure. Of the other ontology learning algorithms compared in the experiments, some are not designed for tree structures, and some design the algorithm from a different angle. For instance, (1) although the confidence weighted ontology algorithm in [37] is also designed under a multi-dividing framework, its purpose is to save space complexity, and its core is a buffer update strategy rather than an iteration of ontology functions; (2) the disequilibrium ontology learning algorithm in [27] is also presented in the multi-dividing ontology learning setting, but it focuses on the balance between the data rather than the structure of the ontology graph. In general, the efficiency of the algorithm in this paper reflects its advantage on tree-structured ontology graphs.

Conclusion

As a powerful auxiliary tool, the ontology has penetrated various research fields such as chemistry, genetics, and pharmacy, providing technical support to scientists from all walks of life. In the process of ontology construction, scholars found that most ontologies use a tree structure to represent the hierarchical and derivative relationships between concepts. It can be said that the tree structure is the most suitable structural representation of ontology concepts. Based on this fact, researchers have proposed several multi-dividing ontology learning algorithms, which divide the vertices into categories according to the branches of the ontology tree structure. The existing experimental data fully show that multi-dividing ontology algorithms achieve high efficiency on some well-known application ontologies (such as “GO”, “PO”, etc.).

In this article, we focus mainly on the theoretical analysis of the ontology learning algorithm. The approximation property of the multi-dividing ontology learning algorithm is analyzed from the perspective of statistical learning theory, and the result shows that the algorithm has very good approximation properties in the linear combination setting.

We list some open problems as the end of this paper:

  • How can the covering number be used to characterize the properties of the hypothesis space, so as to obtain a theoretical bound on the covering number approximation in the multi-dividing ontology learning setting?

  • What will happen if we assume the ontology tree is not balanced (the ni do not increase at the same rate, where i∈{1,⋯,k})?

  • Find a suitable assumption to ensure that the ontology function satisfies the “uniform Glivenko-Cantelli” property in the multi-dividing ontology learning setting.