1 Introduction

Data science has become an extensively investigated discipline, where diverse methods are applied to identify and extract actionable information.

Mathematical areas including Network Theory, Dynamical Systems, and Topology, as well as sophisticated algebraic approaches, have been introduced. These have demonstrated their potential in addressing data complexity, and their applications have led to the design of effective algorithms (Trovati et al. 2020; Shao et al. 2017; Ray and Trovati 2018; Xu et al. 2019; Ray et al. 2018; Trovati et al. 2022).

In this work, an approach based on the topological and information theoretic properties of networks is introduced, which aims to assess and analyse semantic features and their reduction. The main motivation follows from the recent advances in Topological Data Analysis (TDA), which has been shown to identify invariant data properties that can be used to assess and analyse large quantities of unstructured data, otherwise difficult to investigate via standard approaches (Carlsson 2008).

We define a knowledge system as a set of concepts and mutually connecting semantic properties, which naturally define semantic networks. More specifically, these are defined as networks \(G=G(V,E)\), where \(V = \{ v_{i}\}_{i=1}^{n}\) is the node-set and \(E = \{ e(v_{i}, v_{j}) \}_{i\not = j=1}^{n}\) is the edge-set; each edge can be represented as a multi-dimensional vector based on a predefined set of (semantic) relations. In other words, multiple relations typically exist between two concepts. Furthermore, each node is embedded into a multi-dimensional feature space. It is important to specify that features are assumed to be semantic, that is, well-defined relational concepts which can be quantified based on their attributes. These properties define semantic features, which are central to this work. Loosely speaking, they refer to semantic properties which are linked to (potentially multiple) edges, or embedded into the nodes. In particular, we will assume that each component of every edge can be evaluated within the interval [0, 1] via a weight function \(w: E \mapsto [0,1]\), so that each edge \(e(v_{i}, v_{j}) \in E\) has weight \(w(e(v_{i}, v_{j}))\).

As discussed above, each edge contains specific information embedded in an \(n-\)dimensional feature space \(\mathcal {R}\subset [0,1]^{n}\). These features are typically captured by multiple edges between nodes, which can be flattened into a single edge. We will assume that such evaluation and assessment can be carried out independently from the approach introduced in this work, as part of a suitable data pre-processing stage. Furthermore, it is also assumed that an appropriate measure capturing the potentially multiple semantic links between pairs of nodes has already been applied.

As a consequence, each node \(v_i \in V\) can be represented as an element of \([0,1]^{n}\). Furthermore, every edge is associated with a map \(R_{k}: x \rightarrow y\) for the \(k-\)th feature in \(\mathcal {R}\), linking the nodes \(x, y \in V\). Finally, let

$$\begin{aligned} n(x \leftrightarrow y) = \{R_{1},\ldots ,R_{n}\} \subset \mathcal {R}, \end{aligned}$$
(1)

be the set of relations shared by the nodes x and y.
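
To make the above concrete, the following minimal sketch shows one possible in-memory representation of such a network, in which each undirected edge carries one weight per semantic relation. All node names, relation labels and weights are illustrative only.

```python
from typing import Dict, Set, Tuple

# Each undirected edge e(v_i, v_j) carries one weight in [0, 1] per semantic
# relation; the node and relation names below are purely illustrative.
Edge = Tuple[str, str]
edges: Dict[Edge, Dict[str, float]] = {
    ("dog", "animal"):  {"is_a": 0.9, "related_to": 0.6},
    ("dog", "bone"):    {"desires": 0.7},
    ("animal", "bone"): {"related_to": 0.3},
}

def relations(x: str, y: str) -> Set[str]:
    """n(x <-> y): the set of relations shared by the nodes x and y (Eq. 1)."""
    key = (x, y) if (x, y) in edges else (y, x)
    return set(edges.get(key, {}))

print(relations("dog", "animal"))  # {'is_a', 'related_to'}
```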

This article is structured as follows: in Sect. 2, existing research and methods are discussed. Sections 3 and 4 introduce the main background information and the main results, respectively. Finally, Sect. 5 concludes the article and points to new directions.

2 Related works

Much of the research in this area focuses on the extraction, from unstructured data, of information that may be utilised to make decisions (Chazal and Michel 2017). In fact, research and industry organisations, as well as general human activities, generate massive amounts of data, which poses significant challenges as well as significant opportunities. A major problem is comprehending and evaluating the importance and worth of the useful information included within this abundance of data.

With an emphasis on the structure and shape of data in terms of the relationships between its components, new theoretical methods based on topological techniques have been developed (Carlsson 2008). One key property of topology is the capacity to categorise objects based on their shared characteristics, or in other words, on their invariant qualities. The geometrical characteristics of the data under investigation are crucial in many clustering and classification techniques. Indeed, a large portion of these methods concentrate on the notion of distance, which must be defined in terms of the mathematical spaces in which the data is contained.

Persistent homology, which focuses on identifying the topological qualities that remain invariant (Edelsbrunner and Harer 2008), has been drawing increasing attention from the research community due to its applicability to AI, machine learning and data science in general. Simplicial complexes are its fundamental components. These are triangulations of topological spaces, consisting of glued, non-overlapping polyhedra, as depicted in Fig. 1.

Fig. 1: The graphical representation of simplicial complexes, as described in Zhang et al. (2020)

Image pixellation is a simple but instructive example of a space triangulation, in which a real image is covered by pixels to provide an accurate approximation.

The key point is that a simplicial complex provides an ‘approximation’ of the object it covers via a triangulation. Triangulations include, but are not limited to, Voronoi diagrams, Delaunay triangulations, Vietoris complexes, and Čech complexes. In general, they are defined in terms of a certain distance or by intersections of balls whose centres are the data points. Additionally, the adjacency graphs produced by these triangulations can provide a number of details about their invariant topological characteristics, making them pertinent to data science (Carlsson 2008). Refer to Edelsbrunner and Harer (2008) for further information on simplicial complexes.

3 Observability

In this section, the concept of observability is introduced. The main motivation is that information is intertwined with its observability; in other words, information must be observed. We define the following types of observability which, as explained above, are embedded in a knowledge system, or in other words, a suitably defined semantic network.

  • Single object observability: O(x) measures the observability of x with respect to the other nodes. More specifically, we have

    • Absolute observability: a single-object observability which covers the entire network

    • Local observability: similar to the above but covers only a subnetwork containing x

  • Mutual observability: this occurs between two (semantic) objects x and y, and it is defined as \(O(x, y)\).

In the rest of this section, a formal definition of the relevant observabilities derived from the above instances is introduced.

Definition 1

(Observabilities) Let \(G = G(V,E)\) be a (semantic) network (knowledge system), let \(x,y \in V\) be two nodes, and let \(e \in E\) be an edge. Assume that n(x) is the set of edges incident to x (whose cardinality is the degree of x), so that \(n(x \leftrightarrow y)\) is the set of (undirected) edges connecting x with y. Let \(w(e) \in (0, 1]\) be the weight of the edge e. We define the observability from x to y as

$$\begin{aligned} O(x \rightarrow y) = -\log {\left( \frac{w(x \leftrightarrow y)}{w(x)} \right) }, \end{aligned}$$
(2)

where

$$\begin{aligned} w(x \leftrightarrow y) = \sum _{e_{i} \in n(x \leftrightarrow y)} w(e_{i}) \end{aligned}$$
(3)

and

$$\begin{aligned} w(x) = \sum _{e_{j} \in n(x)} w(e_{j}). \end{aligned}$$
(4)

In this article, a simplified version of observability, the naïve observability of x, will be used, which is defined as

$$\begin{aligned} O_{n}(x) = - \log {\left( \frac{w(x)}{\sum _{y \in V(\tilde{G})} w(y)} \right) }. \end{aligned}$$
(5)

The difference between the naïve observability \(O_{n}(x)\) and those described above is that the former only considers the level of connectivity of x with its neighbours, compared to all the other nodes in the network G. In other words, this type of observability is only affected by the immediate connections of x, as opposed to the overall topology of G and the corresponding connected components and paths. Despite this, it provides a useful tool in the analysis of the results in the remainder of the article.
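
The quantities in Definition 1 and Eq. 5 translate directly into code. The following minimal sketch, which reuses the illustrative `edges` dictionary from Sect. 1 and assumes every queried weight is strictly positive (so that the logarithms are defined), is one possible transcription rather than the authors' implementation.

```python
import math

def w_between(edges, x, y):
    """w(x <-> y): total weight of the flattened edge joining x and y (Eq. 3)."""
    key = (x, y) if (x, y) in edges else (y, x)
    return sum(edges.get(key, {}).values())

def w_node(edges, x):
    """w(x): total weight over all edges incident to x (Eq. 4)."""
    return sum(sum(rels.values()) for (u, v), rels in edges.items() if x in (u, v))

def observability(edges, x, y):
    """O(x -> y), the observability from x to y (Eq. 2)."""
    return -math.log(w_between(edges, x, y) / w_node(edges, x))

def naive_observability(edges, x, nodes):
    """O_n(x), the naive observability of x within the node set (Eq. 5)."""
    total = sum(w_node(edges, y) for y in nodes)
    return -math.log(w_node(edges, x) / total)
```

For instance, on the toy network above, `observability(edges, "dog", "animal")` evaluates \(-\log (1.5/2.2) \approx 0.38\).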

It is easy to see that \(O(x \rightarrow y)\) is only a premetric, since in general it is not symmetric and the triangle inequality does not hold.

Let \(\underline{x} = x_{1}x_{2}\cdots x_{n}\) be a path through pairwise adjacent nodes. We can easily see that the observability of such a path is

$$\begin{aligned} O(\underline{x}) = \sum _{i=1}^{n-1} O(x_{i} \rightarrow x_{i+1}) \end{aligned}$$
(6)

In particular, using Eq. 2 this can be written as

$$\begin{aligned} O(\underline{x})= & {} -\log {\left( \frac{w(x_{1} \leftrightarrow x_{2})}{w(x_{1})} \cdots \frac{w(x_{n-1} \leftrightarrow x_{n})}{w(x_{n-1}) } \right) } \nonumber \\= & {} -\log {\left( \prod _{i=1}^{n-1} \frac{ w(x_{i} \leftrightarrow x_{i+1})}{w(x_{i})} \right) } \end{aligned}$$
(7)

3.1 Probability as a function of observability

Any probability measure defined over complex networks is closely related to their overall topology, which usually depends on the corresponding degree distribution.

Based on the above and on general network properties (Newman 2010), we have that

$$\begin{aligned} p(x)= & {} e^{-O_{n}(x)}, \nonumber \\ p(x,y)= & {} e^{-O(x, y)}, \text{ and } \nonumber \\ p(y\,\vert \,x)= & {} e^{-O(x \rightarrow y)}, \end{aligned}$$
(8)

where p(x), \(p(x, y)\) and \(p(y\,\vert \,x)\) are the probability of x, the joint probability of x and y, and the conditional probability of y given x, respectively.

Using the notation introduced in Definition 1, let \(x_1\) and \(x_k\) be two nodes. Define the path \(\underline{x} = x_{1},\ldots , x_{k}\) and let the probability \(p(\underline{x})\) of reaching \(x_k\) from \(x_1\) along this path be

$$\begin{aligned} p(\underline{x}) = \prod _{i=1}^{k-1} \frac{w(x_{i} \leftrightarrow x_{i+1})}{w(x_{i})} \end{aligned}$$
(9)

Note that, if we define a path \(\underline{\tilde{x}} = x_{k},\ldots , x_{1}\), that is the path going in the opposite direction of \(\underline{x}\), we have that

$$\begin{aligned} p(\underline{\tilde{x}}) = \prod _{i=1}^{k-1} \frac{w(x_{i+1} \leftrightarrow x_{i})}{w(x_{i+1})} \end{aligned}$$
(10)

and \(p(x) = \displaystyle {\frac{w(x)}{ \sum _{y \in V(\tilde{G})} w(y)}}\) for any node \(x \in V\).

The above shows that, in general, \(p(\underline{x}) \not = p(\underline{\tilde{x}})\). Again, we can easily see that \(p(\underline{x})\) can be written in terms of \(O(\underline{x})\) as

$$\begin{aligned} p(\underline{x}) = e^{\displaystyle {-\sum _{i=1}^{k-1}O(x_{i} \rightarrow x_{i+1})}} = e^{-O(\underline{x})}. \end{aligned}$$
(11)
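
As a small worked example, with illustrative weights not taken from the article, consider the path \(\underline{x} = x_{1},x_{2},x_{3}\) with

$$\begin{aligned} w(x_{1} \leftrightarrow x_{2}) = 0.5, \quad w(x_{1}) = 1, \quad w(x_{2} \leftrightarrow x_{3}) = 0.25, \quad w(x_{2}) = 0.75. \end{aligned}$$

Then \(p(\underline{x}) = \frac{0.5}{1} \cdot \frac{0.25}{0.75} = \frac{1}{6}\) and \(O(\underline{x}) = -\log {\frac{1}{6}} = \log {6} \approx 1.79\), so that \(e^{-O(\underline{x})} = \frac{1}{6} = p(\underline{x})\), as expected from Eq. 11.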

The above observations and remarks lead to the following result.

Lemma 1

Let \(O(x, y)\) be the joint observability of x and y. Then

$$\begin{aligned} O(x , y) = O(x \rightarrow y) + O(x) \end{aligned}$$
(12)

Proof

Note that

$$\begin{aligned} p(y\,\vert \,x) = \frac{p(x,y)}{p(x)}. \end{aligned}$$

Since

$$\begin{aligned} p(y\,\vert \,x) = e^{-O(x \rightarrow y)}, \end{aligned}$$

then we have that

$$\begin{aligned} e^{-O(x \rightarrow y)} = \frac{e^{-O(x, y)}}{e^{-O(x )}} = e^{O(x)-O(x, y)}. \end{aligned}$$
(13)

The result follows. \(\square\)

3.2 Mutual information

Mutual information plays an important role in assessing the feature space and in particular, in reducing its dimensionality (Shadvar 2012). It is defined as

$$\begin{aligned} I(x;y) = p(x,y) \log {\left( \frac{p(x,y)}{p(x) p(y)} \right) } \end{aligned}$$
(14)

Mutual information can provide useful insights into feature inter-dependencies. In particular, an optimal feature set is associated with a maximal value of \(I(x; y)\). Similarly to the above, we can write \(I(x; y)\) in terms of observabilities using Eq. 14, namely

$$\begin{aligned} I(x;y) =e^{-O(x, y)}(O(x)+ O(y) - O(x, y)) \end{aligned}$$
(15)

Eq. 15 can be used to find a lower bound of \(I(x; y)\) with respect to the observabilities. Recall that

$$\begin{aligned} p(x,y)= & {} p(y \,\vert \,x) p(x) \nonumber \\ {}= & {} e^{-(O(x\rightarrow y) + O(x))} \end{aligned}$$
(16)

Hence

$$\begin{aligned} I(x;y) = e^{-(O(x\rightarrow y) + O(x))} (O(y)-O(x\rightarrow y)) \end{aligned}$$
(17)

Since p(xy) can also be written as \(p(x \,\vert \,y) p(y)\), or in other words as \(e^{-(O(y\rightarrow x) + O(y))}\), using a similar argument we have that

$$\begin{aligned} I(x;y) = e^{-(O(y\rightarrow x) + O(y))} (O(x)-O(y\rightarrow x)) \end{aligned}$$
(18)
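
The equivalence of Eqs. 15 and 17 is straightforward to check numerically. The following sketch uses illustrative observability values together with Lemma 1; it is a sanity check rather than part of the authors' evaluation.

```python
import math

def mutual_information(O_x, O_y, O_xy):
    """I(x; y) written in terms of observabilities (Eq. 15)."""
    return math.exp(-O_xy) * (O_x + O_y - O_xy)

# Illustrative values for O(x), O(y) and O(x -> y).
O_x, O_y, O_x_to_y = 1.2, 1.5, 0.4
O_xy = O_x_to_y + O_x  # Lemma 1: O(x, y) = O(x -> y) + O(x)

i_joint = mutual_information(O_x, O_y, O_xy)             # Eq. 15
i_cond = math.exp(-(O_x_to_y + O_x)) * (O_y - O_x_to_y)  # Eq. 17
assert abs(i_joint - i_cond) < 1e-12
```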

Recall from Definition 1, \(\displaystyle {O(y \rightarrow x) = O(x \rightarrow y)- \log {\frac{w(x)}{w(y)}}}\). This implies that, if no extra neighbouring nodes are added to either x or y, then if \(O(y \rightarrow x)\) increases, \(O(x \rightarrow y)\) will increase as well, and vice-versa. This leads to the following result.

Lemma 2

Let \(I(x; y)\) be the mutual information between x and y. If \(O(y \rightarrow x)\), or equivalently \(O(x \rightarrow y)\), is minimised, then \(I(x; y)\) will increase towards its maximum value.

It is important to emphasise that Lemma 2 does not provide an exact evaluation of the behaviour of mutual information, but only an estimation. However, it provides a tool to identify which features, if removed, are likely to increase mutual information.
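
A hedged sketch of how Lemma 2 might be operationalised follows: given the per-relation weights of a single flattened edge, each candidate feature is scored by the directed observability that would remain after its removal (cf. Eq. 22 below), and the removal leaving the smallest \(O(x \rightarrow y)\) is preferred. The function names are illustrative, and the sketch assumes the edge retains positive weight after the removal.

```python
import math

def observability_after_removal(rel_weights, w_x, feature):
    """O(x -> y) after deleting one semantic feature from the edge x <-> y.

    rel_weights maps each feature on the edge to its weight, so w(x <-> y)
    is their sum; w_x is the total incident weight w(x). Removing a feature
    subtracts its weight w~ from both quantities.
    """
    w_tilde = rel_weights[feature]
    return -math.log((sum(rel_weights.values()) - w_tilde) / (w_x - w_tilde))

def best_removal(rel_weights, w_x):
    """The feature whose removal minimises O(x -> y), as suggested by Lemma 2."""
    return min(rel_weights,
               key=lambda f: observability_after_removal(rel_weights, w_x, f))
```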

3.3 Entropy and its variations

From Eq. 17, recalling that the (pointwise) conditional entropy contribution is \(-p(x_{1},x_{2})\log {p(x_{2}\,\vert \,x_{1})}\), we can calculate it as

$$\begin{aligned} H(x_{2}\,\vert \,x_{1}) = e^{-(O(x_{1} \rightarrow x_{2})+O(x_{1}))}\, O(x_{1} \rightarrow x_{2}) \end{aligned}$$
(19)

If we consider such (semantic) networks as dynamical networks, which change with respect to a (discrete) ‘time’ parameter, Eq. 2 can be written to emphasise the iteration time k, as

$$\begin{aligned} O_{k}(x_{1} \rightarrow x_{2}) = -\log {\left( \frac{w_{k}(x_{1} \leftrightarrow x_{2})}{w_{k}(x_{1})} \right) }, \end{aligned}$$
(20)

Assume we remove one semantic feature from \(n(x_{1} \leftrightarrow x_{2})\), and let the corresponding weight of this feature be \(\tilde{w}\). Equation 20 can then be re-written as:

$$\begin{aligned} \log {(w_{k}(x \leftrightarrow y))}= \log {(w_{k}(x))} - O_{k}(x \rightarrow y). \end{aligned}$$
(21)

Note that

$$\begin{aligned} w_{k-1} (x \leftrightarrow y)= & {} w_{k}(x \leftrightarrow y)-\tilde{w} \nonumber \\ w_{k-1} (x)= & {} w_{k}(x)-\tilde{w} \end{aligned}$$
(22)

From the above observations, for a path \(\underline{x} = x_{1},\ldots , x_{k}\), we have that

$$\begin{aligned} O_{k}(\underline{x}) - O_{k-1}(\underline{x}) \ge \nonumber \\ \log {\frac{\prod _{i=1}^{n} w_{k}(x_{i})}{(w_{k} (x_{1})-\tilde{w})(w_{k} (x_{n})-\tilde{w})\prod _{i=2}^{n-1} (w_{k}(x_{i})-2\tilde{w})}} \end{aligned}$$
(23)

Let

$$\begin{aligned} \Delta _{k}(H(x_{2} \,\vert \,x_{1}))= |H_{k}(x_{2} \,\vert \,x_{1}) - H_{k-1}(x_{2} \,\vert \,x_{1}) |\end{aligned}$$
(24)

From the above equations, we can derive that

$$\begin{aligned} \Delta _{k}(H(x_{2} \,\vert \,x_{1}))\le & {} (O_{k}(x_1 \rightarrow x_2 ) - O_{k-1}(x_1 \rightarrow x_2 ))\frac{\tilde{w}}{w_{k}(x_1)} \nonumber \\\le & {} \Delta _{k}(O_{k}(x_1 \rightarrow x_2 )) \end{aligned}$$
(25)

using the notation in (24). We can also derive that

$$\begin{aligned} \Delta _{k}(H(x_{2} \,\vert \,x_{1})) \ge \sum _{i=1}^{n-1} \Delta _{k}(O_{k}(x_{i} \rightarrow x_{i+1} )). \end{aligned}$$
(26)

If we use \(H_{k}(x_{i} \,\vert \,x_{j})\) for two adjacent nodes \(x_{i}\) and \(x_{j}\) at time iteration k, as a general entropy ‘indicator’, we can consider its overall contribution based on all the nodes in the network \(G_{k}(V,E)\). In other words,

$$\begin{aligned} H_{k}(G) = \frac{1}{2}\sum _{i \in V} \sum _{j_{i} \in n_{k}(i)} H_{k}(x_{j_{i}} \,\vert \,x_{i}). \end{aligned}$$
(27)

This allows us to evaluate

$$\begin{aligned} \Delta _{k}(H(G)) = \frac{1}{2}\sum _{i \in V} \left( \sum _{j_{i} \in n_{k}(i)} H_{k}(x_{j_{i}} \,\vert \,x_{i}) - \sum _{j_{i} \in n_{k-1}(i)} H_{k-1}(x_{j_{i}} \,\vert \,x_{i}) \right) . \end{aligned}$$
(28)

Following a similar approach as above, we can also see that

$$\begin{aligned}{} & {} \Delta _{k}(H(G)) \le \frac{1}{2} \sum _{i \in V} \sum _{j_{i} \in n_{k}(i)} \Delta _{k}(O(x_{i} \rightarrow x_{j_{i}} )) \frac{\tilde{w}}{w_{k}(x_{i})}, \text{ and } \end{aligned}$$
(29)
$$\begin{aligned}{} & {} \Delta _{k}(H(G)) \ge \frac{1}{2} \sum _{i \in V} \sum _{j_{i} \in n_{k}(i)} \Delta _{k}(O(x_{i} \rightarrow x_{j_{i}} )) \frac{\tilde{w}}{w_{k}(x_{i})-w_{k}(z)}. \end{aligned}$$
(30)

Recall that we are removing one semantic relation across all the edges. Therefore, Eq. (29) can be rearranged as follows

$$\begin{aligned} \Delta _{k}(H(G))\le & {} \frac{\tilde{w}}{2} \sum _{i \in V} \sum _{j_{i} \in n_{k}(i)} \frac{\Delta _{k}(O(x_{i} \rightarrow x_{j_{i}} ))}{w_{k}(x_i)} \nonumber \\= & {} \frac{\tilde{w}}{2} \sum _{i \in V}\frac{\Delta _{k}(O_{n}(x_{i}))}{w_{k}(x_i)}. \end{aligned}$$
(31)

The above calculations lead to the following result.

Lemma 3

The entropy variation \(\Delta _{k}(H(G))\) of the network G at time iteration k can be minimised by minimising

$$\begin{aligned} \sum _{i \in V}\frac{\Delta _{k}(O_{n}(x_{i}))}{w_{k}(x_i)} \end{aligned}$$


This lemma further demonstrates the role of observability in the dynamical and topological properties of the corresponding networks. Although it would be equivalent to maximise \(I(x; y)\) directly, this approach strengthens the motivation for using observability to achieve this, which can be used to further investigate and analyse the topological and stochastic properties of the corresponding network.
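
To illustrate Eq. 27, the sketch below accumulates the conditional-entropy contributions over all adjacent ordered pairs, reusing the `observability` and `naive_observability` helpers sketched earlier in this section and reading \(O(x_{1})\) in Eq. 19 as the naïve observability; this reading is an assumption, and the code is illustrative rather than the authors' implementation.

```python
import math

def conditional_entropy(O_i, O_i_to_j):
    """H(x_j | x_i) in terms of observabilities, following Eq. 19."""
    return math.exp(-(O_i_to_j + O_i)) * O_i_to_j

def network_entropy(edges, nodes):
    """H_k(G), half the sum of H(x_j | x_i) over adjacent pairs (Eq. 27)."""
    total = 0.0
    for (u, v) in edges:
        for x, y in ((u, v), (v, u)):  # both directed contributions per edge
            total += conditional_entropy(naive_observability(edges, x, nodes),
                                         observability(edges, x, y))
    return 0.5 * total
```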

4 Feature dimensionality reduction

Lemma 2 provides a useful tool to assess a feature space based on the corresponding observabilities. This is achieved by considering the (semantic) network defined above, constructed as a simplicial complex as discussed in Sect. 4.1.

4.1 Simplicial complexes via observability

Observability, despite being only a premetric, can naturally create a network structure, or more precisely a simplicial complex \(G = G(V,E)\). More specifically, we assume that a link (or edge) exists between any two nodes in V if their observability is less than a given (numerical) threshold \(\theta > 0\), which defines a filtration of the simplicial complex (Chazal and Michel 2017).

More formally, we define the edge-set E of a simplicial complex G as

$$\begin{aligned} E = \{ e(v_{i}, v_{j}): O(v_{i} \rightarrow v_{j}) \le \theta \text{ or } O(v_{j} \rightarrow v_{i}) \le \theta \}. \end{aligned}$$
(32)

Note that (32) implies that G is an undirected network. Furthermore, as discussed in Sect. 1, the nodes of G are \(n-\)dimensional so that they can be embedded in an \(n-\) dimensional space to satisfy the homology construction as per Eq. 33.
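
As a sketch, the edge set of Eq. 32 can be built from any observability oracle \(O\), such as the `observability` helper sketched in Sect. 3; the threshold \(\theta\) then acts as the filtration parameter, and varying it yields the nested family of complexes required by the filtration.

```python
def build_edge_set(nodes, O, theta):
    """Eq. 32: connect v_i and v_j whenever either directed observability
    does not exceed the filtration threshold theta (undirected edges)."""
    return {(v_i, v_j)
            for a, v_i in enumerate(nodes)
            for v_j in nodes[a + 1:]
            if min(O(v_i, v_j), O(v_j, v_i)) <= theta}
```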

We can define the usual simplicial chain complex

$$\begin{aligned} \cdots \xrightarrow {\partial _{m+1}} C_{m}(G) \xrightarrow {\partial _{m}} C_{m-1}(G) \xrightarrow {\partial _{m-1}} \cdots \xrightarrow {\partial _{1}} C_{0}(G) \longrightarrow 0 \end{aligned}$$
(33)

where \(\partial _{m}\circ \partial _{m+1} = 0\). The \(m-\)th homology group is defined as \(H_{m}(G):= \ker {(\partial _{m} )}/ {{\,\textrm{Im}\,}}{(\partial _{m+1} )}\).

The assessment of the \(m-\)dimensional homology groups will determine, loosely speaking, the number of \(m-\)dimensional holes in G, which can be associated with the corresponding Betti numbers \(b_{k}\). Recall that a contractible space has trivial reduced homology groups (Hatcher 2002). Based on Alexander and Bishop (1990), and the fact that G is typically a compact space, a non-contractible space is not necessarily a uniquely geodesic space. Therefore, G might not be a uniquely geodesic space, and so there are some points (associated with nodes in the simplicial complex) which cannot be connected via a shortest path (from an observability point of view). This implies that mutual information cannot, in general, be maximised for all nodes in V. Therefore, by iteratively removing the appropriate features associated with the nodes in the node-set V, the feature space can be made contractible. The above facts lead to the following result.

Proposition 1

Let \(G = G(V,E)\) be a simplicial complex as defined in Sect. 4.1. Its overall mutual information can be maximised by removing those feature dimensions related to the nodes in V that do not yield trivial reduced homology groups.

The above proposition can be written as the following algorithm (a hedged code sketch follows the steps below):

  1. Step 1:

    let \(G = G(V,E)\) be the simplicial complex as per Proposition 1.

  2. Step 2:

    let \(p:G_{n} \rightarrow G_{m}^{k}\) be the projection from the \(n-\)dimensional space \(G_{n}\) onto the \(m-\)dimensional space \(G_{m}^{k}\), obtained by removing k features so that \(n = m+k\).

  3. Step 3:

    start from \(k=0\)

  4. Step 4:

    if \(b_{i} \not = 0\) for some \(i > 0\) in \(G_{m}^{k}\), then \(k=k+1\) and repeat this step

  5. Step 5:

    else, stop.
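
A possible rendering of these steps in code is sketched below. It assumes the `gudhi` library for the homology computation and, as a simplification, builds a Vietoris–Rips complex from the Euclidean distances between the projected node embeddings rather than from observability; it also drops the trailing k coordinates instead of choosing which features to remove. It is therefore illustrative only.

```python
import numpy as np
import gudhi  # assumed available; any persistent-homology library would do

def reduce_features(points: np.ndarray, theta: float, max_dim: int = 2) -> int:
    """Sketch of Steps 1-5: drop feature dimensions until the complex has
    trivial reduced homology (b_0 = 1 and higher Betti numbers zero);
    returns the retained dimension m = n - k."""
    n = points.shape[1]
    for k in range(n):                    # Steps 3-4, incrementing k
        projected = points[:, : n - k]    # Step 2: the projection p
        rips = gudhi.RipsComplex(points=projected, max_edge_length=theta)
        st = rips.create_simplex_tree(max_dimension=max_dim)
        st.persistence()                  # required before Betti numbers
        betti = st.betti_numbers()
        if betti[:1] == [1] and not any(betti[1:]):
            return n - k                  # Step 5: stop
    return 0
```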

4.2 Evaluation

As part of this work, an evaluation of the above approach was carried out. The datasets introduced in Trovati et al. (2020) and Trovati et al. (2022), based on ConceptNet (Liu and Singh 2004), were used to create a selection of semantic networks (each with approximately 150 nodes) for different values of \(\theta\). Each node had 10 features.

Table 1 reports the main results for different values of \(\theta\), each of which generates a different edge set, across 10 different reductions.

Table 1 Details of the evaluation

The data used in this evaluation, which was created by the authors as part of ongoing research (Trovati et al. 2022), is limited due to the extensive manual labelling required. However, more datasets are being designed to provide more comprehensive data-driven evaluations for future research and implementations.

5 Conclusions

In this article, a method to identify the most likely feature dimensions to reduce, based on the homological properties of observability, has been introduced. Despite the limited experimental evaluation, the potential of this approach is clear, from both theoretical and implementation perspectives. In fact, the concept of observability suggests important research directions in which dynamical and topological properties should be integrated to provide a more comprehensive understanding of the evolving nature of data. In particular, future steps include a deeper assessment of different network topology properties, which can provide further insights.