Introduction

Deep learning has proven to be highly effective in a wide variety of tasks, including speech recognition [1], natural language processing [2], and computer vision [3]. It provides a powerful and robust framework for extracting useful features from data with an underlying Euclidean structure. However, a vast amount of real-world data, such as social interactions [4], bio-molecules [5], and journal citations [6], is naturally modelled by graphs. Extending convolutional networks to graph data is therefore crucial and interesting for improving graph-based tasks. Several efforts to apply convolutions to graph data have recently been made [7, 8].

A graph is a non-linear data structure. Many real-world systems can be abstracted into graph form, including communications infrastructure, process states, social networks, and subway networks. To exploit the rich knowledge in graph-structured data, it is critical to learn appropriate graph representations from the node and edge attributes and the topological structure of the graph. Motivated by the effectiveness of deep learning models on grid-structured data, a slew of graph neural network [9] models has been proposed. Graph convolution [10], a key component of graph neural networks, aggregates and regenerates feature vectors from the local neighbourhood. Graph neural networks learn node representations in a low-dimensional subspace, in which neighbouring nodes of the graph have similar representations, by combining graph structural features with node attributes via Laplacian smoothing [11]. Researchers have also proposed graph-level pooling schemes [12] that compact the node representations into a global feature vector to learn a representation of the whole graph. In many downstream graph processing tasks, for example graph classification [14] and node classification [13], the graph or node representations obtained through graph neural networks have achieved state-of-the-art performance.

Node classification [6, 13] is a crucial task on graph data that primarily determines node classes based on node features and graph topology. GCNs are frequently used to learn and recognise the representation of each node. Because of the large quantity of graph data and the high labelling cost, many node classification settings are semi-supervised. For instance, in a citation network, where nodes represent publications and edges depict citation connections, semi-supervised node classification aims to decide the label of each publication based on a limited amount of labelled data [15]. This paper proposes a semi-supervised node classification method with feature learning.

Numerous feature learning approaches have been proposed to acquire a low-dimensional representation of high-dimensional data and overcome the curse of dimensionality. Most of them can be trained with limited data and use closed-form solutions or convex optimisation as their primary learning strategies. One supervised feature learning technique built on a graph embedding framework is marginal Fisher analysis (MFA) [16]. It utilises a penalty graph for inter-class separability and an intrinsic graph for intra-class compactness. The optimal MFA solution can be found by generalised eigenvalue decomposition.

This paper proposes a novel deep learning solution built on feature learning to address problems of existing DL algorithms while also incorporating the benefits of feature learning. To initialise this deep architecture, two feature learning layers, MFA and KPCA [17], are used instead of random initialisation. First, a non-linear weight matrix is created to project the input data into a high-dimensional space, thus increasing the architecture's expressive power. The MFA and KPCA layers are then used to learn lower-dimensional representations of the data layer by layer (LbL). Lastly, the final feature layer is attached to a softmax layer. The proposed work is compared with deep learning models on graph datasets (including PubMed, Citeseer, and Cora) and other datasets. Results demonstrate that DLM-SSC outperforms other deep learning models.

The significant contributions of this paper are as follows:

  • A novel deep architecture called DLM is proposed for semi-supervised node classification. The first hidden layer of DLM has twice as many neurons as the input layer. Subsequent feature learning layers are then used to learn low-dimensional representations of the input data. Finally, nodes are classified using a multiclass classifier.

  • Two feature learning methods, MMFA and KPCA, are used to reduce the number of parameters. Three hidden layers are added to DLM based on these learned features.

  • On semi-supervised classification tasks, comprehensive experiments on citation and publication datasets (Cora, Pubmed, Citeseer) show that DLM-SSC performs better than benchmark techniques in terms of accuracy.

The rest of the paper is organised as follows: the next section reviews related work on deep learning techniques, node classification, and feature learning in a brief literature review. The key preliminary methods are discussed in the following section. The subsequent section then details the proposed strategy. The following section defines the evaluation metrics and introduces the experimental design. After that, the experimental outcomes are evaluated, and finally, the last section concludes the paper.

Related works

This section reviews previous research relevant to this work, covering node classification, feature learning, and deep learning architectures.

Node classification

Gong and Ai [13] propose an adaptive graph convolutional neural network to efficiently learn representations of individual nodes for node classification tasks. Their work develops a neighbourhood-adaptive kernel, a convolutional kernel abstracted from a diffusion process, to acquire and combine relevant neighbourhood node information for each node in a targeted manner. N-GCN is proposed by Abu-El-Haija et al. [18] for semi-supervised node classification using multi-scale graph convolution. By training individual instances of GCNs over node pairs discovered at different distances in random walks, it learns a combination of the instance outputs that optimises the classification objective. The DEMO-Net architecture is proposed by Wu et al. [19]. It is a generalised graph neural network model built on the premise that nodes with the same degree value share the same graph convolution. A degree-specific multi-task graph convolution function is provided to help learn the node representations.

Liu et al. [20] suggest a higher-order GCN with multi-scale community clustering for semi-supervised node classification. Its two primary components are MNPooling and high-order convolution. MNPooling provides three knowledge aggregation approaches for integrating data from multiple neighbourhoods while preserving the graph topology, while weight sharing in the high-order convolution restricts the number of parameters. [15] introduced another GCN for semi-supervised learning on graphs. The system consists of SEGCN layers, coarsening layers, pooling layers, a differentiable capsule layer, and a softmax classifier.

Li and Pi [21] propose a deep neural network (DNN) technique for node classification, called DNNNC, in the context of DL. It first creates a positive pointwise mutual information matrix from the adjacency matrix. This information is fed into a DNN of deep stacked sparse auto-encoders with a softmax layer, which is well suited to node classification and can attain a node representation that encodes rich non-linear semantic and structural information. Molokwu et al. [22] proposed a new approach for processing and retrieving valuable data from OSN systems to support node classification and community detection tasks. It uses an edge sampling technique to exploit social graph features by studying each actor's context with respect to neighbouring nodes, creating vector-space embeddings for each actor. Madhawa et al. [23] examined the use of an adaptive learning algorithm to increase node classification performance on attributed graphs.

Li et al. [43] present a novel semi-supervised learning technique combining dynamic graph learning and self-paced learning. The graph learning approach proposed by Kang et al. [44] preserves both the local and global structure of the data. The global structure is captured using the self-expressiveness of samples, while the local structure is respected using an adaptive neighbour approach.

Feature learning

In feature learning models, dimensionality reduction is crucial because it facilitates tasks such as visualising high-dimensional data and mitigating the curse of dimensionality. There are three types of learning: supervised (models are trained on labelled data), unsupervised (trained on unlabelled data), and semi-supervised (combining labelled and unlabelled data). For semi-supervised learning, Liu et al. [29] suggest a novel hierarchical, streamlined graph-based coarse feature extraction method. This method combines local structure learning, sparse approximation, and label propagation to improve dimensionality reduction. A PCA formulation that minimises the least-squares reconstruction error is regularised by graph embedding, which combines different local manifold embedding approaches into a generalised scheme to preserve both global and local low-dimensional subspaces [30]. Unlike the standard PCA method, this regularised least-squares solution accounts for the data distribution and the instance penalty at each data point. Gou et al. [31] suggest a graph-based dimensionality reduction technique called discriminative globality- and locality-preserving graph embedding, which designs graph constructions that preserve both locality and globality. Bidirectional edge weights are newly defined in the constructed graphs, taking into account the statistical distributions of the endpoints of each edge together with a class bias. Rajabzadeh et al. [32] suggest a novel supervised dimensionality reduction approach that optimises a newly developed and efficient objective function to learn a transformation for each class.

Compared with a single transformation, this strategy captures more discriminative information from each group of data. Many feature learning models offer an efficient solution for dimensionality reduction applications, but they can underperform on large, complicated problems. The advantages of deep architectures and feature learning are therefore combined to offer a new node classification approach for the DL algorithm.

Deep learning architecture

There are a few works on deep architecture-based feature learning models. To address the scene recognition problem, Yuan et al. [24] presented an enhanced multi-layer learning technique. This model learns all visual identification features in an unsupervised way. Kejani et al. [25] present a model that improves GCN label propagation. It is made up of two terms, supervised and unsupervised. The supervised term enforces agreement between the predicted and the known labels. The unsupervised term requires that the predicted labels of all data samples be smooth. Zhu et al. [26] suggest a deep, flexible graph-based embedding architecture that digs deep into the structural details of the data. The multiple geometrical structures of the data are incorporated into this deep architecture.

Trigeorgis et al. [27] suggested a deep semi-non-negative matrix factorisation that can acquire hidden representations from the unknown attributes of a dataset. This model is trained to produce low-dimensional representations that are better suited to clustering. Ngiam et al. [28] created an unsupervised paradigm for learning feature representations across various modalities. They argued that multi-modality representation learning is superior to single-modality representation learning, citing video and audio datasets as evidence. Niloy et al. [45] suggest a deep-dive approach for improving classification performance.

Preliminaries

Node classification

When data is represented as a graph, node classification plays a vital role in learning problems. A graph G consists of a set of nodes V and a set of edges E connecting them. The edges of a graph can also be directed. Node classification is commonly used in practical applications, for example recommendation systems [33], applied chemistry [34], and social network analysis [35]. In a node classification problem, the attributed graph G = (V, E) with N nodes is provided as an adjacency matrix AdjMat \(\in \) RN × N and a node attribute matrix AttMat \(\in \) RN × F, where F denotes the number of attributes. Each entry aij of the adjacency matrix indicates the edge weight between nodes i and j. The AdjMat is characterised as

$$ a_{ij} = \left\{ {\begin{array}{*{20}l} {1\quad {\text{if}} \left( {v_{i} ,v_{j} } \right) \in E} \\ {0\quad {\text{Otherwise}}} \\ \end{array} } \right.. $$
(1)

If the graph is undirected, the adjacency matrix A is symmetric. The degree matrix is the diagonal matrix defined as \(D = \left\{ {d_{1} ,d_{2} , \ldots ,d_{N} } \right\},\) where the diagonal element di is the row sum of the adjacency matrix such that

$$ d_{i} = \mathop \sum \limits_{j = 1}^{N} a_{ij} . $$
(2)

Each node vi has a real-valued feature vector xi ∈ RF (the ith row of AttMat), and vi belongs to one of C class labels.
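
To make the notation concrete, the following minimal NumPy sketch (illustrative helper names, not from the paper) builds the adjacency matrix of Eq. (1) and the degree matrix of Eq. (2) from an edge list of an undirected graph.

```python
import numpy as np

def build_adjacency(num_nodes, edges):
    """Adjacency matrix of Eq. (1): a_ij = 1 if (v_i, v_j) is in E, else 0."""
    A = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        A[i, j] = 1.0
        A[j, i] = 1.0  # undirected graph, so the matrix is symmetric
    return A

def degree_matrix(A):
    """Degree matrix of Eq. (2): diagonal of the row sums of the adjacency matrix."""
    return np.diag(A.sum(axis=1))

# toy example: 4 nodes connected in a chain
A = build_adjacency(4, [(0, 1), (1, 2), (2, 3)])
D = degree_matrix(A)
```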

Graph neural networks (GNN)

Graph neural networks are a class of neural networks designed to be trained on attributed graphs. GNN techniques [11] attain state-of-the-art performance on the node classification problem, a noteworthy improvement over previously used embedding algorithms [36]. The ability of GNNs to model structural information and node attributes together sets them apart from previous approaches. In essence, all GNN models have a message passing mechanism that spreads a node's feature information to its neighbours. To map attributes into a different attribute space, most GNN architectures employ a learnable parameter matrix. In most cases, two or three of these layers are combined with a non-linear function.

A GCN includes an input layer, a final perceptron layer, and numerous hidden layers. Given the adjacency matrix AdjMat and the input feature matrix X(0) = X, GCN performs the following layer-wise propagation in its hidden layers:

$$ X^{k + 1} = \sigma \left( {D^{{ - \frac{1}{2}}} {\text{Adj}}_{{{\text{Mat}}}} D^{{ - \frac{1}{2}}} X^{k} W^{k} } \right). $$
(3)

Here D = diag(d1, d2, …, dn) is a diagonal matrix with \(d_{i} = \sum\nolimits_{j = 1}^{n} {a_{ij} }\); for k = 0, 1, …, K − 1, \(W^{k} \in R^{{d_{k} \times d_{k + 1} }}\) (with d0 = p) is the layer-specific weight matrix to be learned.

σ(.) denotes the activation function, for example ReLU(.) = max(0, .), and \(X^{k + 1} \in R^{{n \times d_{k + 1} }}\) is the matrix of activations at the (k + 1)-th layer. The final layer for semi-supervised classification [37] is defined as

$$ Z = {\text{softmax}}\left( {D^{{ - \frac{1}{2}}} {\text{Adj}}_{{{\text{Mat}}}} D^{{ - \frac{1}{2}}} X^{\left( k \right)} W^{\left( k \right)} } \right), $$
(4)

where \(W^{\left( k \right)} \in R^{{d_{K} \times o}}\) and o indicates the number of classes. The result \(Z \in R^{n \times o}\) gives the label predictions for the entire input X, where each row Zi is the label prediction for the ith node.
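
As an illustration of Eqs. (3) and (4), the sketch below implements the layer-wise propagation with symmetric normalisation in NumPy; the helper names are assumptions, and a practical GCN would typically add self-loops and train the weights by back-propagation rather than take them as given.

```python
import numpy as np

def normalise_adjacency(A):
    """Symmetric normalisation D^{-1/2} A D^{-1/2} used in Eqs. (3) and (4)."""
    d = A.sum(axis=1).astype(float)
    d_inv_sqrt = np.zeros_like(d)
    nz = d > 0
    d_inv_sqrt[nz] = d[nz] ** -0.5
    return np.diag(d_inv_sqrt) @ A @ np.diag(d_inv_sqrt)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(A, X, weights):
    """Hidden layers follow Eq. (3); the last layer follows Eq. (4)."""
    A_hat = normalise_adjacency(A)
    H = X
    for W in weights[:-1]:
        H = relu(A_hat @ H @ W)                 # X^{k+1} = sigma(D^-1/2 A D^-1/2 X^k W^k)
    return softmax(A_hat @ H @ weights[-1])     # Z = softmax(D^-1/2 A D^-1/2 X^K W^K)
```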

Marginal Fisher’s analysis (MFA)

Yan et al. [38] showed that many dimensionality reduction techniques can be merged into a graph embedding framework. By preserving the geometric graph structure from the input space to the feature space, the dimensionality reduction methods in this scheme produce low-dimensional features. The supervised technique known as MFA was presented as an instance of this framework. Marginal Fisher analysis (MFA) is a graph-embedding-based manifold learning algorithm. The key concept behind MFA is to create two graphs based on sample neighbourhood connections, and then to define a criterion for intra-class sample compactness and inter-class sample separability based on the two graphs.

The intra-class point adjacency relation is represented by the intrinsic graph Gc, in which every instance is linked to its k1 nearest neighbours of the same class. Marginal point pairs of different classes are connected in a penalty graph Gp, which depicts the inter-class marginal point adjacency relation. Let \(N_{{k_{1} }} \left( {x_{i} } \right) = \left\{ {x_{i}^{1} ,x_{i}^{2} , \ldots ,x_{i}^{{k_{1} }} } \right\}\) be the set of the k1 nearest same-class neighbours of xi. The weight matrix is defined as

$$ W_{c,ij} = \left\{ {\begin{array}{*{20}l} {1\quad {\text{if}} x_{i} \in N_{{k_{1} }} \left( {x_{j} } \right)\quad {\text{or}}\quad x_{j} \in N_{{k_{1} }} \left( {x_{i} } \right)} \\ {0\quad {\text{Otherwise}}} \\ \end{array} } \right.. $$
(5)

The intra-class compactness is then defined as the sum of distances between each node and its k1 nearest neighbours belonging to the same class:

$$ \begin{gathered} \mathop \sum \limits_{ij} \left\| {y_{i} - y_{j} } \right\|^{2} W_{c,ij} \hfill \\ = 2{\text{Tr}}\left( {V^{T} X\left( {D_{c} - W_{c} } \right)X^{T} V} \right) \hfill \\ = 2{\text{Tr}}\left( {V^{T} XL_{c} X^{T} V} \right), \hfill \\ \end{gathered} $$
(6)

where Dc is the diagonal matrix with \(D_{c,ii} = \sum\nolimits_{j} {W_{c,ij} } \), and \(L_{c} = D_{c} - W_{c}\) is the Laplacian matrix.

For each pair of points (xi, xj) from different classes, an edge is attached between xi and xj if xj is one of xi's k2 nearest neighbours whose class label differs from the class label of xi.

Let \(N_{{k_{2} }} \left( {x_{i} } \right) = \left\{ {x_{i}^{1} ,x_{i}^{2} , \ldots ,x_{i}^{{k_{2} }} } \right\}\) be the set of the k2 nearest different-class neighbours of xi. The weight matrix is defined as

$$ W_{p,ij} = \left\{ {\begin{array}{*{20}l} {1\quad {\text{if}} x_{i} \in N_{{k_{2} }} \left( {x_{j} } \right)\quad {\text{or}} \quad x_{j} \in N_{{k_{2} }} \left( {x_{i} } \right)} \\ {0\quad {\text{otherwise}}} \\ \end{array} } \right.. $$
(7)

The inter-class separability is characterised by the penalty graph with the term

$$ \begin{aligned} &\mathop \sum \limits_{ij} \left\| {y_{i} - y_{j} } \right\|^{2} W_{p,ij} \hfill \\ &\quad= 2{\text{Tr}}\left( {V^{T} X\left( {D_{p} - W_{p} } \right)X^{T} V} \right) \hfill \\ &\quad= 2{\text{Tr}}\left( {V^{T} XL_{p} X^{T} V} \right), \hfill \\ \end{aligned} $$
(8)

where Dp is a diagonal matrix with \(D_{p,ii} = \sum\nolimits_{j} {W_{p,ij} } , L_{p} = D_{p} - W_{p}\) is the Laplacian matrix.

The marginal Fisher criterion is defined as follows:

$$ \arg \mathop {\min }\limits_{v} \frac{{V^{T} XL_{c} X^{T} V}}{{V^{T} XL_{p} X^{T} V}}. $$
(9)
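
As a concrete illustration, the following sketch builds the intrinsic and penalty weight matrices in the spirit of Eqs. (5) and (7) and solves the criterion of Eq. (9) as a generalised eigenvalue problem. The helper names are hypothetical, the penalty graph is approximated here with per-point cross-class neighbours rather than the globally shortest marginal pairs, and a small ridge term is added for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def knn_class_graph(X, labels, k, same_class=True):
    """W_ij = 1 if x_j is among the k nearest same-class (intrinsic, Eq. 5) or
    different-class (penalty, Eq. 7) neighbours of x_i; symmetrised (or-rule)."""
    labels = np.asarray(labels)
    dist = cdist(X, X)
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        mask = (labels == labels[i]) if same_class else (labels != labels[i])
        mask[i] = False
        cand = np.flatnonzero(mask)
        nearest = cand[np.argsort(dist[i, cand])[:k]]
        W[i, nearest] = 1.0
    return np.maximum(W, W.T)

def mfa_projection(X, labels, k1=5, k2=5, dim=2):
    """Solve the criterion of Eq. (9) as a generalised eigenvalue problem."""
    Wc = knn_class_graph(X, labels, k1, same_class=True)    # intrinsic graph
    Wp = knn_class_graph(X, labels, k2, same_class=False)   # penalty graph
    Lc = np.diag(Wc.sum(axis=1)) - Wc                       # Laplacian of Eq. (6)
    Lp = np.diag(Wp.sum(axis=1)) - Wp                       # Laplacian of Eq. (8)
    Sc = X.T @ Lc @ X                                       # numerator term of Eq. (9)
    Sp = X.T @ Lp @ X                                       # denominator term of Eq. (9)
    # the smallest generalised eigenvectors of Sc v = lambda Sp v minimise Eq. (9)
    vals, vecs = eigh(Sc, Sp + 1e-6 * np.eye(Sp.shape[0]))
    return vecs[:, :dim]                                    # columns of the projection V
```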

Despite the effectiveness of MFA in a variety of fields, some issues remain unsolved.

  • In MFA, a problem emerges from the fact that the number of training samples is often much smaller than the dimension of each sample, a flaw known as the singularity or small sample size problem.

  • MFA is a supervised learning approach that requires labelled data to ensure successful generalisation on test samples. However, while it is easy to obtain many face images for real-world face recognition, only some of them are manually labelled. Since there is not enough labelled data, strictly supervised MFA cannot be well trained in this case.

  • Since MFA is still a linear approach in nature, it is insufficient to capture the difficulty of real face images under variations in lighting and pose. The most widely used kernels are data-independent kernels, which may not be compatible with the intrinsic manifold structure uncovered by unlabelled data.

Our approach

This section illustrates the proposed strategy in detail. First, the intended deep learning architecture is described. The two feature learning modules are then explained, and finally the multiclass classifier is presented. Table 1 gives the symbols and notations used in this work.

Table 1 Notations and symbols

Let graph G, with n nodes and m edges, have an attribute matrix \(X \in {\mathbb{R}}^{n \times f}\) with f features per node, and training labels Y annotating a partial set of nodes with the c possible classes. Let A denote the adjacency matrix of G, where a non-zero entry Aij specifies the edge between nodes i and j. The weight matrix W is defined as

$$ w_{ij} = \left\{ {\begin{array}{*{20}l} {1 \times f{\text{dist}}\left( {v_{i} ,v_{j} } \right)\quad {\text{if}} \left( {v_{i} ,v_{j} } \right) \in E} \\ {0\quad {\text{Otherwise}}} \\ \end{array} } \right.. $$
(10)

Here fdist(vi, vj) denotes the feature distance between vi and vj. If the feature vectors are binary, the Hamming distance is used; otherwise, the Euclidean distance is used.

$$ f{\text{dist}}\left( {v_{i} ,v_{j} } \right) = \left\{ {\begin{array}{*{20}l} {v_{i} \oplus v_{j} , \quad{\text{if}}\; X \in \left\{ {0,1} \right\}^{n \times n} } \\ {\sqrt {\mathop \sum \limits_{i = 1}^{n} \left( {q_{i} - p_{i} } \right)^{2} } ,\quad {\text{otherwise}}} \\ \end{array} } \right.. $$
(11)
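
A minimal sketch of the edge weighting in Eqs. (10) and (11), interpreting the weight of an existing edge as the feature distance itself, with the Hamming distance for binary features and the Euclidean distance otherwise; the function names are illustrative assumptions.

```python
import numpy as np

def fdist(xi, xj, binary):
    """Eq. (11): Hamming distance for binary features, Euclidean distance otherwise."""
    if binary:
        return float(np.count_nonzero(xi.astype(bool) ^ xj.astype(bool)))
    return float(np.sqrt(np.sum((xi - xj) ** 2)))

def weighted_adjacency(A, X):
    """Eq. (10): w_ij = fdist(v_i, v_j) on existing edges, 0 elsewhere."""
    binary = np.array_equal(X, X.astype(bool).astype(X.dtype))  # 0/1 feature matrix?
    W = np.zeros_like(A, dtype=float)
    for i, j in zip(*np.nonzero(A)):
        W[i, j] = fdist(X[i], X[j], binary)
    return W
```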

Deep architecture model for node classification

This section discusses the overall deep architecture of the proposed work. It can be viewed as a general paradigm for constructing data representations. The framework uses feature learning modules with different output dimensions to train the data representations at each layer. The mappings between successive layers are obtained by optimising feature learning models layer by layer. The proposed deep architecture is depicted in Fig. 1 as a high-level outline. In this diagram, three hidden layers (H1, H2 and H3) are placed between the input and output layers. The first layer is built using a random matrix Wrm.

Fig. 1
figure 1

Proposed architecture

The new representation of the input can be written as

$$ hv_{1} = \rho \left( {A \cdot W_{rm}^{T} \cdot X} \right), $$
(12)

where ρ(.) is a non-linear activation function. Subsequently, feature learning models are used to initialise the following layers. The outputs of the subsequent hidden layers are

$$ hv_{t} = \rho \left( {A \cdot W_{{F_{t - 1} }} \cdot hv_{t - 1} } \right). $$
(13)

The weight matrices of the second and third layers in Fig. 1, obtained from the feature learning models, are WMMFA and WKPCA. Finally, a softmax multiclass classifier is used as the last layer for the labelling task. In the first hidden layer, a high-dimensional representation of the input data is obtained. Lower-dimensional embeddings are then progressively learned using the two feature learning models. The feature learning modules initialise the weight matrices of the hidden layers, except for the matrix of the first hidden layer, resulting in better performance than other DL methods that use random matrices.
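
The forward pass implied by Eqs. (12) and (13) can be sketched as follows, assuming the usual propagation order A·H·W; W_rm is random, while W_mmfa and W_kpca are assumed to be supplied by the MMFA and KPCA modules described in the next subsections.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def dlm_forward(A_hat, X, W_rm, W_mmfa, W_kpca, W_out):
    """Sketch of the DLM forward pass following Eqs. (12)-(13).

    A_hat  : (normalised) adjacency matrix, n x n
    X      : node feature matrix, n x f
    W_rm   : random matrix of the first hidden layer (f x 2f, per the text)
    W_mmfa : weight matrix initialised by MMFA
    W_kpca : weight matrix initialised by KPCA
    W_out  : weight matrix of the final softmax layer
    """
    h1 = relu(A_hat @ X @ W_rm)         # Eq. (12): project the input to a higher dimension
    h2 = relu(A_hat @ h1 @ W_mmfa)      # Eq. (13): MMFA-initialised hidden layer
    h3 = relu(A_hat @ h2 @ W_kpca)      # Eq. (13): KPCA-initialised hidden layer
    return softmax(A_hat @ h3 @ W_out)  # multiclass classification layer
```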

Modified MFA

This section describes how to build deep learning models using modified marginal Fisher analysis (MMFA). There are several advantages to using MMFA. MMFA is better suited to feature learning than many conventional methods, such as LDA, since it makes no assumptions about the data distribution of each class. Furthermore, the margins between classes accurately describe the classes' separability.

To improve MFA, this paper considers using the most discriminative attributes from both unlabelled and labelled instances, and treats unlabelled and labelled data separately while constructing a fitness function that maximises the inter-class margin and minimises the intra-class variance.

The between-class and within-class graphs of MFA are constructed as follows:

Between-class: W′ij = W′ji = 1 if (xi, xj) is among the k2 shortest pairs in the set \(\left\{ {\left( {x_{i} ,x_{j} } \right) | x_{i} \in X_{c} , x_{j} \notin X_{c} } \right\}\).

Within-class: Wij = Wji = 1 if xj is among the k1 nearest neighbours of xi in the same class.

Here W and W′ are similarity matrices that indicate within-class similarity and between-class dissimilarity, respectively. MFA minimises within-class distances while simultaneously maximising inter-class separability in the low-dimensional space.

The proposed modified marginal Fisher analysis proceeds as follows.

First, the nearest neighbour graph is created. Find the k nearest neighbours of data point xi and construct edges between xi and its neighbours. The set of its k nearest neighbours is denoted by \(N\left( {x_{i} } \right) = \left\{ {x_{i}^{1} ,x_{i}^{2} , \ldots ,x_{i}^{k} } \right\}\). The adjacency matrix can be written as follows:

$$ A_{ij} = \left\{ {\begin{array}{*{20}l} {\exp \left( { - \frac{{\left\| {x_{i} - x_{j} } \right\|^{2} }}{{\sigma^{2} }}} \right)\quad{\text{if}}\; x_{i} \in N\left( {x_{j} } \right)\quad {\text{or}}\quad x_{j} \in N\left( {x_{i} } \right)} \\ {0\quad {\text{otherwise}}} \\ \end{array} } \right.. $$
(14)

The within-class and between-class weight matrices are defined by

$$ W_{ij} = \left\{ {\begin{array}{*{20}l} {\alpha A_{ij} ,\quad {\text{if}} c_{i} = c_{j} } \\ {0,\quad {\text{Otherwise}}} \\ \end{array} } \right., $$
(15)
$$ W_{ij}^{^{\prime}} = \left\{ {\begin{array}{*{20}l} {\beta A_{ij} ,\quad {\text{if}} c_{i} \ne c_{j} } \\ {0,\quad {\text{otherwise}}} \\ \end{array} } \right. . $$
(16)

Here, α and β are adjustable parameters with α + β = 1.

The within-class and between-class diagonal matrices are defined by

$$ D_{ii} = \mathop \sum \limits_{j} W_{ij} , $$
(17)
$$ D_{ii}^{^{\prime}} = \mathop \sum \limits_{j} W_{ij}^{^{\prime}} . $$
(18)

The within-class similarity S and between-class separability S′ are defined as follows:

$$ S = 2{\text{Tr}}\left( {A^{T} X\left( {D - W} \right)X^{T} A} \right), $$
(19)
$$ S^{^{\prime}} = 2{\text{Tr}}\left( {A^{T} X\left( {D^{\prime} - W^{\prime}} \right)X^{T} A} \right) . $$
(20)

The objective function of MMFA is defined as

$$ W_{{{\text{MMFA}}}} = \arg \mathop {\max }\limits_{A} \frac{{Tr\left( {A^{T} X\left( {D^{\prime} - W^{\prime}} \right)X^{T} A} \right)}}{{Tr\left( {A^{T} X\left( {D - W} \right)X^{T} A} \right)}}. $$
(21)
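
A minimal sketch of the MMFA construction of Eqs. (14)–(21), assuming a heat-kernel k-nearest-neighbour adjacency and solving the trace-ratio objective as a generalised eigenvalue problem; the function names, the ridge term, and the eigen-solver choice are assumptions rather than the paper's implementation.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def heat_kernel_knn(X, k, sigma):
    """Eq. (14): A_ij = exp(-||x_i - x_j||^2 / sigma^2) for k-nearest-neighbour pairs."""
    dist = cdist(X, X)
    n = X.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(dist[i])[1:k + 1]            # skip the point itself
        A[i, nn] = np.exp(-dist[i, nn] ** 2 / sigma ** 2)
    return np.maximum(A, A.T)                        # or-rule of Eq. (14)

def mmfa_projection(X, labels, k=5, sigma=1.0, alpha=0.5, dim=2):
    labels = np.asarray(labels)
    A = heat_kernel_knn(X, k, sigma)
    same = labels[:, None] == labels[None, :]
    W = alpha * A * same                             # Eq. (15): within-class weights
    W_p = (1.0 - alpha) * A * ~same                  # Eq. (16): between-class (beta = 1 - alpha)
    L = np.diag(W.sum(axis=1)) - W                   # Eqs. (17), (19)
    L_p = np.diag(W_p.sum(axis=1)) - W_p             # Eqs. (18), (20)
    S_w = X.T @ L @ X
    S_b = X.T @ L_p @ X
    # Eq. (21): maximise Tr(A^T S_b A) / Tr(A^T S_w A) via the largest generalised eigenvectors
    vals, vecs = eigh(S_b, S_w + 1e-6 * np.eye(S_w.shape[0]))
    return vecs[:, -dim:]                            # W_MMFA projection matrix
```

The returned projection matrix can then serve as the MMFA-initialised weight matrix of the second hidden layer in the architecture above.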

Kernel PCA

Given the data Y = [y1, y2, …, yn], PCA finds the linear subspace of dimension d such that the entire data lies on or close to it in Euclidean distance. PCA solves

$$ \mathop {\min }\limits_{{U_{d} ,\left\{ {x_{i} } \right\}}} \mathop \sum \limits_{i = 1}^{N} \left\| {y_{i} - U_{d} x_{i} } \right\|_{2}^{2} , $$
(22)
$$ U_{d}^{T} U_{d} = I, $$
(23)

where Ud is a matrix with orthonormal columns. The optimal solution is \(x_{i} = U_{d}^{T} y_{i}\).

Kernel PCA solves

$$ \mathop {\min }\limits_{X} \left\| {K_{y} - X^{T} X} \right\|_{F}^{2} , $$
(24)
$$ XX^{T} = D, $$
(25)

where Ky = YTY is the kernel matrix and D is a diagonal matrix containing the eigenvalues of Ky.

In several application settings, structural information implying or implied by dependencies can benefit the dimensionality reduction task. This knowledge can be encoded in a graph and embodied in X via graph regularisation. Specifically, assume there is a graph G over which the data is smooth, that is, vectors {xi} corresponding to connected nodes of G are close to each other in Euclidean distance. With A denoting the adjacency matrix of G, aij = 1 if node i is linked to node j. The Laplacian of G is LG = D − A, where D is the diagonal matrix with entries \(d_{ii} = \sum\nolimits_{j} {a_{ij} }\).

The graph-regularised kernel PCA [17] can then be written as

$$ \mathop {\min }\limits_{X} - {\text{tr}}\left( {XK_{y} X^{T} } \right) + \gamma {\text{tr}}\left( {XL_{G} X^{T} } \right), $$
(26)
$$ XX^{T} = I. $$
(27)

The following steps are used to find the low-dimensional representations:

  1. Compute r(LG).

  2. Compute the largest eigenvalues and the corresponding eigenvectors of Ky − γr(LG).

  3. Collect Vd.

Here r(.) is a non-decreasing scalar function of the eigenvalues of LG. Table 2 shows examples of graph Laplacian kernels.

Table 2 Example of kernel types
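
The three steps above can be sketched as follows, assuming a linear kernel K_y = Y^T Y and using the identity function for r(·) by default; the function name and interface are illustrative assumptions.

```python
import numpy as np

def graph_regularised_kpca(Y, L_G, d, gamma=1.0, r=lambda L: L):
    """Steps 1-3 above for the objective of Eqs. (26)-(27).

    Y     : data matrix with samples as columns (f x n), so K_y = Y^T Y
    L_G   : Laplacian of the regularisation graph (n x n)
    d     : target dimensionality
    gamma : regularisation weight
    r     : non-decreasing scalar function applied to L_G (identity by default)
    """
    K_y = Y.T @ Y                     # linear kernel assumed for this sketch
    M = K_y - gamma * r(L_G)          # step 2: matrix whose top eigenpairs are kept
    vals, vecs = np.linalg.eigh(M)    # eigenvalues returned in ascending order
    V_d = vecs[:, -d:]                # step 3: eigenvectors of the d largest eigenvalues
    return V_d.T                      # rows give the low-dimensional coordinates X
```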

Implementation and data design

This section presents the implementation and experimental setup of the proposed work. Three graph-based datasets were used with a Java framework for validation and analysis of the proposed model.

Citation graphs such as CORA, PubMed, and CiteSeer [39] are widely used. Each of these undirected graphs is made up of nodes (documents) and edges (citations). When one document references another, an edge connects them. The node attributes are bag-of-words features of the textual content of the documents.

Each node in the citation datasets corresponds to a journal article. A citation from one article to another is represented by an edge between two nodes, and a label represents the subject of the article. Each node in every dataset has a binary bag-of-words (BoW) feature vector derived from the abstract of the article. As a result, given the BoW of an article's abstract and its citations to other (possibly labelled) articles, the task is to predict the topic of that article. The graph datasets are summarised in Table 3.

Table 3 Graph-based data set

CiteSeer

The CiteSeer dataset is mostly made up of computer science research publications, covering topics such as machine learning, information retrieval, databases, and artificial intelligence. The CiteSeer collection contains 3312 scientific papers that are divided into six categories (Agents, AI, DB, IR, ML, HCI). There are 4732 links in the citation network. Each publication in the dataset is described by a 0/1-valued word vector indicating the presence or absence of the corresponding word from the dictionary. There are 3703 distinct terms in the dictionary.

Cora

The Cora dataset is mostly made up of articles on machine learning, covering, for example, reinforcement learning, probabilistic approaches, genetic algorithms, and neural networks. The Cora dataset contains 2708 scientific publications that are divided into seven categories (neural networks—818, probabilistic methods—426, genetic algorithms—418, theory—351, case based—298, reinforcement learning—217, rule learning—180). There are 5429 links in the citation network. Each publication in the dataset is described by a 0/1-valued word vector indicating the presence or absence of the corresponding word from the dictionary. There are 1433 distinct terms in the dictionary.

PubMed

The PubMed dataset is mostly made up of biomedical research articles. Nodes denote documents; edges indicate citation relations. The PubMed Diabetes dataset contains 19,717 scientific publications about diabetes from the PubMed database, which are divided into three categories (diabetes mellitus—experimental, type 1, type 2). There are 44,338 links in the citation network. Each publication in the dataset is described by a TF/IDF-weighted word vector from a vocabulary of 500 unique terms.

In Cora and CiteSeer, each publication is thus represented by a 0/1-valued word vector indicating the presence or absence of the corresponding term from the dictionary, while PubMed uses TF/IDF-weighted vectors.

Evaluation results

This section evaluates the efficiency of the proposed technique against several related methods on three standard citation benchmark datasets. To verify the performance of the proposed DLM-SSC, various baselines are compared, including graph embedding approaches, graph convolution approaches, and high-order graph convolution. Table 4 lists the baseline methods.

Table 4 Baseline methods

The performance of the proposed technique is analysed by classification accuracy. Table 5 presents the accuracy comparison of several baseline techniques.

Table 5 Result comparison of various methods

Figure 2 shows the accuracy comparison of various baseline methods on the three citation datasets. The proposed method gives higher classification accuracies of 85.3 and 76.5 for the Cora and CiteSeer datasets, respectively. D-SEGCN [15] has higher accuracy for the PubMed data.

Fig. 2
figure 2

Accuracy comparison

Table 6 and Fig. 3 show the classification results for different learning models. From these results, the MMFA feature learning model has higher accuracy than the KPCA feature learning model. When both the MMFA and KPCA feature learning models are combined with DLM, the model increases the classification accuracy further.

Table 6 Classification result for different learning models
Fig. 3
figure 3

Classification result for learning models

Table 7 shows the results for different numbers of hidden layers on the three citation datasets. When the number of hidden layers is increased, the accuracy decreases.

Table 7 Classification result for different no of hidden layers

Conclusion

A novel deep neural network architecture for semi-supervised node classification is presented in this article. A supervised pretraining approach is used to initialise the architecture. This approach employs two distinct feature learning methods; the features learned by MMFA and KPCA are each used as a hidden layer. These two learning algorithms efficiently learn lower-dimensional representations of the data. For the experiments, three publicly available citation datasets (Cora, Citeseer and Pubmed) are used. The results show that the proposed work achieves 85.3% accuracy for Cora, 76.5% for Citeseer and 78.7% for the Pubmed dataset. Extensive tests show that the proposed work performs better than similar approaches on small datasets. This principle could be extended to graph classification in the future, and deep active learning approaches could be applied to graph-related learning methods.