Introduction

Although Deep Learning Neural Networks (DL-NN) are efficient at classifying images, noise in either the training or the testing dataset propagates through the layers of a DL-NN or Convolutional Neural Network (CNN) and significantly deteriorates the performance of these models. The information-set deep learning (ISDL) architectures introduced here handle noisy data and are applicable to a broad range of supervised learning tasks. To develop noise-resistant deep learning models, researchers have employed a variety of approaches, including modifications to the architecture1, regularization methods2,3,4, and changes to the loss functions5. In some studies, image restoration was used to recover clear latent images from corrupted observations. In others, state-of-the-art deep neural networks were trained on images from the original, noise-free dataset and then used to classify images degraded by noise6. It has been shown that the performance of state-of-the-art DL-NN decreases when classifying low-quality images.

The concept of the information set was first introduced by Hanmandlu et al.7 based on an entropy framework, mainly aimed at addressing the limitations of fuzzy set theory8. The potential of entropy-based feature extraction approaches is underutilized compared to conventional techniques such as PCA, ICA, and LDA. Information set theory has proven highly effective in addressing uncertainty and achieving superior performance in a variety of settings. The integration of the information set with deep learning models is proposed here to leverage this capability. The ISDL architectures are general and broadly applicable to many deep-learning tasks. In classic fuzzy set theory, membership functions (MF) play the central role, while the original signals (information sources) and their potential interactions with fuzzy memberships are largely ignored3. Since the MF value only measures the extent to which an information source value belongs to the set, it cannot accurately express the overall uncertainty pertaining to all information source values. To address this limitation, the information set is formulated within an entropy framework, in which fuzzy memberships are modulated by some form of the original signal7,9,10. The Pal and Pal entropy11 has been extended to an information-theoretic entropy structure and further developed into information-set theory through subsequent improvements9,10,12,13,14,15,16.

In the past, information-set theory has been exploited and integrated into many machine-learning models to develop various features and classifiers for noisy environments. The ability of information set theory to represent probabilistic uncertainty and possibilistic certainty is described by Grover et al.17. Six features were created for the face recognition application by Sayeed et al.18 utilizing the information set, the Hanman Filter (HF), and the Hanman Transforms (HT). The HF is designed to adjust the information values using a cosine function, whereas the HT evaluates the information source values based on the information obtained from them. New features based on higher-order information sets were developed by Grover et al.19; these use fewer features per sample and have a lower time complexity than the most recent features, and it was demonstrated that they can accurately represent multispectral palmprints. In addition to feature extraction, an information processing-based fuzzy classifier has also been developed. The evolutionary method known as Human Effort For Achieving the Goal (HEFAG), which is based on the human approach to learning and does not require algorithm-specific parameters, was developed using information set theory20. Two-fold information set (TFIS) features have been developed for text-independent speaker identification and gait recognition12,16. The entropy framework creates the TFIS features, which capture spatial and temporal information components. The TFIS features are fewer in number, which reduces computational complexity and time, and they boost performance in noisy environments. Moreover, the swish activation function, recently proposed by Google Brain, exhibits improved performance in deep learning models, particularly in image classification and machine translation21. A closer inspection of both its formulation and the experimental results reveals that the swish activation function also has roots in information-set theory.

In the current work, an information-set layer (Infor-Layer) and an information-set pooling method (Infor-Pool) are proposed and integrated with prominent deep learning designs to improve their performance. Two Infor-Set supported deep learning architectures and their variants are proposed and described here: the Infor-Set Deep Feedforward Network (ISNN) and the Infor-Set Convolutional Neural Network (ISCNN). ISNN employs the Infor-Layer, which is applied both after the source and after each dense layer. The Convolutional Neural Network (CNN) is altered in ISCNN by adding the Infor-Layer and/or by replacing the Pooling layer with the Infor-Pool. The Infor-Layer is added after the input and between the standard layers to extract effective information and build deeper representations, whereas the Infor-Pool acquires localized information and reduces dimensionality. The various high-level features corresponding to the CNN layers are enhanced with the aid of the Infor-Layer and/or Infor-Pool layer. The key benefit of the proposed models is that, after being trained on clean samples, they perform well on noisy samples without any additional pre-processing. These specifically designed information-set layers and pooling methods improve the noise robustness of classic deep learning models.

The effectiveness and robustness of the two ISDL architectures and the standard models are assessed using two independent benchmark datasets that have been degraded by noise. To show how effective the proposed Infor-Layer and Infor-Pool layers are, these reformulated layers are added to classic CNN designs and the performance is compared to that of the architectures without them. The experimental results show that the proposed ISDL architectures can efficiently handle uncertainty and related issues and achieve superior performance compared to peer methods when the data are corrupted with noise of varying Peak Signal-to-Noise Ratio (PSNR).

Results

Infor-set based deep learning (ISDL)

When the input data are affected by noise, information set theory measures the quality of the contaminated attribute values in terms of possibilistic uncertainty3. Based on this interpretation, ISNN and ISCNN are proposed to boost classification performance.

Multiple evaluation criteria, including Accuracy, Precision, Recall, F1 score, and ROC-AUC, are used to assess and compare the performance of the proposed networks with that of conventional deep networks (Supplementary Information). Robustness against noise is evaluated using noise-corrupted versions of the MNIST dataset of handwritten digits and the EMNIST Balanced dataset at varying Peak Signal-to-Noise Ratio (PSNR). The EMNIST Balanced dataset is an extended and comprehensive collection of handwritten characters and digits that extends the classic MNIST dataset by incorporating a more diverse set of pattern classes. It encompasses 47 unique classes, comprising upper- and lower-case alphabetical letters and the digits 0 to 9.

The PSNR expresses the ratio between a signal's maximum possible value (power) and the power of the distorting noise that affects the quality of its representation. Mathematically, for a noise-free \(m\times n\) monochrome image \(I\) and its noisy version \(K\), it is defined as

$$PSNR = 20 \log_{10} \left( {\frac{{MAX_{I} }}{{\sqrt {MSE} }}} \right)$$
(1)

where,

$$MSE = { }\frac{1}{mn}\mathop \sum \limits_{i = 0}^{m - 1} \mathop \sum \limits_{j = 0}^{n - 1} \left[ {I\left( {i,j} \right) - K\left( {i,j} \right)} \right]^{2}$$
$$MAX_{I} = {\text{Maximum}}\;{\text{possible}}\;{\text{pixel}}\;{\text{value}}\;{\text{of}}\;{\text{the}}\;{\text{image}}$$
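As a concrete illustration of Eq. (1), a minimal NumPy sketch for computing the PSNR between a noise-free image and its noisy version is given below. The function name and the 8-bit default for \(MAX_{I}\) are illustrative assumptions, not part of the original formulation.

```python
# A minimal NumPy sketch of Eq. (1): PSNR (in dB) between a noise-free image I and
# its noisy version K. The function name and the 8-bit default for MAX_I are
# illustrative assumptions.
import numpy as np

def psnr(I: np.ndarray, K: np.ndarray, max_i: float = 255.0) -> float:
    mse = np.mean((I.astype(np.float64) - K.astype(np.float64)) ** 2)
    if mse == 0.0:                       # identical images: PSNR is unbounded
        return float("inf")
    return 20.0 * np.log10(max_i / np.sqrt(mse))
```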

Infor-set based convolutional neural network (ISCNN)

Three variants of the ISCNN architecture, namely ISCNN-I, ISCNN-II, and ISCNN-III, are proposed, as shown in Fig. 1. The proposed variants modify the CNN by introducing the Infor-Layer and/or replacing the Pooling layer with the Infor-Pool. Figure 1 demonstrates one possible set of modifications to the CNN architecture; however, the Infor-Layer and Infor-Pool can be introduced at different places in the network.

Figure 1
figure 1

Different architectures of the Information Set based Convolutional Neural Network. (a) Structure of the baseline Convolutional Neural Network. (b) ISCNN-I: first variant of CNN, in which the Max-Pool layer is replaced with the Infor-Pool layer. (c) ISCNN-II: second variant of CNN, in which the input layer is not directly connected to the Conv-Layer but is instead connected through the Infor-Layer. (d) ISCNN-III: third variant of CNN, which consists of both the Infor-Layer and the Infor-Pool layer.

In ISCNN-I, the Pooling layer is replaced with the Infor-Pool to extract localized information and reduce dimensionality. In ISCNN-II, the input features are not directly connected to the convolution block; instead, the Infor-Layer is introduced before the convolution block to extract effective information from the signal of interest. ISCNN-III is the fusion of ISCNN-I and ISCNN-II.
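As a concrete illustration (not the authors' implementation), the sketch below shows how an Infor-Layer and an Infor-Pool could be realized as custom Keras layers and slotted into a small CNN in the spirit of ISCNN-III, assuming a sigmoid membership function; the filter counts and layer arrangement are placeholders rather than the configurations of Table 1.

```python
# A minimal Keras sketch of ISCNN-III-style wiring, assuming a sigmoid membership
# function; layer sizes are illustrative only, not the paper's configurations.
import tensorflow as tf

class InforLayer(tf.keras.layers.Layer):
    """Element-wise information set: z = x * G_s(x), with a sigmoid gain."""
    def call(self, x):
        return x * tf.sigmoid(x)

class InforPool(tf.keras.layers.Layer):
    """Windowed collective information: average of x * G_s(x) over each window."""
    def __init__(self, pool_size=2, **kwargs):
        super().__init__(**kwargs)
        self.pool = tf.keras.layers.AveragePooling2D(pool_size=pool_size)
    def call(self, x):
        return self.pool(x * tf.sigmoid(x))

def build_iscnn_iii(input_shape=(28, 28, 1), n_classes=10):
    """Infor-Layer after the input and Infor-Pool instead of Max-Pool."""
    inputs = tf.keras.Input(shape=input_shape)
    x = InforLayer()(inputs)                       # effective information of the raw input
    x = tf.keras.layers.Conv2D(32, 3, activation="relu")(x)
    x = InforPool(pool_size=2)(x)                  # replaces the Max-Pool layer
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```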

The key benefit of introducing the Infor-Layer and Infor-Pool layer is that, despite noise in the input signal, the information layer helps boost the different high-level features corresponding to the CNN layers. The Structural Similarity Index (SSIM) helps to validate this statement. SSIM measures the perceptual difference between two similar images: a value of +1 indicates that the two images are very similar, while a value of 0 indicates that they are very different. Figure 2 shows the filtered output after the first convolutional layer in the standard CNN and in ISCNN-III for the standard MNIST and EMNIST datasets; both architectures were trained on clean images only. The figure also presents the SSIM values calculated between the high-level features of the clean image and the corresponding high-level features of the noisy images for the standard CNN and ISCNN-III. The standard CNN yielded an SSIM of 0.6 ± 0.09 for the MNIST dataset, while the ISCNN-III model produced a higher SSIM of 0.7 ± 0.08. For the EMNIST dataset, the standard CNN recorded an SSIM of 0.4 ± 0.19, and the ISCNN-III model again achieved a higher SSIM of 0.5 ± 0.2. This shows that the information layer introduced into the standard CNN helps to boost the high-level features. Figure 2 focuses only on the first layer of the CNN, but across the whole architecture the information layer and/or pooling introduced at every convolutional layer enhances the corresponding level of features and improves performance. The filtered output after each layer of the standard CNN and ISCNN-III is shown in Figure S4 and Figure S5.
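The SSIM comparison above can be reproduced along the following lines. This is a hedged sketch in which `clean_feat` and `noisy_feat` are assumed to be 2-D activation maps taken after the first convolutional layer; scikit-image supplies the SSIM implementation.

```python
# A hedged sketch of the SSIM comparison: compare the feature map produced for a
# clean image with the one produced for its noisy version. `clean_feat` and
# `noisy_feat` are assumed 2-D activation maps after the first convolutional layer.
import numpy as np
from skimage.metrics import structural_similarity

def feature_map_ssim(clean_feat: np.ndarray, noisy_feat: np.ndarray) -> float:
    """SSIM in [-1, 1]; values near +1 mean the noisy features match the clean ones."""
    rng = float(clean_feat.max() - clean_feat.min()) or 1.0   # avoid a zero data range
    return structural_similarity(clean_feat, noisy_feat, data_range=rng)
```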

Figure 2
figure 2

Filtered output after the first convolutional layer in the standard CNN and in ISCNN-III on both MNIST and EMNIST datasets. (A and B): The output of the filters to extract the high level features for clean and noisy digits. (C and D): The output of the filters to extract the high level features of clean and noisy letters.

Comparative evaluation of ISCNN

Three variants of ISCNN, namely ISCNN-I, ISCNN-II, and ISCNN-III, are considered, whose architectures are specified in Table 1 and Table S2 for the MNIST and EMNIST datasets, respectively. In ISCNN-I, the Max-Pool layers of the conventional CNN are replaced with Infor-Pool layers. In ISCNN-II, the Infor-Layer is introduced at a single place only, with an exponential membership function. ISCNN-III uses both the Infor-Layer and the Infor-Pool, and combines the sigmoid and exponential gain membership functions. Table 2 summarizes the comparison of ISCNN with the conventional CNN for varying PSNR on the MNIST dataset; the highest performance is achieved by ISCNN-III. Table 3 presents the comparison between ISCNN-III and CNN on the EMNIST dataset. It is evident from the results that, at lower PSNR values, the introduction of the Infor-Layer/Infor-Pool significantly improves the performance of the conventional CNN.

Table 1 Experimental designs (MNIST): Infor-set based convolutional neural networks.
Table 2 Experimental results (MNIST): Infor-set based convolutional neural networks.
Table 3 Experimental results (EMNIST): Infor-Set Based Convolutional Neural Networks.

Additional metrics, including precision, recall, F1, and ROC-AUC score, are also computed to analyse and compare ISCNN-III with the traditional CNN at PSNR = 6.11, as shown in Tables 4 and 5 for the MNIST and EMNIST datasets, respectively. The layer details of the ISCNN architectures are shown in Table 6 for the MNIST dataset; the layer details for the EMNIST dataset are shown in Table S3.

Table 4 Experimental results (MNIST): performance comparison of ISCNN-III with CNN.
Table 5 Experimental Results (EMNIST): Performance Comparison of ISCNN-III with CNN.
Table 6 Layer details of CNN and ISCNN models.

The model is evaluated using fivefold cross-validation. The MNIST test set has 12,000 samples, which is nearly the same size as the training dataset of 10,000 samples. Before splitting, the training dataset is shuffled. The learning accuracy and loss of the ISCNN-III architecture (the best model) are shown in Fig. 3A.

Figure 3
figure 3

(A) The training loss and classification accuracy of ISCNN-III (B) Comparison of CNN, ISCNN-I, ISCNN-II, and ISCNN-III with varying PSNR on the MNIST dataset. The proposed models show noise robustness while the performance of CNN decreases sharply.

While the performance of all ISCNN versions is comparable to that of classic CNN models when the data are clean, it is significantly better than the traditional CNN when the test dataset is heavily corrupted by noise. Figure 3B depicts a comparison between the CNN and the ISCNN variants with varying PSNR.

Figure 4 illustrates the confusion matrices for the MNIST dataset, comparing the performance of CNN and ISCNN-III with varying PSNR; Figure S6 shows the same comparison for the EMNIST dataset. When the images are not contaminated by noise, the performance of the classical CNN and the information set theory-based CNN is almost identical. However, the performance of the classical CNN declines drastically with decreasing PSNR. A single digit sample at two different PSNR values is shown in Figure S1. A number of wrongly classified and correctly classified samples are included in Figures S2 and S3, respectively.

Figure 4
figure 4

Confusion matrices for ISCNN-III and CNN for varying PSNR values for MNIST dataset.

Infor-set based deep feedforward networks (ISNN)

The architecture of ISNN, consisting of nodes (circles) and the connections (lines) between them, is shown in Fig. 5. Deep neural networks have multiple hidden layers that create deeper representations at each layer; only two hidden layers are shown for simplicity and ease of demonstration. The output from a node on the left is connected to a node on the right through a weighted sum of the inputs, which is then passed through an appropriate nonlinear activation function. The output of a neuron after each layer is calculated as:

$$a_{j}^{(k)} = f\left( \sum\limits_{i = 1}^{n} w_{ji} x_{i} + b \right)$$
(2)

where superscript \((k)\) denotes the layer number, \(x=({x}_{1},{x}_{2},\dots , {x}_{n})\) is the input feature vector, \({w}_{ji}\) is the weight connecting the \(i\)th node to the \(j\)th node, and \(f\) is a nonlinear activation function.

Figure 5
figure 5

Infor-Set based Deep Feedforward Network: Infor-Layer is introduced in between all the layers of DL-NN.

In ISNN, the Infor-Layer is applied after the source and after every dense layer. The output is the product of the information source values and the corresponding membership function values, as explained earlier. For example, the information set value \({z}_{i}\) is calculated as

$$z_{i} = x_{i} G_{s} \left( {X_{i} } \right)$$
(3)

where \({G}_{s}\left({X}_{i}\right)\) is a suitable membership function, such as sigmoid, exponential, or Gaussian. Similarly, the Infor-Layer is applied after each layer.
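To make the composition concrete, a minimal NumPy sketch of one ISNN block is given below, assuming a sigmoid membership function and a ReLU dense-layer activation; the weight and bias values are random placeholders rather than trained parameters.

```python
# A minimal NumPy sketch of one ISNN block: a dense layer as in Eq. (2) followed by
# the Infor-Layer of Eq. (3). Sigmoid membership assumed; weights are placeholders.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, W, b, f=lambda z: np.maximum(z, 0.0)):   # Eq. (2) with a ReLU activation
    return f(W @ x + b)

def infor_layer(x, membership=sigmoid):               # Eq. (3): z_i = x_i * G_s(x_i)
    return x * membership(x)

# input -> Infor-Layer -> dense -> Infor-Layer, as sketched in Fig. 5
x = np.random.rand(784)                               # e.g. a flattened 28 x 28 image
W, b = 0.01 * np.random.randn(128, 784), np.zeros(128)
h = infor_layer(dense(infor_layer(x), W, b))
```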

Evaluation of infor-set deep feedforward networks (ISNN)

The comparisons of the architectures and the results for DL-NN and ISNN are summarized in Tables 7 and 8, respectively. Here, the performance metric "accuracy" is used to measure how well the model predicts the test dataset for both positive and negative classes (Supplementary Information). The 'relu' activation function is used for the initial dense layers, and the 'Softmax' activation function for the final layer. The 'sigmoid' is used as the membership function for the Infor-Set calculation. With a batch size of 32, training is carried out over five epochs. Out of six experiments, the best outcome for each was taken into consideration.

Table 7 Architecture comparison between DL-NN and ISNN for MNIST dataset.
Table 8 Performance comparison between DL-NN and ISNN based on different PSNR for MNIST dataset.

Description of layer-I and layer-III in Table 7

In ISNN, two Infor-Layers are introduced that do not exist in DL-NN. In Layer-I, the Infor-Layer is used instead of a linear activation function, whereas Layer-III encapsulates the information from the output of Layer-II.

In Table 8, for the uncorrupted data (PSNR = Inf) the performance of both architectures on the MNIST dataset is identical. With the introduction of noise (at varying PSNR values), the performance of the conventional network deteriorates sharply, whereas ISNN exhibits robustness to noise. It is clear from the table that ISNN maintains an accuracy of 56.17%, compared to 36.77% for DL-NN, when the input image is corrupted with high noise (PSNR = 6.4), at which point minimal information of the actual image remains.

It is evident from Table 9 that, even at PSNR = 6.44, the standard DL-NN and CNN architectures yield 36.72% and 65%, respectively, whereas one of the proposed variants yields a significantly better performance of 96.40%, which remains almost consistent even at this high noise level.

Table 9 Comparison of different models on noisy MNIST data.

Discussion

In the present work, the capability of Infor-Set theory to handle uncertainty is exploited by integrating it with deep learning architectures to extract and enhance the actual information buried under the noise. The proposed architectures' efficacy is showcased using the MNIST database of handwritten digits and the EMNIST Balanced dataset, an extensive and inclusive compilation of handwritten characters and digits. To validate the proposed approach's noise tolerance, both datasets are degraded with noise at varying Peak Signal-to-Noise Ratio (PSNR).

The approach is general and can be easily applied to many deep networks. In the present work, the proposed technique is used to develop one variant of DL-NN and three variants of CNN. In ISNN (the variant of DL-NN), an Infor-Layer is introduced after the input layer and between the standard layers of the DL-NN. The three variants of ISCNN are ISCNN-I to ISCNN-III. In ISCNN-I, the Infor-Pool layer replaces the Max-Pool layer of the CNN; here, localized information is extracted, in terms of pooling, from the features produced by the CNN filters. In Max-Pool, the maximum value of the window is taken, but this may provide false information when the input is corrupted. In ISCNN-II, the Infor-Layer is introduced just after the input layer; the filters in the CNN architecture are used for feature engineering, but if the source is corrupted, the extracted features are not efficient. In ISCNN-III, both the Infor-Pool (in place of Max-Pool) and the Infor-Layer (after the input layer) are introduced, so that the raw data do not enter the Conv-Layer directly; the effective information is extracted first, which improves the quality of the features extracted by the filters of the Conv-Layer.

All these architectures are tested with different gain functions and their combinations. In the present work, only the selected membership functions are used; however, a dynamic membership function that adapts to the environment could be devised. For the uncorrupted data, the performance of the proposed architectures was identical to that of the conventional architectures; however, there is a significant and consistent improvement in performance on the degraded MNIST and EMNIST data at different levels of PSNR.

It is worth noting that the proposed technique is not a denoising technique; rather, it boosts the effective information, which in turn improves noise handling, and the noise may be reduced as a by-product of the procedure. The proposed method is also computationally inexpensive.

One limitation of the proposed method is that the Infor-Set layers do not contain trainable parameters. Therefore, not all of the ANN's layers can be replaced with Infor-Set layers; otherwise, there would be no room for learning. In the future, the Generalised Entropy Function (GEF)12 will be tested, which has the potential to replace the layers of a CNN, since its parameters not only collect information from the source but can also be trained to capture the relationship between input and output. Furthermore, compared to conventional architectures, the number of parameters that must be learned would be substantially reduced.

Because noise or irrelevant features in the proposed work may exhibit self-suppression or self-consolation, theoretical insights from combining information set theory with the activation function can be investigated to alleviate overfitting concerns. In the future, this strategy will be applied to broader real-world problems, such as medical image segmentation. The proposed techniques can also be extended to other deep learning architectures such as Autoencoders, U-Net, and RNNs, to name a few.

Methods

Information set theory

Fuzzy set theory was proposed to deal with the uncertainty present in crisp (conventional) sets by characterizing imprecise, vague, or missing information. A fuzzy set is formulated as a pair (member, membership grade), where the membership grade is defined in the interval [0, 1] and captures the belongingness of the member to the fuzzy set.

The Information set (Infor-set) theory employs an entropy framework to transform the fuzzy set into information values and create the information set. The aggregation of information values imparts the overall uncertainty.

Conversion of a fuzzy set into an information set

Consider an attribute \(X\) with the following set of values:

$$X=\{x_{1}, x_{2},\dots ,x_{n}\}$$
(4)

The information source values corresponding to the \({x}_{i}\) are denoted as

$${I}_{\rm X}=\left\{{I}_{\rm X}\left({x}_{i}\right)\right\} \forall {x}_{i}\in X$$
(5)

where \({I}_{\rm X}\left({x}_{i}\right)\) is an individual information source value and \({I}_{\rm X}\) is the collection of information source values from \(X\). To create fuzzy sets, these values are segregated into K soft classes. The kth fuzzy set (\({F}_{X}^{k})\) is represented as the pair \({(I}_{\rm X}\left({x}_{i}\right), {\mu }_{X}^{k} \left({x}_{i}\right))\), where \({I}_{\rm X}\left({x}_{i}\right)\) and \({\mu }_{X}^{k} \left({x}_{i}\right)\) are the information source values and the corresponding membership grades, respectively.

To measure the uncertainty in fuzzy sets, conventional entropy functions such as those of Shannon and of Pal and Pal5 are not suitable, as they quantify uncertainty in the probability domain rather than the possibility domain. Typically, various fuzzy entropy functions are used for fuzzy sets; however, their limitation is that they capture only the uncertainty in the membership grades and not in the information source values. As a remedy, the generalized Hanman-Anirban entropy function has been developed8, which combines the information source value and the associated gain into a single entity termed the "information value".

For the fuzzy set \({F}_{X}^{k}\), the uncertainty, or information in \({I}_{\rm X}\) is formulated as:

$${E}_{X}^{k}=\sum_{i}{I}_{X}\left({x}_{i}\right){g}_{X}^{k}\left({x}_{i}\right)$$
(6)

with

$$g_{X}^{k}\left(x_{i}\right)=e^{-\left(a_{X}^{k}\left(I_{X}\left(x_{i}\right)\right)^{3}+b_{X}^{k}\left(I_{X}\left(x_{i}\right)\right)^{2}+c_{X}^{k}I_{X}\left(x_{i}\right)+d_{X}^{k}\right)^{\alpha_{X}^{k}}}$$

where \({g}_{X}^{k}\) is the gain function representing the uncertainty in the information source value; this uncertainty is converted into the information values (entropy values) in (6). Although the gain function is data-dependent, Infor-Set theory provides the flexibility to choose different membership functions (Gaussian, sigmoid, etc.) by optimizing the parameters \({a}_{X}^{k}, {b}_{X}^{k}, {c}_{X}^{k}\), and \({d}_{X}^{k}\) belonging to the kth fuzzy set. All the information values are collected using the generalized Hanman-Anirban entropy function to form the information set,

$${S}_{X}^{k}=\left\{{I}_{X}\left({x}_{i}\right){g}_{X}^{k}\left({x}_{i}\right)\right\} \forall {x}_{i}\in X$$
(7)

and the sum of all the information values in \({S}_{X}^{k}\) is the effective information \({E}_{X}^{k}\). The normalized effective information \({E}_{XN}^{k}\) is obtained as

$${E}_{XN}^{k}=\frac{1}{\left|X\right|}\sum_{i}{I}_{X}\left({x}_{i}\right){g}_{X}^{k}\left({x}_{i}\right)$$
(8)

The definitions related to information source values, the information set, and the normalized information are discussed in more detail in3.
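To make the formulation concrete, a minimal NumPy sketch of the generalized gain and of Eqs. (6)-(8) is given below. The default parameter values (a, b, c, d, alpha) are illustrative assumptions, not values used in the experiments; with the defaults shown, the gain reduces to an exponential membership function.

```python
# A minimal NumPy sketch of the generalized Hanman-Anirban gain and of Eqs. (6)-(8).
# Parameter defaults are illustrative; they realize an exponential membership.
import numpy as np

def gain(I, a=0.0, b=0.0, c=1.0, d=0.0, alpha=1.0):
    """Generalized gain: exp(-(a*I^3 + b*I^2 + c*I + d)^alpha)."""
    return np.exp(-((a * I**3 + b * I**2 + c * I + d) ** alpha))

def information_set(I, **p):
    """Eq. (7): information values I_X(x_i) * g(x_i)."""
    return I * gain(I, **p)

def effective_information(I, **p):
    """Eq. (8): normalized effective information (mean of the information values)."""
    return np.mean(information_set(I, **p))

I = np.random.rand(10)          # information source values in [0, 1]
E = effective_information(I)    # defaults (c = 1, others 0) give an exponential gain
```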

Procedure to compute the effective information using the Gaussian membership function

Consider a data matrix \({\varvec{X}}\) of size \({\varvec{d}}\times {\varvec{m}}\).

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1j} & \cdots & x_{1m} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{i1} & \cdots & x_{ij} & \cdots & x_{im} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{d1} & \cdots & x_{dj} & \cdots & x_{dm} \end{bmatrix}_{d \times m}$$
(9)
  • Step 1 Computation of mean (\({\mu }_{j}\)) and variance \(( {\sigma }_{j})\) of the jth attribute considering only one soft class per attribute.

    $$\mu_{j} = \frac{1}{d}\mathop \sum \limits_{i = 1}^{d} x_{ij} , \quad j = 1, \ldots ,m$$
    (10)
    $$\sigma_{j} = \mathop \sum \limits_{i = 1}^{d} \left( {x_{ij} - \mu_{j} } \right)^{2} , \quad j = 1, \ldots ,m$$
    (11)

    where, \({{\varvec{x}}}_{{\varvec{j}}}\) is:

    $${{\varvec{x}}}_{{\varvec{j}}}=\begin{bmatrix}x_{1j}\\ \vdots \\ x_{ij}\\ \vdots \\ x_{dj}\end{bmatrix}, \quad j=1,\dots ,m$$
  • Step 2 Calculation of membership function value for each information source for the jth attribute.

    $${\text{G}}\left( {x_{ij} } \right) = e^{{ - \frac{1}{2}\left( {\frac{{x_{ij} - \mu_{j} }}{{\sigma_{j} }}} \right)^{2} }} , \quad j = 1, \ldots ,m$$
    (12)
  • Step 3 Calculation of information values for the information source matrix \(X\).

    $$S_{X}\left(x_{ij}\right) = x_{ij}\,{\text{G}}\left(x_{ij}\right), \quad i = 1, \ldots ,d; \; j = 1, \ldots ,m$$
    (13)

    where,

    $$S_{X}=\left\{S_{X}\left(x_{ij}\right)\right\}, \quad i=1,\dots ,d; \; j=1,\dots ,m$$
  • Step 4 Computation of effective information

    $$E_{j} = \frac{1}{\left| X \right|}\mathop \sum \limits_{i = 1}^{d} S_{X} \left( x_{ij} \right), \quad j = 1, \ldots ,m$$
    (14)

For the noisy data with varying PSNR, instead of the trained generalized gain function, the sigmoid and exponential membership functions are used, which give comparable results in the experiments. These (sigmoid and exponential) are standard functions and can be derived from the generalized gain function.
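The procedure above can be summarized in a short NumPy sketch. Following Eq. (11), \(\sigma_j\) is taken as the sum of squared deviations; the small epsilon and the interpretation of \(|X|\) as the number of summed values are practical assumptions noted in the comments.

```python
# A NumPy sketch of Steps 1-4 (Eqs. (10)-(14)) for a d x m data matrix X with the
# Gaussian membership function and one soft class per attribute. A small epsilon is
# added only to avoid division by zero for constant columns.
import numpy as np

def effective_information_gaussian(X: np.ndarray) -> np.ndarray:
    """Return one effective-information value per attribute (column) of X."""
    d, m = X.shape
    mu = X.mean(axis=0)                                  # Step 1, Eq. (10)
    sigma = ((X - mu) ** 2).sum(axis=0) + 1e-12          # Step 1, Eq. (11) as written
    G = np.exp(-0.5 * ((X - mu) / sigma) ** 2)           # Step 2, Eq. (12)
    S = X * G                                            # Step 3, Eq. (13)
    return S.mean(axis=0)                                # Step 4, Eq. (14), |X| taken as d

E = effective_information_gaussian(np.random.rand(100, 8))   # shape (8,)
```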

Connection between the information set and ‘swish’

The swish activation function is represented as:

$$f\left( x \right) = x \cdot sigmoid\left( \beta x \right)$$
(15)

Here the connection between the 'swish' activation function and the information sets can be clearly observed. In fact, the above equation is a special case of Eq. (6) with gain function \({g}_{X}^{k}\left({x}_{i}\right)=\) \(sigmoid(\beta x)\).

Typically, in an Artificial Neural Network (ANN), the outcome of every neuron captures information in the form \(\sum {x}_{i}{w}_{i}\), similar to how Infor-Set theory is formulated in the form \(\sum_{i}{I}_{X}\left({x}_{i}\right){g}_{X}\left({x}_{i}\right)\). However, the two have two major differences. First, in an ANN the weights do not follow any standard distribution function; they are developed by training, whereas Infor-Set theory acquires its weights with the help of a generalized gain function. Second, the weights in an ANN are determined by the input-output relationship, whereas Infor-Set theory does not extract any information based on this relationship; it merely pulls information from the input.
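The observation above can be checked numerically with a few lines of NumPy; this sketch simply verifies that, for \(\beta = 1\), swish coincides with the information value obtained when the gain is chosen as a sigmoid membership function.

```python
# A short numerical check: for beta = 1, swish x * sigmoid(beta * x) equals the
# information value x * g(x) with a sigmoid gain.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
swish = lambda x, beta=1.0: x * sigmoid(beta * x)        # Eq. (15)
info_value = lambda x: x * sigmoid(x)                    # information value with g = sigmoid

x = np.linspace(-5.0, 5.0, 11)
assert np.allclose(swish(x, beta=1.0), info_value(x))
```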

Infor-layer

  • Step 1 Information Source Values

    The information source values are attributes/features of an image, which can be represented as

    $$I_{X} = \left\{ {I_{X} \left( {x_{ijk} } \right)} \right\},\quad i = 1,2,3; \;j = 1, \ldots ,d; \;k = 1, \ldots ,m$$
    (16)
  • Step 2 Information gain

    The information gain is calculated for each element of the information source in a window with the help of a generalized membership (gain) function. The present work uses the commonly used membership functions (sigmoid and exponential). For example, the gain value for the sigmoid membership function is calculated as:

    $$G_{s} \left( {X_{ijk} } \right) = \frac{1}{{\left( {1 + \exp \left( { - x_{ijk} } \right) } \right)}}$$
    (17)
  • Step 3 Information set

    The information set is obtained by multiplying information source values with the corresponding information gain

    $$I_{X}^{1} \left( {X_{ijk} } \right) = x_{ijk} G_{s} \left( {X_{ijk} } \right)$$
    (18)

The extracted Infor-Sets are the output of the Infor-Layer.

The above steps demonstrate the procedure for the example of a 3D image; the proposed operation is general and can be applied to any multi-dimensional data.
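A minimal NumPy sketch of the Infor-Layer for a 3-channel image follows, using the sigmoid gain of Eq. (17); the (channels, rows, cols) layout is an assumption for illustration, since the operation is element-wise and layout-agnostic.

```python
# A NumPy sketch of the Infor-Layer (Steps 1-3, Eqs. (16)-(18)) for a 3-channel image.
import numpy as np

def infor_layer(x: np.ndarray) -> np.ndarray:
    gain = 1.0 / (1.0 + np.exp(-x))    # Step 2: sigmoid information gain, Eq. (17)
    return x * gain                    # Step 3: information set, Eq. (18)

image = np.random.rand(3, 28, 28)      # Step 1: information source values, Eq. (16)
out = infor_layer(image)               # Infor-Layer output, same shape as the input
```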

Infor-pool

The function of the Pooling layer is to reduce the dimension of the feature map (number of pixels) by capturing the information contained in a region. This reduces the computational complexity of the network and, in turn, speeds up the operation. The operation involves no trainable parameters; however, it has hyperparameters, including the filter size, stride, and padding. The most common pooling layers are Max pooling and Average pooling; these operations with a stride of two and a filter size of 2 × 2 are shown in Fig. 6. Although these pooling operations provide exemplary performance in terms of positional invariance and reduce size and complexity, they incur significant information loss.

Figure 6
figure 6

Illustration of Max Pooling and Average Pooling: In Max-Pool, the maximum value is chosen from the selected window, whereas Avg-pool determines the average value.

According to our literature review, little work has been done to address information loss in pooling operations. In the Max-pooling operation, the maximum value is chosen from the feature map's window; the most prominent value is kept, while the other values are discarded. In average pooling, the average of the values in a window is taken.

The Infor-set based pooling captures the holistic information of a specific window weighted as per the input. Figure 7 shows the block diagram of the process.

Figure 7
figure 7

Infor-Set Based Pooling Layer: Steps for obtaining the effective/collective information are explained. First, the Information set is obtained by taking the product of the Information source and Information Gain, then the average value of this Information set is calculated to obtain the collective information. Afterward, the window/volume is shifted as per the chosen stride and padding.

  • Step 1 Information extraction

    Before information extraction, a non-overlapping window of size \(m\times m \times m\) is chosen from an input matrix of size \(n\times n\times n\), obtained after the convolution operation (where \(m<n\)). In the figure, a \(3\times 3\times 3\) window is selected. Afterward, the information from the non-overlapping window is extracted by following the procedure explained in the Infor-Layer section.

  • Step-2 Collective information calculation

    After information extraction, the collective information contained in the selected window is captured with the help of the following equation:

    $$E_{Xi} = \frac{1}{M}\mathop \sum \limits_{j = 1}^{m} I_{X}^{1} \left( {X_{ijk} } \right)$$
    (19)

To calculate the overall information from the entire input matrix, the window is shifted according to the chosen stride \(k\), and the next value of collective information is calculated by following the same procedure. After following the above steps, a new matrix \(S\) of size \(3 \times h \times v\) is obtained, where the values of \(h\) and \(v\) are calculated as:

$$h = \frac{n + 2p - m}{k} + 1$$
(20)
$$v = \frac{n + 2p - m}{k} + 1$$
(21)

where \(n\) is the size of the input matrix, \(p\) is the applied padding, \(m\) is the size of the window, and \(k\) is the applied stride. The output matrix obtained after the Infor-Set-based pooling operation is used as the input to the subsequent layers.
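A NumPy sketch of the Infor-Pool operation is given below, with a sigmoid gain; padding \(p = 0\) and stride \(k = m\) (non-overlapping windows) are assumed, so that \(h = v = (n - m)/k + 1\) as in Eqs. (20)-(21).

```python
# A NumPy sketch of Infor-Pool: within each non-overlapping m x m window of every
# channel, the collective information is the mean of x * G_s(x) (Eq. (19)).
import numpy as np

def infor_pool(x: np.ndarray, m: int = 2) -> np.ndarray:
    """x has shape (channels, n, n); returns shape (channels, h, v)."""
    c, n, _ = x.shape
    h = (n - m) // m + 1                            # output size with p = 0, k = m
    info = x * (1.0 / (1.0 + np.exp(-x)))           # information set of the feature map
    # group the map into non-overlapping windows and average the information per window
    windows = info[:, :h * m, :h * m].reshape(c, h, m, h, m)
    return windows.mean(axis=(2, 4))

pooled = infor_pool(np.random.rand(3, 28, 28), m=2)  # -> shape (3, 14, 14)
```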

The proposed integration can enhance the effective information, which improves noise handling; the noise itself may be suppressed in the process. This can be analysed with the following simple example:

Take an image and suppose it is corrupted by salt-and-pepper noise. Consider a small window for localized information, and let this window contain more black pixels than white before the noise is introduced. Following the steps of Infor-Set theory, the Gaussian MFs are derived through the generalized information-gain formalism, which, as per the theory, gives a measure of the uncertainty in the information source values. An information set is then prepared, i.e., the collection of the information values corresponding to the original source values, computed using the Hanman-Anirban entropy function. If salt noise falls on a pixel, its source value becomes high, but its belongingness with respect to the neighborhood is very low, because the neighborhood is dominated by dark pixels. When the effective information is calculated by multiplying the information source value with its belongingness, the effect of the salt is therefore reduced. In the case of pepper noise, the pixel is dark and its belongingness is high, so the product remains dark. In this particular case, Infor-Set theory extracts the actual information to some extent and suppresses the noise. Table S1 explains the significant differences between the proposed models and the most relevant conventional DL models in design and functionality.