1 Introduction

The content shared on networks and social media is changing rapidly, and the amount of illegal content on the Internet has increased considerably. Adult content images involving children, for instance, have become a common and troubling problem, because this market can encourage further abuse [1] and is a clear assault on children's dignity, since it presents them as sexual objects.

Automatic detection of adult content images is one of the significant issues in law enforcement and forensic activities [2]. Such computational tools can help prevent undesired material from being uploaded or accessed by a particular user or at a particular location. Moreover, the fast evaluation of adult content images at crime scenes can lead to the immediate arrest of offenders. Once adult content image detection is complete, further techniques may be applied to parts of the body, such as the face or private body parts, which can help filter videos and gather evidence during an investigation.

In recent years, adult content image recognition methods have developed rapidly and are used in various contexts [3]. Real-time multi-image recognition systems face several obstacles, especially in densely populated areas. Critical challenges for these systems include computation time, lighting conditions, and increased misalignment rates due to poor image quality [4]. With the rapid growth of social networks, a very large number of images has become available. Hence, many adult content websites, some of which are authorized, are accessible to children and institutions [5]. Recently, adult content detection has motivated the development of several image content-based approaches in computer vision. These approaches can be divided into three categories: rule-based methods, learning-based models, and image retrieval-based approaches. In uncontrolled environments, small changes across locations over short periods are a significant obstacle to image recognition systems. An essential drawback of applications used in real time is that image recognition systems cannot cope with varying illumination [6].

Variations in lighting conditions, an increasing number of targets in the environment, and low-quality streams can degrade the performance of an image recognition system. This research proposes novel methods for using multiphase recognition systems in real time. A new skin color segmentation method is proposed in this paper to reduce the misclassification rate and the computation time. It uses a structural architecture with a parallel configuration to filter the dataset using the related skin color segmentation, and it extracts features that are handed over to the recognition engine developed in this system. A Gaussian-Bernoulli restricted Boltzmann machine is used for feature extraction from adult content images. The following sections discuss related work, the methodology, and the experimental results.

2 Related works

Most adult content image detection techniques focus on recognizing human skin. Approaches based on this idea assume that color adult content images contain many skin regions. Here, the focus is mainly on low-level features, such as color and general distribution patterns. This approach has been explored extensively in the image domain [7].

A standard method for classifying images is the use of Convolutional Neural Networks (CNNs). A CNN consists of two parts: feature learning and classification. The classification part consists of several densely connected layers that follow a flattening layer and end with a final softmax layer [8].

The most important region of interest in detection is the human face. If the input contains an excessively large region of skin, this is considered a relevant indicator of nudity. However, the face is not always the most straightforward element affecting the output, because a given image can contain many skin tones and still be non-adult content [9].

Formerly, the most important feature was human skin. If the input contains too much skin, this becomes an associated indicator of nudity. However, skin is not the only element that influences the output: a given image may contain a large amount of skin color and still be non-adult content [10, 11].

The first adult content detectors were created primarily around the study of skin characteristics. These approaches use a variety of methods to determine whether the pixels in a particular image belong to skin. Nudity and adult content detection has a long history in computer vision. Early work focused mainly on finding nude people in images based on a model of the human body structure [12]. However, this approach requires that adult content images show nude people in simple scenes. In contrast, adult content images on the Internet generally exhibit broad variations in background, scale, scenario, and human pose.

The authors of [13] proposed a method that combines GoogLeNet and ResNet [14] for feature extraction with a recurrent neural network, based on long short-term memory (LSTM) units, for video classification. The method is called ACORDE (Adult Content Recognition with Deep Neural Networks) and was evaluated on the Pornography-800 dataset [2], with an accuracy of 95.6% in the best configuration (ResNet-101).

Vakili et al. [15] employed a content-based image retrieval method to detect adult content images, in which a rectangular region of interest was obtained using skin-like pixel detection [16]. Characteristics such as texture, shape, and color were then used to retrieve, for each image, the 100 most similar images from a dataset containing both non-adult and adult images. Finally, the images were classified based on a predefined threshold on the number of retrieved adult images. Nevertheless, since there is significant variation within the class of adult content images, creating such a dataset is very difficult. Moreover, designing proper low-level characteristics is considered problematic. In other research [17, 18], a content-based image retrieval method was employed to investigate the presence of humans in the images. Once the presence of adult content in the images was affirmed, skin color was analyzed by detecting skin regions. Because detecting humans in an image is difficult, approaches based on learning methods have been proposed [19].

Deep learning is responsible for many recent advances in image classification tasks [20]. The AlexNet and GoogLeNet architectures have been applied to selected video frames. To classify entire videos as non-adult or adult content, the result for a video is determined via a standard voting process over its frames [21]. Weights pre-trained on the ImageNet dataset have also been used, with fine-tuning applied only to the last layer, which corresponds to the classifier.
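The voting step can be illustrated with a minimal sketch. It assumes (this is our assumption, not a detail given in [21]) that every selected frame receives a single class label and that a simple majority decides the label of the video; the label names are hypothetical.

```python
from collections import Counter


def classify_video(frame_labels):
    """Return the majority label among per-frame predictions.

    frame_labels: list of labels such as "adult" / "non_adult"
    (hypothetical names). If two labels are tied, the one that
    appears first in the list wins, which is an arbitrary choice.
    """
    if not frame_labels:
        raise ValueError("at least one frame label is required")
    counts = Counter(frame_labels)
    # most_common(1) returns [(label, count)] for the most frequent label
    return counts.most_common(1)[0][0]


# Four frames, three of which were labelled as adult content
video_label = classify_video(["adult", "non_adult", "adult", "adult"])
```

A weighted vote over per-frame class probabilities would be an equally plausible reading of "standard voting"; the sketch only shows the simplest variant.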

The Region of Interest (ROI) is also one of the approaches used to extract features for image recognition. In adult content detection, skin regions have been considered as the ROIs. Multiple hand-crafted features such as texture, color, and shape have been extracted from the detected skin region for recognition. Region-based approaches are robust against background clutter when compared to feature-based approaches. However, they run the risk of wrong ROI detection, since skin detection is a complex problem in itself [22].

Body part-based approaches rest on defining a number of adult content features, such as the bottom, breast, and belly, and on training consistent body part detectors for these features [23].

After the image has been scanned with the body part-based approach, the results are assembled into a semantic feature vector for classification. However, these body part detectors are ambiguous and, because of their small patch support and the large variations seen during training, are likely to produce many false detections [24]. Deep neural networks have also been applied to adult content detection, using the AlexNet and GoogLeNet architectures [25]. The training phase reuses models pre-trained on the ImageNet dataset, and the NPDI dataset is used for fine-tuning.

According to [26], one of two approaches is typically followed. First, a Restricted Boltzmann Machine (RBM) is trained in an unsupervised manner to model the distribution of the inputs. The RBM is then applied in one of two ways: either the hidden layer is used to pre-process the input data by replacing it with the representation given by the hidden layer, or the RBM parameters are used to initialize a feed-forward neural network. In each case, the RBM is paired with some other learning algorithm to solve the supervised learning problem at hand. Unfortunately, this approach requires tuning both sets of hyper-parameters simultaneously. Moreover, because the RBM is trained in an unsupervised manner, it is unaware of the nature of the supervised task that needs to be solved and offers no guarantee that the information extracted by its hidden layer will be helpful.

In [27], bag-of-visual-words and skin detection were compared to CNNs for detecting adult content images. The research uses 650,000 images, and the AlexNet architecture achieved better accuracy than the other techniques. The authors used a GoogLeNet architecture and the NPDI dataset for training and testing, respectively. Moreover, the authors also introduced a novel approach that uses Multiple Instance Learning (MIL) [28]. The model is trained to label arbitrary parts of an image, and if one of them is considered adult content, the entire image is labeled as adult content as well. That paper uses a balanced dataset containing approximately 234,000 images.
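The multiple-instance rule can be sketched in a few lines. The sketch assumes that a region-level model has already produced one score per image part; the function name, the score range, and the 0.5 threshold are hypothetical choices made for illustration and are not taken from [28].

```python
def image_is_adult(region_scores, threshold=0.5):
    """Apply the standard multiple-instance assumption to one image.

    region_scores: per-region scores in [0, 1], where a higher value
    means the region is more likely to be adult content (hypothetical
    output of a region-level classifier).
    The image (the "bag") is positive if at least one region
    (an "instance") is positive, i.e. if the maximum score
    reaches the threshold.
    """
    if not region_scores:
        raise ValueError("at least one region score is required")
    return max(region_scores) >= threshold


flagged = image_is_adult([0.10, 0.92, 0.20])    # one region is positive
not_flagged = image_is_adult([0.10, 0.30, 0.20])  # no region is positive
```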

A content-based image retrieval technique has also been used to detect pornographic images. Based on the detection of skin pixels, a rectangular area is obtained. For each image, features such as color, texture, and shape are used to retrieve 100 similar images from an image database that includes adult and non-adult images. Finally, images are classified with a predefined threshold on the number of adult images retrieved [29]. However, building such a database is very difficult due to the considerable diversity within the class of pornographic images. In addition, designing suitable low-level features is difficult.
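The retrieve-then-threshold decision can be sketched as follows. This is only an illustration under stated assumptions: the feature vectors, the Euclidean distance, and the threshold value are hypothetical, since the cited works do not specify them here; the default of 100 retrieved images follows the text.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_by_retrieval(query, database, k=100, min_adult=50):
    """Label a query image by counting adult images among its k nearest neighbours.

    database: list of (feature_vector, is_adult) pairs.
    min_adult: hypothetical threshold on the number of retrieved adult images.
    """
    ranked = sorted(database, key=lambda item: euclidean(query, item[0]))
    top_k = ranked[:k]
    adult_hits = sum(1 for _, is_adult in top_k if is_adult)
    return adult_hits >= min_adult


# Toy database with two-dimensional features (made-up values)
database = [
    ([0.90, 0.80], True), ([0.85, 0.75], True), ([0.80, 0.90], True),
    ([0.10, 0.20], False), ([0.20, 0.10], False), ([0.15, 0.25], False),
]
near_adult = classify_by_retrieval([0.88, 0.80], database, k=3, min_adult=2)
near_normal = classify_by_retrieval([0.12, 0.20], database, k=3, min_adult=2)
```

The sketch also makes the stated weakness visible: the decision is only as good as the database and the low-level features used to measure similarity.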

3 Restricted Boltzmann Machines

The RBM is an effective technique employed in deep belief networks (DBNs) and is one of the Markov random field models. It was proposed by Smolensky in 1986 based on the Boltzmann machine [28]. The RBM is a stochastic generative neural network that can learn the probability distribution of the input data. RBMs are applied in dimensionality reduction, collaborative filtering, feature learning, topic modeling, radar automatic target recognition, chip synthesis, and speech recognition. RBMs can be trained in a supervised or unsupervised manner, depending on the task. Deep learning (DL) and RBMs have clear advantages for large amounts of unlabeled data, particularly in the context of big data.

A restricted Boltzmann machine is an artificial neural network for learning probability distributions. Using Eq. 1, an RBM defines a probability distribution over pairs of vectors, \(V \in \{0, 1\}^{N_{V}}\) and \(H \in \{0, 1\}^{N_{H}}\).

$$P(v,h) = P(V = v,H = h) = \exp (v^{\top } b_{V} + h^{\top } b_{H} + v^{\top } Wh)/Z$$
(1)

where \(b_{V}\) and \(b_{H}\) are the bias vectors for the visible and hidden units, respectively, and \(W\) is the matrix of connection weights. The quantity \(Z = Z(b_{V} ,b_{H} ,W)\) is the value of the partition function, which ensures that Eq. (1) is a valid probability distribution. The conditional distributions \(P(H|v)\) and \(P(V|h)\) are factorial and are given by \(P(H^{(j)} = 1|v) = s(b_{H} + W^{\top } v)^{(j)}\) and \(P(V^{(i)} = 1|h) = s(b_{V} + Wh)^{(i)}\), where \(x^{(j)}\) denotes the jth component of the vector \(x\) and \(s(x)^{(j)} = (1 + \exp ( - x^{(j)} ))^{ - 1}\) is the logistic function. We use i to index the visible units and j to index the hidden units. Equation 2 allows the vector V to take real values. Figure 1 shows the general Boltzmann machine and the restricted Boltzmann machine.

$$P(v,h) = \exp ( - \left\| v \right\|^{2} /2 + v^{\top } b_{V} + h^{\top } b_{H} + v^{\top } Wh)/Z$$
(2)

Fig. 1 Left: A general Boltzmann machine. The top layer represents a vector of stochastic binary “hidden” features and the bottom layer represents a vector of stochastic binary “visible” variables. Right: A restricted Boltzmann machine with no hidden-to-hidden and no visible-to-visible connections

With Eq. 2, the gradients and the conditional distribution \(P(H|v)\) do not change.

The only change is in the conditional distribution \(P(V|h)\), which becomes a multivariate Gaussian, \(\mathcal{N}(b_{V} + Wh, I)\).
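The conditional distributions stated above can be written down directly. The following NumPy sketch is illustrative only: the array shapes, the random initialization, and all function names are assumptions made for the example, with \(W\) stored as an \(N_V \times N_H\) matrix.

```python
import numpy as np


def sigmoid(x):
    """Logistic function s(x), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


def hidden_probs(v, W, b_H):
    """P(H_j = 1 | v) = s(b_H + W^T v)_j, identical for Eq. 1 and Eq. 2."""
    return sigmoid(b_H + W.T @ v)


def visible_probs_binary(h, W, b_V):
    """Binary visible units (Eq. 1): P(V_i = 1 | h) = s(b_V + W h)_i."""
    return sigmoid(b_V + W @ h)


def sample_visible_gaussian(h, W, b_V, rng):
    """Real-valued visible units (Eq. 2): V | h ~ N(b_V + W h, I)."""
    mean = b_V + W @ h
    return mean + rng.standard_normal(mean.shape)


rng = np.random.default_rng(0)
N_V, N_H = 6, 3                                   # toy sizes (hypothetical)
W = 0.1 * rng.standard_normal((N_V, N_H))
b_V = np.zeros(N_V)
b_H = np.zeros(N_H)

v = rng.standard_normal(N_V)                      # a real-valued visible vector
p_h = hidden_probs(v, W, b_H)                     # factorial Bernoulli probabilities
h = (rng.random(N_H) < p_h).astype(float)         # one sample of the hidden units
v_new = sample_visible_gaussian(h, W, b_V, rng)   # one sample of the visible units
```

Because both conditionals factorize, all units of a layer can be sampled in one vectorized step, which is what makes block Gibbs sampling in an RBM cheap.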

Equation 3 shows the gradient of the average log probability \(L\) given a dataset \(S\),

$$\begin{gathered} L = \frac{1}{\left| S \right|}\sum\limits_{v \in S} {\log P(v)} , \hfill \\ \frac{\partial L}{\partial W} = \left\langle {VH^{\top } } \right\rangle_{{P(H|V)\tilde{P}(V)}} - \left\langle {VH^{\top } } \right\rangle_{P(H,V)} \hfill \\ \end{gathered}$$
(3)

where \(\tilde{P}(V) = \frac{1}{\left| S \right|}\sum\limits_{v \in S} {\delta_{v} (V)}\) is the empirical data distribution (here \(\delta_{x} (X)\) is the distribution over real-valued vectors that is concentrated at \(x\)), and \(\left\langle {f(X)} \right\rangle_{P(X)}\) is the expectation of \(f(X)\) under the distribution \(P\). Computing the exact values of the expectations \(\left\langle \cdot \right\rangle_{P(H,V)}\) is computationally intractable. Much work has been done on computing approximate values of these expectations that are good enough for practical learning and inference tasks. Figure 1 shows the difference between the general Boltzmann machine and the restricted Boltzmann machine.

3.1 General Boltzmann machine and restricted Boltzmann machine

The purpose of the RBM is to reconstruct the inputs as accurately as possible. In the forward pass, the inputs are modified by weights and biases and used to activate the hidden layer. In the next step, the activations from the hidden layer are modified by weights and biases and sent back to the input layer for activation. In the input layer, these modified activations are viewed as reconstructions of the input and compared to the original input.
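The forward pass, the reconstruction, and an approximate weight gradient can be combined in one short sketch. The text does not say which approximation of the model expectation is used; the example below uses contrastive divergence with a single Gibbs step (CD-1), a widely used choice that is swapped in here purely for illustration. Binary visible units, the toy sizes, and the step size are likewise assumptions.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


def cd1_weight_gradient(batch, W, b_V, b_H, rng):
    """Estimate dL/dW with one Gibbs step (CD-1).

    batch: array of shape (n, N_V) holding binary visible vectors.
    W: weight matrix of shape (N_V, N_H).
    Returns the gradient estimate and the mean squared reconstruction error.
    """
    # Forward pass: hidden activation probabilities given the data
    p_h_data = sigmoid(batch @ W + b_H)
    positive = batch.T @ p_h_data                 # data-dependent term <V H^T>
    # Sample hidden states and send them back to reconstruct the input
    h_sample = (rng.random(p_h_data.shape) < p_h_data).astype(float)
    v_recon = sigmoid(h_sample @ W.T + b_V)
    # One-step stand-in for the intractable model expectation <V H^T>
    p_h_recon = sigmoid(v_recon @ W + b_H)
    negative = v_recon.T @ p_h_recon
    grad = (positive - negative) / batch.shape[0]
    recon_error = float(np.mean((batch - v_recon) ** 2))
    return grad, recon_error


rng = np.random.default_rng(0)
N_V, N_H = 8, 4                                   # toy sizes (hypothetical)
W = 0.01 * rng.standard_normal((N_V, N_H))
b_V = np.zeros(N_V)
b_H = np.zeros(N_H)
batch = (rng.random((16, N_V)) < 0.5).astype(float)

grad, err = cd1_weight_gradient(batch, W, b_V, b_H, rng)
W_updated = W + 0.1 * grad    # gradient ascent step; 0.1 is a hypothetical step size
```

The reconstruction error returned here is the quantity compared "to the original input" in the description above; it is a convenient training monitor, although it is not the objective that CD-1 optimizes.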

4 Proposed model

This section explains the proposed deep learning methods and the data set used to train and evaluate the model.

4.1 Model structure

Our proposed method focuses on classifying entire images as non-adult content or adult content. A data-driven approach was adopted to learn more semantic and abstract characteristics via a deep CNN. The network construction details and the training and testing procedures of the proposed adult content image detector are discussed in this section. The collected dataset contains two classes of images: pornographic and non-pornographic. Images were resized to 81 × 81 × 3, and our dataset included only private parts of the body. The parameters used for training are shown in Table 1.

We also used the same parameters for validation, except for batch_size, which was set to 32 (Table 1). In the first step, we built the CNN model, for which we proposed 5 layers. Between each pair of layers, a Gaussian-Bernoulli restricted Boltzmann machine is used to enhance the specific geometric properties of the images, and its output is added to the result of each block, so that the output of the Gaussian-Bernoulli restricted Boltzmann machine layer is consistent with that of each block. This process is applied to each of the five blocks. At this stage, the training error rate is calculated and then evaluated using the validation data. Training stops after 200 repetitions. Finally, the best model, that is, the one with the lowest validation error, is produced and stored.

Models in Keras can take two forms: Sequential or the Functional application programming interface (API). We used the Sequential model, which suits most deep learning networks and allows the network layers to be stacked easily in the correct order from input to output. Next, a 2D convolutional layer was added to process the 2D input images. The first argument passed to the Conv2D layer function is the number of output channels. The next argument is the kernel size, defined in this paper as a 5 × 5 moving window, followed by the strides in the x and y directions, (1, 1). The activation function is a rectified linear unit (ReLU), and lastly, the model is supplied with the size of the input to the layer. Figure 2 shows the proposed method.

Table 1 Training Parameters

Fig. 2 Structure of the proposed method

We started by initializing a sequential model. After initialization, we added, in order: 2 convolutional layers of 96 channels with 5 × 5 kernels and same padding; 1 max-pooling layer with a 2 × 2 pool size and a 2 × 2 stride; 2 convolutional layers of 96 channels with 5 × 5 kernels and same padding; 2 max-pooling layers with a 2 × 2 pool size and a 2 × 2 stride; 4 convolutional layers of 128 channels of 22 kernels with same padding; 2 max-pooling layers with a 2 × 2 pool size and a 2 × 2 stride; 2 convolutional layers of 512 channels with 5 × 5 kernels and same padding; and 2 max-pooling layers with a 2 × 2 pool size and a 2 × 2 stride. We also added ReLU activation to each layer so that negative values are not passed on to the next layer. After the convolutional layers, the output is flattened into a vector and passed to the dense layers. We used ReLU activation for each dense layer of 64 units to prevent negative values from being forwarded through the network. The final layer is a dense layer with 2 units and softmax activation, since the model has to predict one of two classes: adult and non-adult content. The outputs of the softmax layer lie between 0 and 1 and indicate how confident the model is about the class to which an image belongs.

The model is complete once the softmax layer has been added, and it then needs to be compiled. We used the Adam optimizer to reach a minimum while training the model. The Adam optimizer can help training escape local minima and continue towards the global minimum if it gets stuck. The learning rate of the optimizer can also be specified.

In the proposed model, the learning rate is set to 0.003. If the training loss oscillates strongly between epochs, the learning rate needs to be decreased to reach the global minimum. After the proposed model was created, ModelCheckpoint and EarlyStopping were imported from Keras. ModelCheckpoint saves the model by monitoring a specific parameter of the model. In this article, validation accuracy is monitored by passing val_acc to ModelCheckpoint. If the validation accuracy in the current epoch is greater than in the previous best epoch, the model is saved to disk.

EarlyStopping stops model training early if the parameter set for monitoring in EarlyStopping no longer increases.
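The layer pattern and training setup described in this section can be sketched with the Keras API bundled with TensorFlow. This is a reduced illustration and not the full network: only the first convolutional block is built, the Gaussian-Bernoulli restricted Boltzmann machine layers are omitted, and the loss function, the checkpoint file name, and the patience value are assumptions that do not come from the text. Recent TensorFlow versions report the monitored quantity as `val_accuracy`, whereas the text refers to `val_acc`.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_model(input_shape=(81, 81, 3), num_classes=2):
    """Build and compile a reduced version of the described CNN."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        # First block: two 5 x 5 convolutions with 96 channels, same padding, ReLU
        layers.Conv2D(96, (5, 5), strides=(1, 1), padding="same", activation="relu"),
        layers.Conv2D(96, (5, 5), strides=(1, 1), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
        # Flatten the feature maps and classify
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.003),
        loss="categorical_crossentropy",   # assumed; the loss is not named in the text
        metrics=["accuracy"],
    )
    return model


model = build_model()

callbacks = [
    # Keep only the model with the best validation accuracy seen so far
    keras.callbacks.ModelCheckpoint(
        "best_model.keras",                # hypothetical file name
        monitor="val_accuracy",
        save_best_only=True,
        mode="max",
    ),
    # Stop when validation accuracy has not improved for `patience` epochs
    keras.callbacks.EarlyStopping(
        monitor="val_accuracy",
        patience=20,                       # hypothetical value
        mode="max",
    ),
]
```

The `callbacks` list would then be passed to `model.fit(...)` together with the training and validation data.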

ImageDataGenerator is used to feed data to the model. The training and testing data are passed to fit_generator. In fit_generator, steps_per_epoch specifies the number of batches of training data passed to the model in each epoch, and validation_steps does the same for the test data, as shown in Table 2.

Table 2 Trained parameters

4.2 Dataset

A total of 400 non-adult content and 400 adult content images were extracted from the NPDI dataset. In addition, 20,000 adult content and non-adult content images were downloaded from websites. Next, we manually curated them into a collection of 17,800 adult content and non-adult content images. We divided these images into three subsets: a training set of 3,000 normal and 8,000 adult content images, a validation set of 3,000 normal and 2,000 adult content images, and a test set of 5,300 normal and 1,300 adult content images.

4.3 Evaluation Metrics

The metrics used to evaluate our model are accuracy, recall, precision, and F-score. Several scholars have used these metrics to evaluate adult image content models.

Precision (PR) measures the exactness of the classifier's positive predictions, that is, the degree of reliability of the classifier's outputs. The PR metric is calculated by Eq. 4.

$$PR = \frac{TP}{{TP + FP}}$$
(4)

TP (true positives) is the number of adult image content samples that the learner assigns to the adult image content class. FP (false positives) is the number of non-adult image content samples assigned to the adult image content class. The recall (RE) metric reflects how well the learner finds the samples of a particular class. It is calculated by Eq. 5.

$$RE = \frac{TP}{{TP + FN}}$$
(5)

TP is again the number of true positives, and FN (false negatives) is the number of adult image content samples that are incorrectly assigned to the non-adult image content class. The F-score (F) is a combination of the PR and RE metrics. F is calculated by Eq. 6.

$$F = \frac{2 \times RE \times PR}{{RE + PR}}$$
(6)

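The accuracy (AC) metric is calculated by Eq. 7, where TN (true negatives) is the number of non-adult image content samples correctly assigned to the non-adult image content class.

$$AC = \frac{TP + TN}{{TP + TN + FP + FN}}$$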
(7)

Error is calculated utilizing Eq. 8.

$$Error = 1 - AC$$
(8)
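The metrics of Eqs. 4 to 8 can be computed from the four confusion-matrix counts as in the following sketch. The counts in the example are made up for illustration and are not results from this paper; accuracy uses the standard definition, correct predictions (TP + TN) divided by all predictions, and all denominators are assumed to be non-zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall, F-score, accuracy, and error from raw counts."""
    precision = tp / (tp + fp)                              # Eq. 4
    recall = tp / (tp + fn)                                 # Eq. 5
    f_score = 2 * recall * precision / (recall + precision)  # Eq. 6
    accuracy = (tp + tn) / (tp + tn + fp + fn)              # Eq. 7
    error = 1 - accuracy                                    # Eq. 8
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "accuracy": accuracy,
        "error": error,
    }


# Hypothetical counts: 90 TP, 10 FP, 80 TN, 20 FN
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
```

With these counts, precision is 0.9 while accuracy is 0.85, which shows why the four metrics are reported together rather than relying on accuracy alone.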

5 Evaluation and discussion

In this section, we present and discuss the results obtained from the experiments. The proposed adult content image detector was implemented in TensorFlow. The efficiency of the proposed approach was then compared with other methods for adult content image detection on the dataset. A batch size of 16 was used to train the models with stochastic gradient descent. The experiments were stopped after 20,000 iterations, and the trained parameters were used to compare the proposed approach with other adult content image detection approaches. Visualizing the weights of the first layer helps us decide whether the deep network was trained well. A convolutional neural network produces a lot of noise if it is not trained properly, and the visualizations of the first layer would be random noise if the network were merely guessing.

The image retrieval-based adult content image detection approach was tested on our test dataset. In [3], a content-based image retrieval method was employed to determine the presence of humans in the images. To detect the skin region, the researchers analyzed skin color based on the confirmation of adult content in the images. An accuracy rate of 82.3% was obtained, in which the failed case was considered normal (adult content). Moreover, it is difficult to detect a human’s presence in the image, and only a small part of the human body is shown in many adult content images [13].

Another paper [16] presented a new skin color method for segmenting skin regions in the image. In that study, texture, area, and shape features were extracted for classification with a neural network (NN). These approaches were applied to our dataset in the experiments. According to Table 2, the accuracy rate was 89.335%. The accuracy of the skin color method was thus low, and the reason for this outcome is the variation in human skin color. Another approach for adult content image detection is bag-of-words (BOW), a pioneering state-of-the-art learning-based approach. The efficiency of the proposed detector was compared with other BOW-based adult content image detectors on our dataset. The method used in [28] is based on an extension of the well-known SIFT descriptor, named Hue-SIFT, which adds color information to the original SIFT. Hue-SIFT was used to construct a visual dictionary, and an SVM was used as the classifier. This approach was reproduced on our dataset using the LIBSVM [15] software.

Furthermore, two kernels (the RBF kernel and the histogram intersection kernel) were employed in BOW. Table 3 shows the results, in which a high accuracy rate was obtained. However, detection failed in some cases because of the inherent shortcomings of hand-engineered visual features.

Table 3 Comparison of performance for various adult content image detectors

This comparison shows that adult content can appear in any part of an image, alongside different other contents. Moreover, even though some areas are related to adult content, only a few small areas can be considered adult content features. Skin color also plays a pivotal role in identifying private parts of the human body. Hence, recognition in this setting is more difficult for images of people with darker skin than for images of people with lighter skin. The result is shown in Fig. 3.

Fig. 3 Comparing the accuracy of the method on test datasets based on skin color change

When the networks were initialized with ImageNet weights and fine-tuned, there was a significant improvement in the overall performance of the VGG and ResNet networks, which remained on par with each other, compared to the results without fine-tuning and to AlexNet. On the motion information, VGG showed slightly better results than AlexNet and ResNet, as shown in Table 4.

Table 4 Loss, accuracy, val-loss, and val-accuracy

The fact that VGG performed comparably to ResNet on static information but better on motion information supports the notion that motion information has a specific structure, even after being represented as images, which some convolutional neural network models are better able to capture. Consequently, a model focused exclusively on motion information could improve the results even further. Finally, it is essential to note that the choice of a model should not be based solely on classification numbers. The VGG results are illustrated in Fig. 4.

Fig. 4 Result of VGG

The ROC curve shows the detection capacity at different thresholds by plotting the TP rate (TPR) against the FP rate (FPR). The ROC curve is an essential part of our evaluation as well. The area under the ROC curve (AUC) is another metric used in our experiments. Figures 5, 6 and 7 show the AUC results of the three models used in this paper. As shown in Fig. 7, the VGG model has the best performance among these methods.
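How an ROC curve and its AUC are obtained from classifier scores can be sketched in plain Python. The labels and scores below are made-up toy values, not outputs of the models in this paper, and the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, one per distinct score threshold.

    labels: 1 for adult content, 0 for non-adult content.
    scores: classifier scores, where higher means more likely adult.
    A sample is predicted positive when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Integrate the ROC curve with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
auc_value = area_under_curve(points)   # 8/9 for this toy example
```

The value 8/9 matches the ranking interpretation of AUC: of the nine (adult, non-adult) pairs in the toy data, eight are ordered correctly by the scores.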

Fig. 5 AUC result of AlexNet

Fig. 6 AUC result of ResNet

Fig. 7 AUC result of VGG

The runtime comparison between the same models is shown in Table 5. "GPU" refers to the original GPU-based floating-point model, and "CPU" refers to the original CPU-based floating-point model.

Table 5 Inference time on i7 CPU and 1080Ti GPU

6 Conclusion

Internet access is growing, and image production has expanded worldwide accordingly. Despite the benefits of the Internet, there are many threats that cannot be ignored, including children’s exposure to adult content images. Even though deep learning algorithms are the best-performing methods in visual content applications, they have not been widely used for adult image filtering. In this paper, an adult content image recognition approach was proposed to address this gap. A Gaussian-Bernoulli restricted Boltzmann machine was used for feature extraction to describe the images, and these features were summarized using the restricted Boltzmann machine in the summarization phase. This method achieves high efficiency with minimal complexity for our task. An efficient adult content data acquisition procedure was designed for this purpose, and the augmentation approach was improved using human observation and experience.

Moreover, these training and testing strategies do not require many labeled images. The proposed model also addressed the problem of objects appearing at various scales in an image. This model was then compared with other models for adult content image recognition. Lastly, the experimental results indicated that this method has sufficient accuracy and better performance than prior state-of-the-art approaches.