1 Introduction

Air pollution has become a serious global issue and threatens public health. Many studies [1, 2] have shown that air pollutants, especially fine particles with diameters less than 2.5 μm (PM2.5), have complicated and harmful effects on the human body. Exposure makes people more susceptible to respiratory diseases (such as asthma, emphysema, and pneumonia) and is likely to increase the incidence of cardiovascular and cerebrovascular diseases (such as ischemic heart disease, coronary heart disease, myocardial infarction, high blood pressure, and cerebral infarction). Thus, how to measure and reduce air pollution effectively has become an important and practical problem.

Nowadays, smartphones and surveillance cameras make images widely available. Together with the ever-increasing computational power for sophisticated image processing, this provides a great opportunity to quantify and analyze airborne particles from images. The studies reported in the literature can be divided into two categories: image feature-based approaches and deep learning approaches.

Among image feature-based methods, Li et al. [3] proposed a method to estimate haze levels from images. They extract two features from hazy images, a depth map and a transmission matrix, and use them to estimate haze levels with statistical methods. Mao et al. [4] proposed a method for numerically detecting haze using the statistics of various images and the atmospheric scattering model; this method can estimate the haze factor from a single image. Liu et al. [5] first extract six image features for each image (transmittance, overall and local image contrast, sky color and smoothness, and entropy) and two non-image features (solar zenith angle and humidity), then apply principal component analysis (PCA) and Sequential Backward Feature Selection (SBFS) to optimize the feature set, and finally build a support vector regression (SVR) model to predict PM2.5 indices.

Recently, deep learning has become the state-of-the-art solution for typical computer vision problems. Among CNN-based methods, Zhang et al. [6] built a CNN to classify images. The CNN has 9 convolution layers, 2 pooling layers, and 2 dropout layers. They address the vanishing gradient problem by using a modified rectified linear unit as the activation function and, to adapt the network to the air pollution problem, replace the softmax classifier with a negative log-log ordinal classifier. Chakma et al. [7] apply a VGG-16 CNN model to image-based PM2.5 level analysis. The images are classified into three classes according to their PM2.5 concentration levels using two major transfer learning strategies: CNN fine-tuning and a random forest on CNN features. Bo et al. [8] were the first to use a residual convolutional neural network (ResNet50) to predict the PM2.5 index from image information, and achieved state-of-the-art performance.

Compared with traditional image feature-based PM2.5 analysis, deep learning-based approaches tend to achieve better results owing to their simple preprocessing and more complete feature extraction. Existing networks such as VGG [9], Inception [10], and ResNet50 [11] have achieved great performance on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). However, these networks were designed for object recognition, and their complexity makes them harder to optimize and prone to over-fitting on the PM2.5 estimation task.

In this paper, we explore how to design a CNN model for air particle pollution estimation and present a deep network with fewer layers for PM2.5 Index Estimation (PMIE). Our contributions are: (1) We propose a method for discriminating the inter-layer weight distribution of convolutional neural networks, which provides a meaningful reference for CNN design. (2) We propose a shallow ResNet with layer enhancement, which not only improves the convergence speed during training but also reduces over-fitting. Moreover, for the same number of training epochs, the training time is greatly reduced because the network is shallow. (3) The proposed PMIE method achieves good performance on the datasets of [5, 8]. For the Shanghai dataset, PMIE reduces RMSE by 11.8% and increases R-squared by 4.8%. For the Beijing dataset, PMIE outperforms the ResNet50-based method reported in [8]: RMSE is reduced by 14.4% and R-squared is increased by 23.6%.

2 Methodology

The complexity of existing deep networks designed for object recognition makes them harder to optimize and prone to over-fitting in our task. Therefore, we propose a shallow convolutional neural network with layer enhancement for PM2.5 Index Estimation, called PMIE. First, we propose a network consisting of seven residual blocks, which is shallow compared with residual networks such as ResNet50 and ResNet101. In addition, we introduce a new method for enhancing the effect of a convolution layer and apply it after the convolution layers that have an obvious effect on the output. In our task, we add the layer enhancement after residual blocks six and seven. The flowchart is illustrated in Fig. 1.

Fig. 1. Flowchart of the proposed method

2.1 Layer Weight Distribution Discrimination Method

In deep learning for text, machine translation models [12] do not weight every word equally during translation; more attention is paid to the core words. Similarly, the layers of a convolutional neural network do not contribute equally. Based on this consideration, we propose a method for discriminating the inter-layer weight distribution of a CNN. For a convolutional network, we assign a random weight Kij to the output of each residual block and train Kij with the back-propagation algorithm, as shown in Fig. 2. The specific approach is as follows: (1) train the basic convolutional neural network; (2) fix the weights of the basic network and assign a random weight Kij to the output of each residual block, where i denotes the ith residual block and j denotes the jth feature map of that block; (3) train the random weights; (4) output the weights of each layer and examine their distribution.

Fig. 2. Inter-layer weight distribution discrimination method

Fig. 3. Weight distribution results on the Shanghai dataset
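The four steps of Sect. 2.1 can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: the frozen network is reduced to a sum of precomputed per-block contributions, one scalar weight is learned per block instead of one Kij per feature map, and the function name `learn_block_weights` and the toy data are assumptions made only for illustration.

```python
import random

def learn_block_weights(block_outputs, targets, lr=0.1, steps=500, seed=0):
    """Learn one scalar weight per block by gradient descent on the MSE.

    block_outputs: list of samples, each a list of frozen per-block contributions.
    targets: list of observed values, one per sample.
    """
    rng = random.Random(seed)
    n_blocks = len(block_outputs[0])
    n = len(targets)
    # Step (2): random initial weights; the block outputs themselves stay fixed.
    k = [rng.random() for _ in range(n_blocks)]
    for _ in range(steps):
        grad = [0.0] * n_blocks
        for h, y in zip(block_outputs, targets):
            err = sum(ki * hi for ki, hi in zip(k, h)) - y
            for i in range(n_blocks):
                grad[i] += 2.0 * err * h[i] / n
        # Step (3): update only the weights K, never the frozen features.
        k = [ki - lr * gi for ki, gi in zip(k, grad)]
    return k

# Hypothetical toy data: three "blocks", with the target dominated by the last one.
data_rng = random.Random(1)
H = [[data_rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(200)]
y = [0.1 * h[0] + 0.2 * h[1] + 2.0 * h[2] for h in H]

# Step (4): inspect the learned weights; the largest one marks the most influential block.
K = learn_block_weights(H, y)
```

In this toy setting the learned weights recover how strongly each block drives the output, which is the kind of distribution the method inspects to decide where enhancement should be applied.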

2.2 Shallow ResNet

Applying deep learning methods to image-based PM2.5 index estimation is a challenging task. Existing CNN models are suited to object recognition, whereas our task is to determine whether object edges and image textures are clear. The existing deep and large networks are difficult to train and prone to over-fitting on the PM2.5 index estimation task. Therefore, we propose a shallow ResNet with fewer layers than existing architectures. The architecture, presented in Fig. 1, takes a square 224 × 224 pixel RGB image as input and is composed of one convolutional layer, one pooling layer, and seven residual blocks selected from ResNet-50 [11]. The number of residual blocks was chosen empirically; the results are shown in Fig. 4.

Fig. 4. Results for different numbers of residual blocks and training methods on the Shanghai dataset

Shallow architectures tend to learn low-level features such as edges, lines, textures, and colors. As the network deepens, the features extracted by the layers become more semantic and gradually shift toward the shapes of objects. In our PM2.5 estimation task, the focus is not to identify the object itself but to determine whether the edges or lines of the object are clear, so a shallow architecture is more suitable.

2.3 Layer Enhancement

The attention mechanism [13] was first used in text classification and has recently been widely used in object detection. In object detection, the initially selected region of interest (ROI) is given a higher weight so that it receives more attention. Inspired by this, we propose an enhancement method that multiplies each weighted probability value learned by the convolutional layer by itself, so that the effects of activation and suppression are amplified; this acts as a form of image enhancement. At a convolution layer, the previous layer's feature maps are convolved with learnable kernels and passed through the activation function to form the output feature map. Based on the weight distribution results of Sect. 2.1, shown in Fig. 3, we add enhancements after residual blocks six and seven. That is, each output map generally combines convolutions over multiple input maps, except that the outputs after residual blocks six and seven are obtained by multiplying each input map by itself. In general, we have that

$$ x_{j}^{l} = \begin{cases} f\left( \sum\nolimits_{i \in M_{j}} x_{i}^{l - 1} * k_{ij}^{l} + b_{j}^{l} \right), & l = \text{others} \\ x_{ij}^{l - 1} * x_{ij}^{l - 1}, & l = 7, 9 \end{cases} $$
(1)

where \( x^{l} \) denotes the pixel values of the lth block, f is the activation function, \( k_{ij}^{l} \) are the learnable kernels, \( b_{j}^{l} \) is the bias, and \( M_{j} \) is a selection of input maps. Accordingly, the proposed PMIE with enhancement suppresses features that have little relationship with the output and strengthens features of greater relevance. Several examples of the enhancement are depicted in Fig. 5.

Fig. 5. Layer enhancement samples
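The second case of Eq. (1) amounts to an element-wise product of a feature map with itself. A minimal sketch follows, assuming the map is a plain 2-D list with values in [0, 1]; the function name `enhance` and the sample values are hypothetical and are not taken from the paper.

```python
def enhance(feature_map):
    """Multiply every value of a 2-D feature map by itself (element-wise square)."""
    return [[v * v for v in row] for row in feature_map]

# Hypothetical 2x3 feature map with values in [0, 1].
fmap = [[0.9, 0.5, 0.1],
        [0.2, 1.0, 0.0]]
enhanced = enhance(fmap)

# Ratio between the strongest and weakest response in the first row, before and after.
ratio_before = fmap[0][0] / fmap[0][2]
ratio_after = enhanced[0][0] / enhanced[0][2]
```

For values in [0, 1], weak responses shrink proportionally more than strong ones, so the ratio between a strong and a weak response grows after enhancement. This matches the suppress-and-strengthen behavior described above.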

2.4 Training and Testing

The PMIE model is trained by back-propagation with mini-batch stochastic gradient descent so that the mean squared error (MSE) loss is minimized. RGB images from the training dataset are resized to 224 × 224 × 3 and fed to the PMIE model for training. The observed PM2.5 index of each training image is used to calculate the MSE loss. Fine-tuning was adopted because the dataset is not large enough to train the full CNN from scratch. There are two possible approaches to fine-tuning a pretrained network: the first is to fine-tune all layers of the CNN; the other is to keep some of the earlier layers fixed and fine-tune only the higher-level layers. In this paper, we fine-tuned all layers, starting from parameters learned on the ImageNet dataset.

Training proceeds epoch by epoch, and the CNN parameters are kept according to the best results on the validation set. After training, the test images are fed to the trained model to obtain the predicted PM2.5 indices.

3 Experiments

3.1 Dataset

We evaluate PM2.5 prediction on two image datasets: the Shanghai dataset and the Beijing dataset. (1) The Shanghai dataset is a single-scene dataset [6]. It contains 1885 pictures captured at the Oriental Pearl Tower in Shanghai, China, at different times from May to December 2014. (2) The Beijing dataset is a non-single-scene dataset that contains 1514 pictures we collected from a Beijing tourist website [14]. These pictures were captured at diverse locations in Beijing, China.

The U.S. consulates in Beijing and Shanghai provide hourly PM2.5 indices, which we used to retrieve the PM2.5 indices for the two datasets. Figure 6 shows the histograms of the PM2.5 index for the two datasets.

Fig. 6. Histograms of the PM2.5 index for the Shanghai and Beijing datasets

3.2 Evaluation Protocols

We use root mean squared error (RMSE) and R-squared to evaluate the prediction error. RMSE is defined as:

$$ \text{RMSE} = \sqrt{\frac{1}{N}\sum\nolimits_{i = 1}^{N} \left( y_{i} - y_{i}^{\prime} \right)^{2}}, $$
(2)

where \( y_{i}^{\prime} \) is the ith forecast value and \( y_{i} \) is the ith observed value, i = 1, 2, …, N. R-squared is defined as:

$$ R^{2} = 1 - \frac{{\sum\nolimits_{i = 1}^{N} {\left( {y_{i} - y_{i}^{{\prime }} } \right)^{2} } }}{{\sum\nolimits_{i = 1}^{N} {\left( {y_{i} - avg\left( y \right)} \right)^{2} } }}, $$
(3)

where \( y_{i}^{\prime} \) is the ith forecast value, \( y_{i} \) is the ith observed value, and avg(y) is the average observed value, i = 1, 2, …, N. R-squared increases with the agreement between the observed and forecast values and has a maximum value of 1, which indicates the best match.
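Equations (2) and (3) translate directly into code. The sketch below uses only the Python standard library; the function names and sample values are illustrative assumptions, and avg(y) is computed as the mean of the observed values, which is the usual convention for R-squared.

```python
import math

def rmse(observed, forecast):
    """Root mean squared error, Eq. (2)."""
    n = len(observed)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, forecast)) / n)

def r_squared(observed, forecast):
    """Coefficient of determination, Eq. (3)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, forecast))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and forecast PM2.5 indices, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
forecast = [12.0, 18.0, 33.0, 37.0]
error = rmse(observed, forecast)
fit = r_squared(observed, forecast)
```

A perfect forecast gives an RMSE of 0 and an R-squared of 1; a forecast no better than always predicting the observed mean gives an R-squared of 0.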

3.3 Experiment Setting

To train and evaluate our PMIE network, we randomly select 80% of the images as the training set used to tune the CNN, 10% as the validation set, and 10% as the test set. To fine-tune the model, we load the weights of the first convolutional layer and the first seven residual blocks from ResNet-50. After loading the pretrained weights, we adjust all the parameters of the PMIE network to fit our goal.
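A random 80/10/10 split can be sketched as follows. The helper name `split_indices`, the fixed seed, and the choice to give any rounding remainder to the test set are assumptions made for illustration; the paper does not specify how the split was implemented.

```python
import random

def split_indices(n_images, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle image indices and cut them into train, validation, and test lists."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    n_train = int(n_images * train_frac)
    n_val = int(n_images * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]   # remainder, so no image is dropped
    return train, val, test

# 1885 is the size of the Shanghai dataset described in Sect. 3.1.
train_idx, val_idx, test_idx = split_indices(1885)
```

Fixing the seed keeps the three subsets reproducible across runs, which matters when comparing networks on the same test images.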

For the PMIE training parameters, we ran the code for 500 iterations and set the learning rate to a small value of 0.001. The batch size is 64 and the momentum is set to 0.9. The program was implemented using Keras 2.05 on Ubuntu 16.04.
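To make the role of these hyperparameters concrete, the sketch below applies mini-batch SGD with classical momentum and an MSE loss to a toy one-dimensional linear regression. It is not the PMIE training code: the model, the synthetic data, and the reading of "500 iterations" as 500 passes over the data are assumptions; only the learning rate, momentum, and batch size come from the text.

```python
import random

def sgd_momentum_fit(xs, ys, lr=0.001, momentum=0.9, batch_size=64, epochs=500, seed=0):
    """Fit y = w*x + b by mini-batch SGD with momentum on the MSE loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    vw, vb = 0.0, 0.0                      # velocity terms, one per parameter
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)                 # new mini-batches every epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            gw = gb = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                gw += 2.0 * err * xs[i] / len(batch)   # d(MSE)/dw
                gb += 2.0 * err / len(batch)           # d(MSE)/db
            vw = momentum * vw - lr * gw
            vb = momentum * vb - lr * gb
            w += vw
            b += vb
    return w, b

# Hypothetical noise-free data generated from y = 3x + 1.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(640)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = sgd_momentum_fit(xs, ys)
```

With a momentum of 0.9, the velocity accumulates past gradients, so the small learning rate of 0.001 behaves roughly like a step ten times larger along directions where the gradient is consistent.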

3.4 Experiments Result

In this study, we measure the performance of the proposed method using the evaluation protocol described in Sect. 3.2. Table 1 presents the RMSE and R-squared values of PMIE on the two datasets, together with the results reported in [6, 8]. For a more complete comparison, we also include the VGG16 network.

Table 1. PMIE performance vs other networks

It is clear that deep learning methods such as ResNet and VGGNet achieve better results than the traditional image feature extraction method of [6]. Since our network is derived from ResNet50, we focus on comparing PMIE with the ResNet50 reported in [8]. Our method performs better on both the Beijing and Shanghai datasets.

For the Shanghai dataset, PMIE reduces RMSE by 11.86% and increases R-squared by 4.59% relative to the ResNet50 of [8]. For the Beijing dataset, the RMSE and R-squared values of PMIE are 50.64 and 0.68, a reduction of 14.38% and an increase of 23.63%, respectively. In addition, compared with the standard ResNet50 network, the training time of PMIE is greatly reduced.
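Percentages such as these are relative changes with respect to the baseline network. A small helper makes the arithmetic explicit; the function name and the input values below are hypothetical and are not results from Table 1.

```python
def relative_change_percent(baseline, new):
    """Signed change of `new` relative to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# Hypothetical values for illustration only; they are not results from Table 1.
rmse_change = relative_change_percent(baseline=100.0, new=80.0)  # negative: RMSE went down
r2_change = relative_change_percent(baseline=0.50, new=0.60)     # positive: R-squared went up
```

A negative value is an improvement for RMSE, where lower is better, while a positive value is an improvement for R-squared.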

Figure 7 shows the correlations between the estimated PM2.5 indices and observed PM2.5 indices of Shanghai dataset and Beijing dataset.

Fig. 7. Correlation between estimated and observed PM2.5 indices

Figure 8 shows the training and validation loss of PMIE and ResNet50 on the two datasets. Overall, the proposed method performs better on the validation sets and is better at overcoming over-fitting. In addition, on the non-single-scene Beijing dataset, our method converges faster during training and behaves more steadily during both training and validation.

Fig. 8. Training loss and validation loss of the proposed PMIE and ResNet50 for the Shanghai and Beijing datasets

Figure 9 shows images with their observed and estimated PM2.5 indices. The first two rows are from the Shanghai dataset and the last two rows are from the Beijing dataset. The first and third rows show pictures with accurate predictions; the second and last rows show pictures with inaccurate predictions. Analyzing the dataset reveals some reasons for the inaccurate predictions. For the images in the second and last rows, the visual impression differs from the image label. For example, the 1st image in the last row looks very clear but its actual PM2.5 index is 80, and the 3rd picture in the last row has a larger estimated index than observed index because of its gray hue. In addition, the lack of high-PM2.5 images in the dataset also leads to inaccurate predictions on such images. On the other hand, this shows that our algorithm is accurate for most images; the larger errors arise for pictures whose labels do not match their subjective visual appearance.

Fig. 9. Example images with their observed and estimated PM2.5 indices

4 Conclusion

In this paper, we proposed PMIE, a network that stacks residual blocks with enhancement to estimate PM2.5 from images. Our main findings are that a shallow CNN model and convolutional layers with enhancement provide better performance than typical deep CNN architectures. We also studied the training and validation loss. The results on the single-scene Shanghai dataset and the non-single-scene Beijing dataset outperform the state of the art.