1 Introduction

Object detection schemes for industrial applications have to balance the requirements of minimal cost, system miniaturization, and ease of deployment. The current trend is to implement object detection with sensing nodes and wireless devices, so that systems can adapt to changing environments [1]. To this end, the wireless sensor network is deemed best able to extend control from cyberspace into the physical world [2]. In recent years, the promise of wireless sensing networks has led to the development of visual object identification and tracking without thousands of meters of electric cables [3]. The design of camera-based networks shows the potential of object detection in fields such as healthcare monitoring, military operations, and home security [4–6]. As the technology matures, high-resolution cameras integrated with wireless sensing networks turn these systems into real-time image sensing platforms, namely wireless visual sensor networks, which translate multiple signals into visual presentations. In recent years, the application of wireless image sensing networks to surveillance has been the most pronounced, specifically because of its non-intrusive fashion of data collection [7, 8]. As a result, sensing elements such as infrared, video, and radio frequency identification bridge the gap between the real physical world and the virtual space [9–11]. Methodologies for security monitoring are therefore required for the research and development of wireless sensor network-based object detection.

In the network design process, issues of system reliability, high precision, and fast response are evaluated to set the baseline. As such, the working parameters are crucial for object detection. The most fundamental setting involves the basic processing approach. The detection pipeline contains target capture, object outlining and labeling, bounding box generation, object recognition, and so forth. However, due to the demand for real-time image processing, the detection algorithm is a complex task, especially for moving objects.

Inspired by ongoing research in image processing, we design and deploy a surveillance scheme using a state-of-the-art detection model, the SSD (Single Shot MultiBox Detector), which combines high detection accuracy with real-time speed [12]. In this research, we propose an improved SSD integrated with a wireless sensor network, aiming to improve current visual object tracking systems. Following an efficient end-to-end detection process, the objective of this work is to provide an image capture and processing strategy that optimizes security monitoring via a wireless visual sensing network.

The remainder of this paper is organized as follows: the background theories used in this study are introduced in Section 2; the system design and hardware architecture are presented in Section 3, followed by the object detection algorithm in Section 4. The experimental outcomes and data analysis results are shown in Section 5. Conclusions and discussion are given in the last section.

2 Preliminaries

2.1 WiFi-based signal transmission

Contactless sensing technology is in rising demand in daily life, and non-invasive detection has raised extensive interest during the past decade [11–13]. Currently, WiFi networks are the most widespread mode of signal exchange for Internet access and local area connections, such as an in-home WiFi network involving both mobile and stationary devices [14]. While wireless transmission has had its most profound impact outdoors, the highly developed wireless technology also provides opportunities to exploit WiFi-based sensor networks [15]. In contrast to traditional wireless sensor networks based on Zigbee modules, WiFi transmission reaches a 300 Mbps data transfer rate, which is more efficient and has lower delay [16]. On the other hand, since WiFi networks are already deployed in indoor environments, significant hardware cost can be saved by building the system on existing WiFi network resources. In addition, each WiFi communication node can easily support 100 wireless connections [17].

Figure 1 shows the typical structure of a WiFi-based wireless sensor network, which contains router nodes, a routing base station, and a host computer [18]. The design of the system rests on two principles: (1) each detection node is equal to any other, and (2) each communication path is parallel to any other. Specifically, router nodes are deployed in the field beforehand and search for and connect to the WiFi base station automatically. For data exchange, the routing base station is the bridge between the wireless routers and the host computer. All router nodes are initialized in the listening state. The host computer sends commands to the nodes via the routing station, i.e., signals are acquired and transmitted to the host computer according to the collection command. All commands are generated by the host computer. When the stop command is sent, the nodes switch to standby (also termed low-power mode) until the next command arrives [19].
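To make the command flow concrete, the minimal sketch below models the node states described above (listening, collecting, standby); the command names and the RouterNode class are illustrative assumptions rather than the actual firmware interface.

```python
# Minimal sketch of node-side command handling as described above.
# The command strings ("START", "STOP") and the RouterNode class are
# hypothetical; the real node firmware protocol is not specified in this paper.
import enum


class NodeState(enum.Enum):
    LISTENING = "listening"    # initial state: waiting for commands
    COLLECTING = "collecting"  # acquiring and forwarding sensor data
    STANDBY = "standby"        # low-power mode after a stop command


class RouterNode:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.state = NodeState.LISTENING

    def handle_command(self, command: str) -> None:
        if command == "START":
            self.state = NodeState.COLLECTING
        elif command == "STOP":
            # remain in low-power standby until the next command arrives
            self.state = NodeState.STANDBY
        # any other command leaves the state unchanged in this sketch


node = RouterNode("cam-01")
node.handle_command("START")
print(node.node_id, node.state.value)  # cam-01 collecting
```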

Fig. 1
figure 1

Classical architecture of WiFi-based wireless sensor network

2.2 Basic model of SSD

We now briefly describe the working principle of SSD; for more details, refer to the technical reports [11, 20]. Based on the deep learning paradigm, the Single Shot MultiBox Detector (SSD) is built on a standard architecture referred to as the base network. In the original proposal, the base network is a modified VGG16 [21]. Additional convolutional feature layers are appended to predict detections at multiple scales, and each layer produces a set of detection predictions via convolutional filters. In addition, a set of default bounding boxes over different aspect ratios is associated with each feature map cell. A typical SSD 300 structure is given in Fig. 2.

Fig. 2
figure 2

Architecture of SSD 300

In the VGG16 version of SSD, the grid scales of the feature map cells are carefully designed across the different prediction layers [22]. Accordingly, every single point on the feature maps can observe a sufficiently large area of the input image. SSD mitigates the deficiencies of slow inference and large computational cost by making predictions from multi-scale feature maps in a hierarchical manner. Moreover, the recognition accuracy of SSD outperforms other object detectors, such as Faster R-CNN and YOLO (You Only Look Once), on the test image datasets from PASCAL VOC [23, 24] and MS COCO [25]. For these reasons, the SSD algorithm holds great promise for effective target detection in fields with specific demands on image processing.
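As an illustration of the multi-scale design, the short sketch below computes the default box scale assigned to each of the m prediction layers, following the linear rule and the s_min = 0.2, s_max = 0.9 values of the original SSD paper; a concrete implementation may tune these values.

```python
# Default box scales over the m prediction layers, following the rule of the
# original SSD paper (s_min = 0.2, s_max = 0.9); exact values may differ in a
# given implementation.
def ssd_scales(m: int, s_min: float = 0.2, s_max: float = 0.9) -> list:
    """Scale of the default boxes for each of the m feature maps."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


print([round(s, 2) for s in ssd_scales(6)])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```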

3 System architecture

A wireless visual sensing network integrates a number of cameras to collect time-series scenarios and to characterize and identify intrusion targets [26]. Thereby, corresponding defense actions can be taken in a timely manner [27]. In this study, we therefore focus on the development of a security monitoring system based on image processing approaches, through which we derive some general methods for building a surveillance system. A wireless camera sensing network is designed and built in our system. The hardware frame diagram of the security monitoring system is shown in Fig. 3. A variety of scenarios, together with their location information, are collected from the camera nodes. To this end, the AMN14112 camera sensing element, which is able to detect objects within a range of 10 m, is deployed in the network. Each sensing node is hard-wired to a communication unit within a predefined area.

Fig. 3
figure 3

Hardware frame diagram of security monitoring system

For communication, WiFi-based wireless transmission modules are employed, which exchange information over the wireless sensor network. As described in Section 2, wireless routers are used to relay the sensing signals from the sensors because of their cost-effectiveness over long distances. The main function of the WiFi network is therefore to deliver captured images from the cameras to the host computer through mutual cooperation (Fig. 4). In this study, the IEEE 802.11b standard is adopted in view of the lower loss on the continuous 2.4 GHz ISM band [28]. To match the transmission path, the gain is set to 36 dB in line with a transmitting power of 1.5 W. As mentioned before, the WiFi-based transmission can reach a 300 Mbps data transfer rate, but packet loss still cannot be eliminated [29]. The image transmission path is therefore arranged as an open-loop transmitting scheme according to energy efficiency considerations.
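The following sketch illustrates what such an open-loop transfer could look like at the application level: a captured frame is split into datagrams and sent without per-packet acknowledgement. The base station address and chunk size are hypothetical; the actual protocol of the routing modules is not specified here.

```python
# Illustrative sketch of an open-loop (no per-packet acknowledgement) image
# transfer: a frame is split into fixed-size UDP datagrams. The host/port and
# chunk size are assumptions, not taken from the described hardware.
import socket

HOST, PORT = "192.168.1.100", 9000   # hypothetical address of the base station
CHUNK = 1400                          # stay below a typical WiFi MTU


def send_frame(frame_bytes: bytes, frame_id: int) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for offset in range(0, len(frame_bytes), CHUNK):
            header = frame_id.to_bytes(4, "big") + offset.to_bytes(4, "big")
            sock.sendto(header + frame_bytes[offset:offset + CHUNK], (HOST, PORT))
    finally:
        sock.close()
```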

Fig. 4
figure 4

Hardware frame diagram of security monitoring system

Visual signals are then transferred to the computer terminal for display and storage. Consequently, the camera sensing network is extended from the sensor nodes to the image processing methodology. Targets are identified by the object detection algorithm simultaneously. Users can further analyze the visual information through the link to the host computer. The system architecture is shown in Fig. 3. The undirected graph of the network is converted into a directed graph and then stored in the computer to facilitate processing. In this way, object recognition based on visual sensing can be applied to the security monitoring system.

4 Image processing methodology

4.1 Detection dataset generating

Publicly available object detection datasets have been commonly employed in recent years [30, 31], and an increasing number of classification samples have been turned into detection samples for model training and testing [32]. In this research, however, a dedicated study of outdoor environmental objects is conducted to train the detection model on given targets after a labeling and identification process. We therefore first describe the principle of making a detection sample from a classification sample.

The idea of describing a detection sample is to compute the coordinates of a given object \(D=\left\{c_{x},c_{y},w,h,\text{category}\right\}\), where \(c_{x}\) and \(c_{y}\) represent the coordinates of the ground truth box center, \(w\) and \(h\) indicate the size via width and height, and category is the label.

In other words, the coordinate vectors of an input image sample segmented into m×n pixels are defined as

$$ L=\left\{\vec{l_{1}},\vec{l_{2}},...,\vec{l_{m}}\right\}, D=\left\{\vec{d_{1}},\vec{d_{2}},...,\vec{d_{n}}\right\} $$
(1)

where L and D are the left border and the bottom border of the image, given as coordinate feature vectors. A classical algorithm for obtaining the target boundary is clustering. In this system, a single detection object \(O_{j}\) is clustered from a set of pixels, \(O_{j} \in O=\left\{O_{1},O_{2},...,O_{M}\right\}\), with the corresponding coordinate vector \(C^{j}=\left\{\vec{c}_{1}^{j},\vec{c}_{2}^{j},...,\vec{c}_{K}^{j}\right\}\). Let us now concentrate on the size of the object \(O_{j}\)

$$ T^{j}=\left\{\vec{t}_{u}^{j},\vec{t}_{d}^{j},\vec{t}_{l}^{j},\vec{t}_{r}^{j}\right\} $$
(2)

and define the parameters

$$ \begin{aligned} \vec{t}_{u}^{j}=\max \limits_{k \in K} \text{distance}\left(D, \vec{c}_{k}^{j}\right)\\ \vec{t}_{d}^{j}=\min \limits_{k \in K} \text{distance}\left(D, \vec{c}_{k}^{j}\right)\\ \vec{t}_{l}^{j}=\min \limits_{k \in K} \text{distance}\left(L, \vec{c}_{k}^{j}\right)\\ \vec{t}_{r}^{j}=\max \limits_{k \in K} \text{distance}\left(L, \vec{c}_{k}^{j}\right) \end{aligned} $$
(3)

where \(\vec{t}_{u}^{j}, \vec{t}_{d}^{j}, \vec{t}_{l}^{j}\), and \(\vec{t}_{r}^{j}\) represent the upper, bottom, left, and right vertices of the object, and distance(·) computes the distance between a border and a coordinate vector.

Generally, the computation of the object boundary is facilitated by using the anchor box coordinates, whose vertices are (x1,y1), (x2,Δy), (Δx,y3), and (x4,y4). From these anchor box vertices, the basic parameters follow immediately, as shown in Eq. (4).

$$ \begin{aligned} w&=x_{4}-\Delta x \\ h&=y_{1}-\Delta y \\ (c_{x},c_{y})&=\left(\Delta x +\frac{w}{2}, \Delta y + \frac{h}{2}\right) \end{aligned} $$
(4)

Before assigning a specific label, the confidence value is calculated (Fig. 4). If the confidence is ≥ 0.87, we obtain the ground truth box with the label \(\left \{\left (c_{x}^{j},c_{y}^{j}\right),w,h,c\right \}\).
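The sample-generation step of Eqs. (2)–(4) can be summarized in the sketch below, which turns the extreme pixel coordinates of one clustered object into a ground truth box; the cluster coordinates and confidence value are assumed to be supplied by the earlier clustering and scoring stages.

```python
# Sketch of Eqs. (2)-(4): the extreme pixel coordinates of a clustered object
# are turned into a ground truth box (cx, cy, w, h, label). The 0.87 confidence
# threshold follows the text; the inputs are assumed to come from earlier stages.
def cluster_to_sample(cluster_xy, label, confidence, threshold=0.87):
    """cluster_xy: list of (x, y) pixel coordinates belonging to one object."""
    if confidence < threshold:
        return None
    xs = [x for x, _ in cluster_xy]
    ys = [y for _, y in cluster_xy]
    dx, y_top = min(xs), max(ys)        # left-most and upper-most extremes
    x_right, dy = max(xs), min(ys)      # right-most and bottom-most extremes
    w, h = x_right - dx, y_top - dy     # Eq. (4)
    cx, cy = dx + w / 2, dy + h / 2     # box center
    return {"cx": cx, "cy": cy, "w": w, "h": h, "category": label}


sample = cluster_to_sample([(10, 20), (50, 80), (30, 60)], "person", 0.92)
print(sample)  # {'cx': 30.0, 'cy': 50.0, 'w': 40, 'h': 60, 'category': 'person'}
```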

Note that, when processing the surveillance images, the resulting detection samples can be used to train the models in application. This step is therefore important for security monitoring algorithms that deal with objects in daily life environments.

The framework of our proposed model is presented in Fig. 5; it takes a classical VGG16 as its base network for feature mapping [21]. Instead of using SSD directly, we fine-tune the VGG16 by keeping the base layers up to conv4_3 and converting FC6 and FC7 into convolutional layers [33]. As shown in Fig. 5, the convolutional layers in our model decrease in size progressively.
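The sketch below shows one way this base-network modification is commonly realized in public SSD implementations built on torchvision: VGG16 is truncated after conv4_3, and FC6/FC7 are replaced by a dilated 3×3 convolution and a 1×1 convolution. The layer indices and hyperparameters follow common open-source practice and are given for illustration only.

```python
# Sketch of the base-network modification in the style of public SSD
# implementations; indices and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision

vgg = torchvision.models.vgg16(weights=None).features
conv4_3_block = nn.Sequential(*list(vgg.children())[:23])   # up to conv4_3 + ReLU
conv5_block = nn.Sequential(*list(vgg.children())[23:30])   # pool4 .. conv5_3 + ReLU
pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)    # replaces original pool5
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)  # ex-FC6
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)                         # ex-FC7

x = torch.randn(1, 3, 300, 300)
f1 = conv4_3_block(x)                                        # first prediction source
f2 = conv7(torch.relu(conv6(torch.relu(pool5(conv5_block(f1))))))  # second source
print(f1.shape, f2.shape)
```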

Fig. 5
figure 5

Architecture of proposed model

Similar to SSD, default bounding boxes are assigned to each layer. To separate small objects from the background, we intuitively go through the images and the feature maps from the top layer to the bottom one. As the feature map sizes and the image resolution differ considerably, the bounding boxes have to be assigned carefully to facilitate processing. For image segmentation, the bounding boxes are determined not only by the layer but also by the cell. Since bounding boxes cannot always cover the targets exactly, we place more boxes in the cells of low-resolution layers and fewer in those of high-resolution layers. In general, four offsets are regressed for every bounding box of each cell, so that objects can be localized effectively however much the cell size varies. As a result, the proposed model yields new feature maps that expedite small-object detection.
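For concreteness, the sketch below generates the default boxes of one feature map cell and the four offsets regressed against a ground truth box, following the encoding convention of the original SSD paper; the per-layer box counts of our model (Table 1) are not reproduced here.

```python
# Per-cell default boxes and the four regression offsets, following the
# convention of the original SSD paper; aspect ratios here are illustrative.
import math


def default_boxes_for_cell(i, j, fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Default boxes (cx, cy, w, h), normalised to [0, 1], for cell (i, j)."""
    cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
    return [(cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar))
            for ar in aspect_ratios]


def encode_offsets(default_box, gt_box):
    """Four offsets regressed for one default box against a ground-truth box."""
    dcx, dcy, dw, dh = default_box
    gcx, gcy, gw, gh = gt_box
    return ((gcx - dcx) / dw, (gcy - dcy) / dh,
            math.log(gw / dw), math.log(gh / dh))


boxes = default_boxes_for_cell(3, 4, fmap_size=10, scale=0.34)
print(encode_offsets(boxes[0], (0.47, 0.36, 0.30, 0.40)))
```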

Features of each bounding box are extracted via a 3×3 convolution in each layer. Since SSD branches out into separate prediction procedures, we additionally attach a classifier and a regressor after the basic convolutional layers. Because the Softmax classifier performs well in SSD, we employ it here to predict the category of each object. The output of a linear regressor is the offset between the prior bounding box and the ground truth box. Both heads operate on the output of the convolutional layers. The final detection decisions are then made through non-maximum suppression.
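A minimal sketch of such prediction heads is given below: a 3×3 convolution yields per-box class scores (passed through Softmax) and four box offsets for every default box of every cell. The channel counts, number of classes, and boxes per cell are illustrative assumptions.

```python
# Sketch of per-layer prediction heads: class scores and box offsets for every
# default box of every cell; channel counts and thresholds are assumptions.
import torch
import torch.nn as nn

num_classes, boxes_per_cell = 21, 4
in_channels = 512                                   # channels of one feature map

cls_head = nn.Conv2d(in_channels, boxes_per_cell * num_classes, 3, padding=1)
reg_head = nn.Conv2d(in_channels, boxes_per_cell * 4, 3, padding=1)

feature_map = torch.randn(1, in_channels, 38, 38)
scores = torch.softmax(
    cls_head(feature_map).permute(0, 2, 3, 1).reshape(1, -1, num_classes), dim=-1)
offsets = reg_head(feature_map).permute(0, 2, 3, 1).reshape(1, -1, 4)
print(scores.shape, offsets.shape)   # per-box class probabilities and offsets

# Final detections would then be selected with non-maximum suppression, e.g.
# torchvision.ops.nms(decoded_boxes, class_scores, iou_threshold=0.45).
```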

The average intersection over union (IoU) is used as a key parameter to indicate the detection accuracy [34, 35]. We thus characterize current object detection models with respect to this property. As shown in Table 1, in the most frequently encountered surveillance cases, the proposed model obtains a better IoU thanks to the different bounding boxes deployed, from which we conclude that the default boxes of SSD are not efficient for small-object recognition.
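For reference, the IoU of two axis-aligned boxes can be computed as follows; this snippet only illustrates the accuracy measure and is not tied to a particular detector implementation.

```python
# IoU of two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))  # 0.143
```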

Table 1 Network configuration parameters

5 Experimental results

Experiments are carried out on the proposed security monitoring system to evaluate both the wireless sensor network and the object detection algorithm. In this test, thirty AMN14112 cameras are used for visual signal collection. During the monitoring period, we choose one intrusion scenario from all the acquired images. The image is sensed by one of the camera nodes in the network. Once connected to the base station, the visual signals are delivered with the help of the WiFi routing modules, with the signal paths following the WiFi communication principle described earlier. After the visual data are transmitted to the host computer, the raw images are processed with the aforementioned model. The model is applied to the surveillance system after training, which aims at minimizing the loss of the objective function and improving the detection accuracy [36]. To further optimize the working performance of the model, we generate object detection samples from images of the surveillance system following the principle described in Section 4.1 and pre-train our model on them for 150k iterations. In the identification phase, images are segmented according to different resolutions. Bounding boxes are generated and assigned to every single cell within the image. Features at different resolutions are extracted through the various convolutional layers of the detection network. The category of each object is predicted by the Softmax classifier, and the bounding box offsets are calculated by the linear regressor. Experimental results of the system are exhibited in Fig. 6.

Fig. 6
figure 6

Adaptive multi-resolution boundary box detection

In addition, we apply our system to further experimental settings in surveillance, specifically for small object detection. Table 2 in particular shows that the proposed model achieves a higher detection accuracy than the state-of-the-art approaches. The mean average precision (mAP), specified as the 11-point interpolated average precision, is taken as the basic parameter for quantitatively examining the recognition precision [37].
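For clarity, the 11-point interpolated average precision can be computed as in the sketch below; the precision/recall values used in the example are illustrative only.

```python
# 11-point interpolated average precision, as used for the mAP values reported
# below; the precision/recall samples here are illustrative only.
def interpolated_ap_11pt(recalls, precisions):
    """recalls, precisions: parallel lists sampled along the PR curve."""
    ap = 0.0
    for r_threshold in [i / 10 for i in range(11)]:          # 0.0, 0.1, ..., 1.0
        candidates = [p for r, p in zip(recalls, precisions) if r >= r_threshold]
        ap += max(candidates) if candidates else 0.0
    return ap / 11


print(round(interpolated_ap_11pt([0.2, 0.5, 0.8, 1.0], [1.0, 0.9, 0.7, 0.5]), 3))  # 0.8
```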

Table 2 Small object detection results on PASCAL VOC2007

The experimental outcomes provide evidence that our proposed model outperforms the other methods and reaches 78.1% mAP. In particular, the effectiveness of the classifier and the regressor is significant, as can be observed from the contrast with R-CNN. To further assess the working properties, the recognition outcomes for different objects are recorded. Compared to SSD, our proposed model clearly boosts the performance of the original SSD (Fig. 7).

Fig. 7
figure 7

Small object detection outcomes comparing to SSD

On the other hand, the runtime as well as the system resource occupation are computed for further maintenance. As long as the image processing model maintains a speed of 24 frames per second, objects can be detected in real time. According to the equipment configuration, the sampling frequency of the sensing cameras is set to 25 fps at a resolution of 1280×720×30. Regarding signal delivery, the WiFi-based transmission speed reaches 3.5 Mbps per camera; thus, the aggregate bandwidth over the 30 communication paths is 3.5 Mbps × 30 = 105 Mbps. Since the resource occupation is approximately 1 GB per second, we adopt a 2.5 GB/s network card to ensure system stability. Likewise, the frame buffer is set to 9 GB to match the visual signal processing demand. As more cameras and transmitting nodes are added, the resource occupation of the system increases accordingly.
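As a rough consistency check of the quoted transmission figures, the following sketch aggregates the per-camera stream rate over the 30 cameras; only the 3.5 Mbps rate and the camera count are taken from the text, the rest is plain arithmetic.

```python
# Rough consistency check of the transmission figures quoted above.
# Only the per-camera rate (3.5 Mbps) and the number of cameras (30) are taken
# from the text; the aggregate value is simple arithmetic.
per_camera_mbps = 3.5
num_cameras = 30

aggregate_mbps = per_camera_mbps * num_cameras   # 3.5 Mbps x 30 = 105 Mbps
print(f"aggregate throughput: {aggregate_mbps:.0f} Mbps "
      f"(~{aggregate_mbps / 8:.1f} MB/s)")
```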

6 Conclusions

In this paper, a security monitoring system is described. The system architecture is built on a wireless sensor network using cameras as the sensing elements. Based on the WiFi transmission principle, the visual signals are sent to the image processing pipeline running on the host computer. The method employed in this system is based on the state-of-the-art object detection algorithm SSD. This work extends the original SSD by analyzing the images captured from the surveillance system, and the corresponding detection dataset is constructed for model training and refinement.

We conduct experiments on the proposed system to evaluate its working performance in relation to the wireless sensing network. The results indicate that our model outperforms current image recognition algorithms in detection accuracy. The configuration of both the equipment and the communication path is also reported for further deployment, which offers guidance for devising security monitoring systems.

Future work will focus on more complex situations where multiple sensing elements are used in the system to obtain more precise information for security monitoring. Research is ongoing to verify whether the current system can also be extended to a multi-visual case. Although the system can precisely detect objects in the collected images, it remains an open question whether it can be integrated with other functions and applied to different surveillance systems.