Dataset selection
In the Big Data era, large amounts of data are available on almost any topic. However, finding properly labeled data is still far from trivial, and the more specific the required topic, the harder the search becomes. Furthermore, detectors typically produce more false positives and false negatives when deployed in a scene different from the one used during training. It therefore makes sense to combine several datasets that simulate different scenarios. With this in mind, a collection process was carried out to gather enough data. The academic community was one of the sources considered, although existing publications usually do not release the datasets and resources they use. Nevertheless, one of the first datasets selected for this work was the Gun Movies Database [10], which contains frames of an individual holding a handgun while walking through a room in several positions. Since handgun labels were not available online, manual labeling was performed. The camera in this dataset is fixed and the background does not change, so data augmentation was applied to introduce variability. A total of 181 augmented frames were randomly selected from over 800 initially labeled frames in order to avoid nearly identical consecutive frames.
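Neither the sampling nor the augmentation procedure is fully specified above, so the following sketch only illustrates the idea: frames are sampled with a minimum index gap to avoid near-duplicates, and simple augmentations (horizontal flipping and brightness jitter, chosen here purely as examples) introduce variability. The file-naming scheme and all parameter values are assumptions.

```python
import random
from pathlib import Path

import cv2


def sample_spaced_frames(frame_dir, n_samples=181, min_gap=3, seed=42):
    """Randomly sample frames, discarding those too close in index to an
    already selected frame, to avoid near-identical consecutive frames."""
    frames = sorted(Path(frame_dir).glob("*.png"))
    rng = random.Random(seed)
    rng.shuffle(frames)
    selected, used_indices = [], set()
    for frame in frames:
        idx = int(frame.stem.split("_")[-1])  # assumes names like frame_0001.png
        if any(abs(idx - u) < min_gap for u in used_indices):
            continue  # too close to a frame we already kept
        used_indices.add(idx)
        selected.append(frame)
        if len(selected) == n_samples:
            break
    return selected


def augment(image):
    """Illustrative augmentations only: horizontal flip and brightness jitter.
    Note that bounding-box labels must be flipped accordingly (omitted here)."""
    if random.random() < 0.5:
        image = cv2.flip(image, 1)  # horizontal flip
    alpha = random.uniform(0.8, 1.2)  # brightness/contrast scaling factor
    return cv2.convertScaleAbs(image, alpha=alpha, beta=0)
```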
In addition, another obvious source of CCTV-like images was considered: the Internet. Many video platforms host recordings of people training with real guns, so a set of such videos was downloaded and labeled for use in this work. Unfortunately, many of the extracted frames are blurry due to low-quality cameras, so an appropriate frame selection process was also required. A set of 837 frames was obtained from eleven videos downloaded from YouTube.
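The text does not state how blurry frames were filtered out; one common heuristic, shown here only as a plausible stand-in, is the variance-of-the-Laplacian focus measure, where a low variance indicates few sharp edges:

```python
import cv2


def is_sharp(image_path, threshold=100.0):
    """Variance-of-Laplacian focus measure: low variance suggests blur.
    The threshold is a hypothetical value and must be tuned per camera."""
    gray = cv2.imread(str(image_path), cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(gray, cv2.CV_64F).var() > threshold
```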
In any case, although many videos are available on the Internet, the camera is usually positioned in first person, so full bodies and skeleton poses are not as visible as they would be from a surveillance camera. Our requirement of varied poses while holding a handgun led us to resort to synthetic images. Although the synthetic generation of images for handgun detection has already been studied [21], the creation of realistic skeleton animations is out of the scope of this work. Therefore, videogames such as first-person shooters were considered as a source of images with handgun-holding poses and a high level of realism.
An automatic mechanism for image retrieval was needed to obtain a large number of frames with different poses. The NVIDIA Ansel technology was chosen for this task since, according to its official website, it 'is a powerful photo mode that lets you take photographs of your games' and allows 360-degree movement. This technology can pause a supported game, move the camera freely around the scene and save a picture of it. However, the list of supported games is rather limited, so Watch Dogs 2 was selected, as it offers a high level of realism, a choice of different guns and different shooting poses that can be recorded. The Ansel technology allowed the acquisition of eight videos with different poses and camera angles, which were then labeled and integrated with the previous datasets. Examples of the final dataset are shown in Fig. 1.
Human pose acquisition
Once all the images had been gathered and labeled, the additional information required for this work was the human pose estimated on each image.
There are two typical approaches to human pose estimation, top-down and bottom-up, depending on the order in which detection is performed:
-
Top-down approaches sequentially combine a person detector with a single-person pose estimator, which is applied to each detection from the first step.
-
Bottom-up approaches follow the inverse order: all body parts are detected directly on the image, and a grouping step then associates limbs belonging to the same person.
Top-down approaches are easier to implement, as they avoid the grouping algorithm required by bottom-up methods. However, it is hard to say which of the two approaches provides better results.
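The operational difference between the two families can be illustrated with a short sketch; `detect_people`, `estimate_pose`, `detect_parts` and `group_parts` are hypothetical placeholders rather than any particular library:

```python
def top_down(image, detect_people, estimate_pose):
    """Top-down: run a single-person pose estimator on each person crop,
    so the cost grows linearly with the number of people."""
    poses = []
    for (x, y, w, h) in detect_people(image):
        crop = image[y:y + h, x:x + w]  # person region
        poses.append(estimate_pose(crop))
    return poses


def bottom_up(image, detect_parts, group_parts):
    """Bottom-up: detect every body part in one pass, then group parts into
    people (e.g. via Part Affinity Fields in OpenPose), so the cost is
    roughly independent of the number of people."""
    parts = detect_parts(image)       # all keypoints, all people at once
    return group_parts(parts, image)  # association / grouping step
```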
Estimators can also be distinguished by their output. All of them provide the keypoints of the limbs detected in the image, but they differ in the dimensionality of those keypoints: some produce a two-dimensional output, while others estimate the 3D position of each keypoint.
In our case, OpenPose [17] was selected as the body pose estimator because of its speed and its ability to detect up to 25 pose keypoints per human body in the image. This estimator follows a bottom-up approach: it detects limb keypoints in the image and then uses a set of convolutional layers to predict Part Affinity Fields (PAFs), which encode the association between the detected parts. A bottom-up approach has the advantage that its execution speed is independent of the number of individuals in the image, as opposed to top-down approaches, which run the pose estimator once per detected person.
OpenPose offers both 2D and 3D estimation of the human pose, but the latter requires multiple views of the same scene. Other architectures, such as VNect, provide a three-dimensional output from monocular images, but 2D estimation was chosen for this project.
As a result, OpenPose was applied to all the images in the dataset. This step stores the keypoints detected in each image in JSON format, as well as the PAF maps as grayscale images.
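A minimal sketch of consuming that output, assuming OpenPose's standard BODY_25 JSON layout (a `people` list whose `pose_keypoints_2d` field flattens the 25 keypoints as x, y, confidence triplets):

```python
import json

import numpy as np


def load_body25_keypoints(json_path):
    """Parse one OpenPose output file into per-person (25, 3) arrays of
    (x, y, confidence), following the BODY_25 JSON layout."""
    with open(json_path) as f:
        data = json.load(f)
    people = []
    for person in data.get("people", []):
        kps = np.array(person["pose_keypoints_2d"], dtype=np.float32)
        people.append(kps.reshape(25, 3))  # 25 keypoints x (x, y, c)
    return people
```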
Dataset split
The dataset was split into training, validation and test subsets. A special consideration was taken into account: the test subset should simulate the original problem of running the detector in a scene different from the ones it was trained on. Since the dataset is composed of many videos with different scenarios, the videos were separated according to this principle, while balancing the number of synthetic and real images in both the training and test sets.
-
Training and validation The YouTube videos and half of the eight videos obtained from the Watch Dogs 2 videogame were selected for training the models. In each video, 60% of the frames were dedicated to training and 20% to validation.
-
Test The remaining 20% of the frames of those videos was reserved for testing the models. The Gun Movies Database frames and the other half of the Watch Dogs 2 videos were reserved for the final comparison between the models.
To make the comparison as fair as possible, both the baseline detector and the combined detection and pose model are trained with the same images, and the performance of both is then evaluated on the reserved videos, which contain different scenarios. A visual representation of the dataset split is shown in Fig. 2.
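Whether the 60/20/20 split was contiguous within each video is not stated; the sketch below assumes a contiguous temporal split, which keeps near-duplicate neighboring frames inside a single subset. Held-out videos (the Gun Movies Database and the remaining Watch Dogs 2 videos) would bypass this function and go entirely to the test set.

```python
def split_video_frames(frames, train_frac=0.6, val_frac=0.2):
    """Split one video's ordered frame list into contiguous
    train/validation/test chunks (60/20/20 by default)."""
    n = len(frames)
    i = int(n * train_frac)
    j = int(n * (train_frac + val_frac))
    return frames[:i], frames[i:j], frames[j:]
```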