Sound of guns: digital forensics of gun audio samples meets artificial intelligence

Classifying a weapon based on its muzzle blast is a challenging task that has significant applications in various security and military fields. Most existing works rely on ad-hoc deployment of spatially diverse microphone sensors to capture multiple replicas of the same gunshot, which enables accurate detection and identification of the acoustic source. However, carefully controlled setups are difficult to obtain in scenarios such as crime scene forensics, making the aforementioned techniques impractical. We introduce a novel technique that requires zero knowledge about the recording setup and is completely agnostic to the relative positions of both the microphone and shooter. Our solution can identify the category, caliber, and model of the gun, reaching over 90% accuracy on a dataset composed of 3655 samples that are extracted from YouTube videos. Our results demonstrate the effectiveness and efficiency of applying Convolutional Neural Networks (CNNs) to gunshot classification, eliminating the need for an ad-hoc setup while significantly improving the classification performance.


I. INTRODUCTION
Gunshot analysis has received significant attention from both the military and scientific communities. Acoustic analysis of gunshots can provide useful information, such as the position of the shooter, the projectile trajectory, the caliber of the gun, and the gun model. Although acoustical evidence may significantly contribute to audio forensic reconstruction and analysis, the forensic analysis of gunshots is characterized by many challenges due to the broadcast and noisy nature of the acoustic channel.
Consider a scenario where a single microphone is deployed in the close neighborhood of the shooter. The recorded audio sample can be significantly affected by the environmental surroundings, such as trees, foliage, and buildings, which attenuate and reflect the main component of the shock wave. The resulting audio sample may feature different echoes of the gunshot, each characterized by an attenuation factor that depends on its path. This naive approach is impractical, which motivated the development of more complex ad-hoc acoustic data acquisition strategies over the last decade. To mitigate echoes and overcome the intrinsic lack of information of the single-microphone scenario, additional microphones are deployed. The comparison of multiple replicas of the same gunshot enables shooter localization and weapon identification. The physical characteristics of acoustic propagation can be exploited to infer the position of the shooter and the category of the gun. Multiple spatially diverse acoustic sensors enable the estimation of Angle of Arrival (AoA), Time of Arrival (ToA), and Time Difference of Arrival (TDoA). The obtained recordings can be modeled by geometrical acoustics, which enables the localization of the shooter. Furthermore, multiple replicas of the same acoustic source make it possible to filter out echoes and background noise affecting a subset of the deployed microphones, thus enabling a deep characterization in both the time and frequency domains.
Acoustic acquisition via Wireless Sensor Network (WSN) requires a specialized infrastructure overlay to enable sensor communication, data processing, and computation distribution. Solutions that rely on the spatial diversity provided by the WSN introduce several types of burdens. Firstly, each soldier has to carry a wearable device equipped with a microphone and other sensors, such as a compass, to collect meaningful information about AoA, ToA, and TDoA. Secondly, in a military scenario, the WSN should feature a jamming-resistant communication protocol and non-interfering radio channels. Both assumptions are difficult to achieve given the resource constraints of WSNs in terms of CPU, battery, and memory. In most cases, WSNs cannot afford the computational burden of multimedia processing. Therefore, the captured data should be first off-loaded to a remote server, then downloaded and distributed again. This represents a challenge from the connectivity perspective since, in many cases, military WSNs are unattended or provided with a discontinued link to the control center.
In this work, we do not rely on ad-hoc acquisition setups, but we exploit publicly available audio recordings of gunshots, considering their temporal and spectral representations. Spectral analysis of sound has been adopted in many contexts to detect and identify recurrent patterns. In particular, the combination of time-frequency decomposition of audio samples with Convolutional Neural Networks (CNNs) provides promising performance in detecting recurrent patterns. The CNN is trained over several "images" constituted by a three-dimensional representation of time, frequency, and amplitude. The result is a robust solution that can "recognize" the same sound by cross-matching similar images.
Contribution. We propose an inexpensive solution that is able to detect and identify gunshots without resorting to any ad-hoc infrastructure. Contrary to other studies, our solution requires only an audio sample of a gunshot that can be easily obtained by any commercially available microphone. Our approach is agnostic to the microphone position with respect to the shooter, and it does not require multiple spatially different replicas of the gunshot; we consider recordings from mono-channel setups with different sample rates. We proved the effectiveness of our solution by considering 3655 samples of gunshots constituted by 30 pistols, 18 rifles, and 11 shotguns for a total of 7 different calibers. The proposed approach guarantees an accuracy higher than 90% for all of the considered cases, namely, the category, model and caliber of the gun.
Paper organization. The remainder of this paper is organized as follows. Section II summarizes recent contributions in the field of weapon classification. Section III introduces the background concepts related to frequency domain analysis, CNNs, and acoustic characteristics of gunshots. Section IV describes our dataset and Section V discusses the dataset generation process. The neural network architecture is presented in Section VI. Section VII shows the performance of our solution. Finally, Section VIII draws some concluding remarks.

II. RELATED WORK
Firearm classification based on the acoustic evidence generated by its discharge has long been investigated, but not extensively studied in the literature. Proposed solutions vary in many aspects, including the source of acoustic data, the type of analysis applied, the type of features extracted, and the application area. Table I summarizes, according to these aspects, prior studies that provide gunshot classification and firearm identification.
The source of the data is characterized by the type, the quality, and the environmental conditions of the deployed audio recording setup, which defines the amount of information that can be leveraged for classification. Most of the gunshot recordings used in the literature are either obtained under carefully controlled conditions, where a distributed set of microphone sensors is deployed [1], [2], [3], or extracted from a conventional recording device in less controlled environments [4], [5], [6], [7], [8].
In the former case, where a Wireless Acoustic Sensor Network (WASN) is deployed, spatial information can be obtained by performing array processing and triangulation techniques. Direction of Arrival (DoA) and ToA estimation methods are applied to the obtained audio signals to determine the projectile speed and trajectory, as well as to infer the position of the shooter. Such information may also provide discriminant features, such as the bullet speed [1], that can be used to identify the firearm category. Furthermore, the distributed nature of the recording setup provides spatial diversity, where multiple acoustic observations of the same gunshot are obtained from different locations, which can be leveraged to increase the classification accuracy. Sánchez-Hevia et al. [3] exploited this feature and proposed a multi-observation weapon classification system that leverages various classifier ensembles to enhance classic decision fusion techniques. Each node in the sensor network produces a classification decision using Least Squares Linear Discriminant Analysis (LS-LDA). The decisions are later fused using a Maximum Likelihood-based fusion rule that weights the decision of each node based on its location.
The main constraint induced by this type of analysis is the requirement of spatial information, which can only be obtained by deploying a distributed sensor network. This limits the applicability of gunshot detection and firearm classification to carefully controlled recording setups only. Consequently, various pattern recognition approaches were proposed that identify the firearm category in the absence of spatial information. The most used classifiers for firearm identification are the Gaussian Mixture Model (GMM) [4], [5], [6] and the Hidden Markov Model (HMM) [7], [8].
Most of these approaches can be described as frame-based feature classification approaches [4], [5], [6], [8], where the time-domain acoustic signal is subdivided into a sequence of short-time windowed frames. From each frame, a set of predetermined features is extracted and used for gunshot classification. The most common extracted features are statistical measures of the spectrum and intensity of the signal, in addition to perceptual features such as Mel-Frequency Cepstrum Coefficient (MFCC) or Perceptual Linear Prediction Coefficients (PLP). Temporal features, such as energy and Zero Crossing Rate (ZCR), are also used, but only in conjunction with spectral or perceptual features.
Morton et al. [7] proposed an alternative classification approach that does not rely on frame-based features aiming to eliminate the dependency on performance-driven parameters, which are often optimized over a finite training set. They proposed modeling each firearm category as an HMM with AutoRegressive (AR) source densities using non-parametric Bayesian priors to allow automated model order selection. The AR defines a set of energy and spectral characteristics of the captured gunshot, while the HMM identifies the transitions of these states.
The aforementioned techniques may perform adequately in matched experimental conditions. However, their effectiveness could reduce significantly when capture conditions vary in challenging unstructured environments, where noise and distortion are present. Although Khan et al. [4] addressed this problem by using an exemplary embedding approach to bridge between varying recording conditions, the achieved classification accuracy is relatively low (i.e., 60-72%). The authors used a dataset of 100 gunshot samples obtained from 20 different firearm models, where each model is represented by 5 to 15 gunshot samples. The different conditions included in their experiments were simulated, namely, "Room Reverb", "Concert Reverb", and "Doppler Effect", which may not match real-life environmental conditions and do not include directional variations. Furthermore, their approach assumes prior knowledge of the recording conditions, which is not always available, especially in audio forensic reconstruction analysis. Our solution, being the only one that considers varying environmental conditions without requiring an ad-hoc setup, outperforms state-of-the-art studies in terms of dataset richness, including the number of gunshot samples and the range of weapon models, while reaching 90% accuracy.

A. Spectrogram
A spectrogram is one of the most widely adopted visual representations of the frequency spectrum of a signal over time. Being defined as an intensity plot of the Short-Time Fourier Transform (STFT) magnitude, a spectrogram is usually portrayed as a bi-dimensional graph, where one axis (usually the x-axis) represents time and the other axis (usually the y-axis) represents frequency. An example of a spectrogram is depicted in Fig. 1. Each intersection between time and frequency is assigned a color that refers to the Power Spectral Density (PSD) of that specific frequency at that particular time, which is considered a third dimension of the graph. To compute the spectrogram of a signal $y$, the signal is divided into shorter fixed-length segments $y_1, \ldots, y_n$, and the Fourier transform is applied separately to each segment. The spectrogram describes the changes of the signal frequency spectrum as a function of time. This implies that, if the time is discrete, the data to be transformed may be partitioned into overlapping frames. The STFT is applied to each of the frames and the result, consisting of both phase and magnitude for each intersection between time and frequency, is stored in a matrix, as shown in Equation 2.
The result consists of a bi-dimensional matrix that maps the audio frequencies to the time-localized points [9]. The visual representation of audio traces through spectrograms has been extensively leveraged in the literature in the context of audio classification [10], sound event classification [11], emotion recognition [12], human activity recognition [9], cross-modality feature learning [13], and gunshot classification [14].
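The framing-plus-transform procedure described above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this work: the Hann window, the frame length, the hop size, the unnormalized power scaling, and the names `stft_magnitude` and `to_db` are all assumptions made for the example. Rows of the returned matrix index time frames and columns index frequency bins.

```python
import cmath
import math

def stft_magnitude(signal, frame_len, hop):
    """Return one row per frame holding |DFT| for bins 0..frame_len//2."""
    # Hann window (an assumption) to limit leakage at the frame boundaries.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    # Frames overlap whenever hop < frame_len.
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(frame_len // 2 + 1):
            acc = sum(segment[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            row.append(abs(acc))
        frames.append(row)
    return frames

def to_db(frames, floor=1e-12):
    """Convert magnitudes to unnormalized power in dB (the plotted 'color')."""
    return [[10 * math.log10(max(m * m, floor)) for m in row] for row in frames]

# Toy input: a pure tone that falls exactly on DFT bin 8 of a 64-sample frame.
FRAME_LEN, HOP, TONE_BIN = 64, 32, 8
tone = [math.sin(2 * math.pi * TONE_BIN * n / FRAME_LEN) for n in range(256)]
spec = stft_magnitude(tone, FRAME_LEN, HOP)
spec_db = to_db(spec)
```

For the toy tone, every row of `spec` peaks at the bin of the tone, which is the horizontal line one would see in the corresponding spectrogram image.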

B. Convolutional Neural Network
A CNN belongs to the class of deep neural networks that have one or more convolutional layers (i.e., layers that perform convolution operations). A convolution is a linear operation that slides a filter of configurable size over the input representation (usually a visual image). The application of the same filter to different overlapping filter-sized portions of the input generates a feature map. There are several types of filters, also known as operators. Each filter tries to identify a specific feature within the input representation. For example, the Sobel, the Prewitt, and the Canny operators highlight edges, the Harris and the Shi and Tomasi operators highlight corners, etc. One of the most powerful features of CNNs, which is also the reason behind their wide adoption, is the ability to automatically apply an extensive number of filters to the input representation in parallel, thus highlighting specific features in every part of the input image simultaneously.
CNNs can be seen as regularized versions of multilayer perceptrons, where the regularization discourages learning overly complex models. While multilayer perceptrons use several fully connected layers (a layer is fully connected if all of its neurons are connected to all the neurons of the next layer), CNNs exploit a hierarchical structure that builds complex patterns out of small and simple ones. The dimension of the input image (in this case representing a handwritten digit) keeps decreasing while going deeper in the neural network, while the number of filters, and thus of the features the architecture aims to highlight, increases. A CNN usually has three types of layers: (i) convolutional layers, which perform the convolution operations on the input; (ii) pooling layers, which discretize the input and reduce the number of learnable parameters; and (iii) fully connected layers, which are essentially feed-forward neural networks, usually placed at the end of the architecture. The goal of the fully connected layers is to hold the high-level features found during the convolutions and try to learn non-linear combinations of these features before assigning the input image a label. Details about these layers contextualized in our model are provided in Section VI-A.
One of the fundamental decisions to be taken when designing a CNN, or more generally a neural network, concerns the representation of the input data. Several input representations are available in the literature, each bringing its advantages and drawbacks. Although for visual images the choice is straightforward, for audio samples numerous alternatives are possible, including MFCC, the raw digitized sample stream, machine-discovered features, and hand-crafted features. Even if the best input representation to adopt is strongly dependent on the problem to solve, several studies in the literature show that feeding CNNs with spectrograms is effective in many fields, including musical onset detection [16], human detection and activity classification [17], music classification [18], and other interesting activities [19].
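As an illustration of how sliding one filter over an image produces a feature map, the sketch below applies the horizontal Sobel kernel to a toy image with a vertical edge. The function name `filter2d` and the toy image are assumptions made for the example. As is customary in deep learning frameworks, the kernel is not flipped, so strictly speaking the operation is a cross-correlation.

```python
# Horizontal-gradient Sobel kernel: responds to vertical edges.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

def filter2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1); return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# Toy 5x6 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
feature_map = filter2d(image, SOBEL_X)
```

The feature map is zero over the flat regions and non-zero only where the filter window straddles the edge. A convolutional layer learns the kernel values instead of fixing them in advance, and applies many such kernels in parallel.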

C. Guns and Gunshots
Gunshots are the result of multiple acoustic events, namely, the muzzle blast created by the explosion inside the barrel and the ballistic shockwave that is generated by the supersonic projectile. These phenomena are the results of many characteristics and variables that eventually sum up and generate the acoustic blast, which include the firearm type, model, barrel length, ammunition type, powder quantity, weight and shape of the projectile, and possibly others. The aim of this work is to estimate to what extent a gunshot can be used as a fingerprint that uniquely identifies one or more of the aforementioned variables. Figure 3 summarizes the most important characteristics affecting the acoustic blast generated by a gun. Our observation is that different configurations of the aforementioned parameters may lead to unique gunshot patterns that can be detected by analyzing the frequency-time decomposition of the gunshot blast. In the next sections, we demonstrate how Convolutional Neural Networks (CNNs) can be effectively used to detect these patterns, thus uniquely identifying the category of gun, the caliber, and finally, the model of the gun. Table II shows the dataset considered in this work. We collected the samples from videos of several YouTube channels, such as C4Defense, hickok45, EmanuelRJSniper, mixup98, OneGear, and ReloaderJoe. Our choice of guns takes into account two main aspects: the Category of Guns and the Caliber.

IV. DATASET DESCRIPTION
Category of guns. We considered 30 different pistols, 18 rifles, and 11 shotguns. As for pistols, we considered 22 revolvers and 10 semiautomatic pistols.
Caliber. We took into account the most popular calibers in the U.S. and worldwide [20], [21], such as 9mm and .45acp for automatic pistols, .44M and .357M for revolvers, 7.62x39 and 5.56NATO for rifles, and 12 gauge for shotguns.

A. Muzzle blast: preliminary considerations
When a gun is fired, there are two distinct acoustic phenomena: the muzzle blast and the ballistic shockwave [22]. The latter is generated by the bullet, which compresses the air in front of itself and creates a sonic boom that propagates as a cone whose vertex is the bullet itself. Conversely, the muzzle blast is a high-energy acoustic signal that originates at the muzzle of the gun and propagates at the speed of sound with a spherical wavefront centered on the muzzle. The ballistic shockwave is a very important source of information to locate a sniper in an open field [23], [24]. However, to achieve that, the ballistic shockwave has to be sampled from different locations, which requires an array of microphones. Moreover, the ballistic shockwave cannot be observed for subsonic projectiles such as those used in shotguns and pistols.
Given the aforementioned considerations, we focus on the muzzle blast and the echoes associated with it. In the following, we discuss and highlight three critical parameters that have to be carefully set in order to maximize the detection performance of a neural network: (i) the muzzle blast duration, (ii) the number of frequency bins, and (iii) the number of time slots. Figure 4 shows the acoustic signal amplitude recorded from a Beretta PX4 Storm, 9mm. The muzzle blast lasts for a few milliseconds (up to 5 ms in the figure), depending on the model of gun and caliber. We also observe some echo effects (Echo 1, Echo 2, and Echo 3) at 10 ms, 22 ms, and 63 ms due to reflections of the sound from obstacles around the shooter. We highlight that this is consistent with previous findings from other studies [22], and that the muzzle blast duration is a critical parameter for the analysis carried out in this work. Figure 5 shows the PSD as a function of time and frequency (spectrogram) associated with the muzzle blast in Fig. 4. We consider both the bi-dimensional and the three-dimensional representation of the spectrogram. We observe that the muzzle blast (time less than 5 ms) occupies all the frequency components between 0 and 24 kHz, with a significant power spanning between -30 dB (lower frequencies) and -80 dB (higher frequencies). As soon as the blast finishes, the echoes occupy the frequencies below 18 kHz with a decreasing power between -40 dB and -60 dB. The aforementioned spectrogram components constitute the input for the training process of our neural network.
We identify two more critical parameters affecting our algorithm performance: the number of frequency bins and the number of time slots. For our analysis, we adopted the

B. Quality of the audio samples
In the following, we provide a quantitative analysis of the quality of the collected audio samples. As a quality metric, we consider the Signal-to-Noise Ratio (SNR) computed on each muzzle blast over a period of 400 ms from the actual start of the blast. For each audio sample, we consider a pre-defined reference noise pattern constituted by random samples of amplitude 0.1, i.e., one-tenth of the maximum signal amplitude that can be captured by the microphone. This sound pressure is equivalent to a typical background noise that can be sampled from an outdoor environment characterized by a gentle wind. Figure 6 shows the probability distribution function associated with the SNR computed as described above. The overall audio quality is very high, since the muzzle blast is more than 20 dB above the reference noise pattern. We observe that even the echoes can be easily distinguished from the noise reference.
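The SNR metric described above can be sketched as a ratio of mean powers expressed in dB. The snippet is only an illustration under stated assumptions: the text does not specify the distribution of the reference noise, so a uniform distribution in [-0.1, 0.1] is assumed here, and the helper names (`blast_window`, `reference_noise`, `snr_db`) are hypothetical.

```python
import math
import random

def mean_power(samples):
    """Average of the squared amplitudes."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB as a ratio of mean powers."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))

def blast_window(samples, onset, rate, duration=0.4):
    """Slice `duration` seconds (400 ms by default) starting at the blast onset."""
    return samples[onset:onset + round(duration * rate)]

def reference_noise(n, amplitude=0.1, seed=0):
    """Assumed noise model: uniform random samples in [-amplitude, amplitude]."""
    rng = random.Random(seed)
    return [rng.uniform(-amplitude, amplitude) for _ in range(n)]
```

As a sanity check, a full-scale constant signal compared against a constant reference of amplitude 0.1 gives a power ratio of 100, that is, an SNR of 20 dB.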

V. DATASET GENERATION
We generated a dataset of 3655 samples extracted from videos found on YouTube. Each of the collected audio samples has a sample rate of either 48000 or 44100 samples per second. Generating a dataset of gunshots extracted from YouTube videos involves the following steps:
• Audio extraction. We performed the audio extraction (MP3 format) from the selected videos using the youtube-dl [25] and ffmpeg [26] tools.
• Abrupt change detection. A preliminary filtering is performed by identifying abrupt changes in the audio signal.
• Gunshot detection. Gunshots are detected among blasts by relying on a Support Vector Machine (SVM) learning algorithm.
In the following, we describe the procedure of automatically extracting gunshots from an audio trace focusing on Blast detection and Gunshot detection.

A. Identification of abrupt changes in an audio trace
To detect abrupt changes in an audio trace, we computed the variance over a sliding window of 5 ms, equivalent to either 220 or 240 samples depending on the quality of the audio trace, i.e., 44100 or 48000 samples per second, respectively. Subsequently, we searched for the peaks adopting windows of size 0.3 seconds and a minimum peak prominence of 0.3. Fig. 7 shows the three computation stages, from the sound pressure, through the moving variance, to the detected blast sequences. This figure refers to two sound chunks extracted from an audio trace, where the first part (i.e., 0 ≤ t ≤ 5.5 seconds) is a sequence of gunshots, while the second part (i.e., t > 5 seconds) is mainly constituted by voice. We stress that the main aim of this part is to detect abrupt changes in the sound pressure, while subsequently we will show how gunshots are identified.
Fig. 7: Detection of abrupt changes in audio traces: from sound pressure to abrupt change detection by computing moving variance and peak detection.
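A moving variance followed by peak picking can be sketched in plain Python as follows. This is an illustration, not the code used to build the dataset: the prominence definition is the usual topographic one, the 0.3 s window is interpreted here as a minimum distance between peaks, and the toy signal, its 1000 samples-per-second rate, and all function names are assumptions.

```python
def moving_variance(samples, window):
    """Population variance of every length-`window` slice, via running sums."""
    out, s, s2 = [], 0.0, 0.0
    for i, x in enumerate(samples):
        s += x
        s2 += x * x
        if i >= window:                      # drop the sample leaving the window
            old = samples[i - window]
            s -= old
            s2 -= old * old
        if i >= window - 1:
            mean = s / window
            out.append(max(s2 / window - mean * mean, 0.0))
    return out

def prominence(values, i):
    """Height of peak i above the higher of the two valleys that separate it
    from the nearest taller samples (or from the ends of the trace)."""
    left_min = right_min = values[i]
    j = i - 1
    while j >= 0 and values[j] <= values[i]:
        left_min = min(left_min, values[j])
        j -= 1
    j = i + 1
    while j < len(values) and values[j] <= values[i]:
        right_min = min(right_min, values[j])
        j += 1
    return values[i] - max(left_min, right_min)

def find_peaks(values, min_distance, min_prominence):
    """Local maxima that are prominent enough and at least `min_distance` apart."""
    candidates = [i for i in range(1, len(values) - 1)
                  if values[i - 1] < values[i] >= values[i + 1]
                  and prominence(values, i) >= min_prominence]
    kept = []
    for i in sorted(candidates, key=lambda k: values[k], reverse=True):
        if all(abs(i - k) >= min_distance for k in kept):
            kept.append(i)
    return sorted(kept)

# Toy trace at 1000 samples/s: faint background plus two 5 ms bursts.
RATE = 1000
trace = [0.01 if n % 2 == 0 else -0.01 for n in range(1000)]
for onset in (200, 700):
    trace[onset:onset + 5] = [1, -1, 1, -1, 1]
variance = moving_variance(trace, window=5 * RATE // 1000)           # 5 ms
peaks = find_peaks(variance, min_distance=int(0.3 * RATE), min_prominence=0.3)
```

Each index in `peaks` is the first sample of the window with the highest variance, so for the toy trace it coincides with the onset of a burst. With real recordings the prominence threshold depends on how the amplitude is normalized.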

B. Gunshot detection
Gunshot detection is performed via a human-assisted supervised learning approach. The intention is to have a growing training set of actual gunshots that is supervised by the user. The user checks for both false positives and false negatives by listening to the newly generated samples in the training set. Figure 8 shows the training, validation, and testing procedures. We assume that the training set is populated with an initial dataset of actual gunshots that have been manually selected. In our case, we started from an initial dataset of only 10 gunshot samples. At each cycle, a new model is trained with the current training set (Step 1 in Fig. 8). Subsequently (Step 2 in Fig. 8), new samples are selected from the list that is generated by the procedure presented in Section V-A. Finally, the generated samples are tested with the current training set. The output is assessed by the supervisor (Step 3 in Fig. 8), and the verified samples are added to the training set (Step 4 in Fig. 8).
Classification performance. To assess the quality of the classification procedure, we considered 6 additional videos (V1, ..., V6) downloaded from YouTube, which are not included in the training set. For each video, we detected the abrupt changes according to the procedure presented in Section V-A, and we executed the gunshot detection procedure presented in Fig. 8. As for the training set, we considered the one we generated from the samples found in Table II. Figure 9 shows the frequency of the similarity indexes provided by the SVM classifier for the Shot and No-Shot audio samples with red crosses and green circles, respectively. The similarity indexes were grouped into bins of width 10, where each cross/circle aggregates adjacent similarity indexes. Figure 9 represents the decision after one iteration of the procedure presented in Fig. 8. The decision Shot vs No-Shot is taken as a function of the threshold Thr, which has been empirically set to zero.
We observed that 96% of the No-Shot samples feature a similarity index of -189.5, while the remaining 4% are spread between -178.9 and -0.55. There are no samples from the No-Shot class with a similarity index that is greater than 0. As for the Shot class, the samples are distributed between 0.41 and 275, with frequencies between 1% and 11%. Even in this case, we highlight that there are no samples from the Shot class with a similarity index that is less than 0.
To precisely assess the effectiveness of our solution, we manually checked all of the classified samples, namely, Shot vs No-Shot. Table III shows the result of our analysis. For each video, we report the number of detected abrupt changes (N), the threshold used by the SVM classifier (Thr = 0), True Positive (TP), False Positive (FP), True Negative (TN), False Negative (FN), the actual number of gunshots (Actual), and the overall accuracy of the detection algorithm. As previously stated, during our evaluation, we considered only one iteration, as depicted in Fig. 8. We would like to highlight that the proposed algorithm achieves the main purpose of generating a dataset of gunshot samples (i) in a fast and efficient way and (ii) with the minimum amount of false positives. The output of this phase will be the training set to be used by the CNN. At this stage, we aim at minimizing the number of FP, which might bias the subsequent training process. We also aim at maximizing the efficiency of the process of creating a large dataset of gunshot samples. Therefore, the task of the supervisor reduces mainly to listening to very few samples (TP + FP), regardless of the dataset size N, in order to remove the FP, which are overall very few: only 2 out of 4931 samples. Conversely, we observe that our approach might lose some good samples (FN = 16 + 4). However, these samples do not affect the performance of our solution, so we do not consider their loss significant.
The above procedure has been applied to each audio sample found in Table II to generate a dataset of actual gunshot samples, that is, one dataset for each gun model. Figure 10 depicts the overall architecture of our CNN consisting of five layers with weights: the first four are convolutional layers, while the last one is a fully connected layer. The output of the fully connected layer is fed to a 7-way softmax that outputs the probability distribution over the 7 class labels. The details of our architecture, including information about the layers and their learnable parameters, are reported in Appendix A.
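The bookkeeping behind such an evaluation can be sketched as follows: each similarity index is compared with the threshold Thr = 0, and the decisions are tallied against the manually verified labels. The scores and labels below are hypothetical and chosen only to exercise the code; they are not taken from Table III, and the function names are assumptions.

```python
def confusion_counts(similarity_indexes, is_shot, threshold=0.0):
    """Tally TP, FP, TN, FN for the decision `index > threshold` means Shot."""
    tp = fp = tn = fn = 0
    for score, truth in zip(similarity_indexes, is_shot):
        predicted = score > threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Fraction of abrupt changes that were labeled correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical scores and ground-truth labels, for illustration only.
scores = [275.0, 0.41, -0.55, -189.5, -178.9]
labels = [True, True, False, False, True]   # last one: a missed gunshot
tp, fp, tn, fn = confusion_counts(scores, labels)
```

Only the samples counted in TP + FP need to be listened to by the supervisor, which is why keeping FP small matters more at this stage than recovering every FN.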

VI. OVERALL ARCHITECTURE
Considering the dimension of the starting image and the need to give importance also to peripheral pixels, every convolution of our architecture makes use of padding to avoid losing information. By adding additional pixels to the border, every convolution layer outputs an image with the same number of pixels as the one fed into that layer. Furthermore, in our CNN architecture, we make use of a stride of 1 during convolutions, and a stride of 2 during the Max Pooling application. The stride is a critical hyperparameter in the context of CNNs, as it specifies the number of cells by which filters (e.g., convolution filters, pooling filters) slide over the image. If the stride is equal to 2, the filter starts from the top left corner and moves over the image with jumps of 2 units at a time. By considering square filters (i.e., $f \times f$) and square initial images (i.e., $n \times n$), after having specified the dimension $f$ of the filters (either convolutional or pooling filters), the stride parameter $s$, the dimension $n$ of the initial images, and the padding $p$, it is possible to calculate the side length $o$ of the square output image of a layer as:
$$o = \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1$$
Our choice to keep a unit stride during the convolutions and a stride equal to 2 during the pooling is guided by the intention of not losing information during convolution phases, while exploiting the pooling technique to summarize the features, thus reducing the input dimensionality.
The first convolutional layer filters the 36 x 99 x 1 spectrogram image with 40 kernels of size 3 x 3 x 1, with unit stride and with 'same' padding. The second convolutional layer takes as input the normalized (40 channels) and pooled (3x3 max pooling, stride = 2) output of the first convolutional layer and filters it with 80 kernels of size 3 x 3 x 40. The third convolutional layer takes as input the normalized (80 channels) and pooled (3x3 max pooling, stride = 2) output of the second convolutional layer and filters it with 160 kernels of size 3 x 3 x 80. The fourth convolutional layer takes as input the normalized (160 channels) and pooled (3x3 max pooling, stride = 2) output of the third convolutional layer and filters it with additional 160 kernels of size 3 x 3 x 160. The normalized (160 channels) and pooled (1x13 max pooling, stride = 2) output of the fourth convolutional layer is fed to a 7-neuron fully connected layer that, in turn, outputs the result to a 7-way softmax that produces a distribution over the 7 class labels.
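The standard relation between the input size n, the filter size f, the stride s, and the padding p is floor((n + 2p - f) / s) + 1. The short sketch below evaluates it for a few cases; the function names are hypothetical, and the 36 and 99 values are simply the two sides of the spectrogram image, treated one dimension at a time.

```python
def output_size(n, f, s=1, p=0):
    """Side length of a layer's output: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Padding that preserves the size for an odd filter size and unit stride."""
    return (f - 1) // 2

# A 3x3 convolution with unit stride and 'same' padding keeps the size.
conv_height = output_size(36, f=3, s=1, p=same_padding(3))
conv_width = output_size(99, f=3, s=1, p=same_padding(3))

# An unpadded 3x3 pooling with stride 2 roughly halves it.
pooled_height = output_size(36, f=3, s=2, p=0)
```

With unit stride and a padding of (f - 1) / 2 the formula returns n itself, which is why the convolutions preserve the image size and only the pooling layers shrink it.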
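The layer-by-layer description above can be checked with a small shape trace. The sketch relies on assumptions that the text does not state: the three intermediate 3x3 pooling layers are taken to be 'same'-padded, so that each side becomes ceil(n / 2), and every kernel is given one bias term when counting weights. The final 1x13 pooling and the fully connected layer are left out because their exact output size is not specified. All names are hypothetical.

```python
import math

INPUT_SHAPE = (36, 99, 1)            # height x width x channels of the spectrogram
CONV_KERNELS = [40, 80, 160, 160]    # kernels in the four convolutional layers

def trace_shapes(input_shape, conv_kernels):
    """Follow the tensor shape through conv ('same', stride 1) and pool (stride 2)."""
    h, w, c = input_shape
    shapes = [("input", (h, w, c))]
    for idx, kernels in enumerate(conv_kernels, start=1):
        c = kernels                                  # 'same' conv keeps h and w
        shapes.append((f"conv{idx}", (h, w, c)))
        if idx < len(conv_kernels):
            # Assumed 'same'-padded pooling with stride 2.
            h, w = math.ceil(h / 2), math.ceil(w / 2)
            shapes.append((f"pool{idx}", (h, w, c)))
    return shapes

def conv_parameters(input_channels, conv_kernels, k=3):
    """Weights plus one (assumed) bias per kernel for each k x k convolution."""
    counts, c_in = [], input_channels
    for kernels in conv_kernels:
        counts.append(k * k * c_in * kernels + kernels)
        c_in = kernels
    return counts

shapes = trace_shapes(INPUT_SHAPE, CONV_KERNELS)
parameters = conv_parameters(INPUT_SHAPE[2], CONV_KERNELS)
```

Under these assumptions the fourth convolution sees a 5 x 13 x 160 input, whose width matches the 1x13 window of the last pooling layer, which suggests that the last pooling summarizes the whole remaining width of the feature map.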

A. CNN Details
Activation Function. Our neural network relies on the Rectified Linear Units (ReLU) activation function [27] after each convolution. The ReLU activation function, whose operation is simplified in equation 4, outputs the maximum value between zero and the input value.
Although the literature uses other variants (e.g., Tanh, Soft-Sign, Sigmoid), several studies show that ReLU outperforms the competitors in terms of performance [28], [29]. Regularization. Our neural network relies on Dropout [30] regularization to reduce the likelihood of overfitting. The Dropout regularization technique allows to randomly cut out units (together with their connections) from the neural network during the training phase with a given probability. This discourages neurons to rely on the presence of particular other neurons and forces them to find more robust features with different ones [29], thus reducing the probability of learning the training set by heart. Normalization. Training neural network without normalization brings to the internal covariate shift phenomena, where the distribution of each layer's input change during training, thus requiring a more sophisticated tuning of the parameters. To mitigate this issue we add Batch Normalization layers after each convolution. The Batch Normalization technique [31] performs the normalization for each training mini-batch, allowing the usage of higher learning rates and reducing the need for a cherry-picking tuning of the parameters. As summarized in the equation 7, Batch Normalization normalizes the output of an activation layer by subtracting the mean and dividing by the standard deviation of the batch. Given a mini-batch β = x 1 , ..., x m : Although the Batch Normalization technique brings a slight regularization effect to the neural network, in some cases eliminating the need for Dropout [31], we find that the combined use of the Batch Normalization and Dropout aids generalization [29]. Discretization. The application of discretization techniques to an input representation consists of reducing its dimensionality to evaluate the features within the obtained, summarized subregions. 
This process mitigates overfitting on the training set and reduces the number of parameters to be learned, thus reducing the overall computational cost. To attain these benefits, our architecture applies a Max Pooling layer, a sample-based discretization process, after each activation layer. Max Pooling applies a max filter to non-overlapping sub-regions of the input feature map, whose size is dictated by the size of the filter. As the filter moves over a sub-region, it outputs the maximum value of that sub-region.
Output. As for the output layer, our neural network architecture relies on the commonly used softmax function. The softmax function takes as input a vector of real numbers and produces a probability distribution proportional to the exponentials of the input numbers. In detail, the input real numbers are mapped to values in the (0,1) interval that sum up to one, so that the output of the softmax function can be treated as a set of probabilities. In general, given a vector of real numbers $v = (v_1, \ldots, v_K) \in \mathbb{R}^K$, the standard (unit) softmax function $\sigma : \mathbb{R}^K \to \mathbb{R}^K$ is defined by:
$$\sigma(v)_i = \frac{e^{v_i}}{\sum_{j=1}^{K} e^{v_j}}, \quad i = 1, \ldots, K \quad (8)$$

B. Learning Details
Optimizer. An optimizer is an algorithm (or a method) used to tune the parameters of a neural network with the goal of reducing the loss function. In our architecture, we rely on the Adaptive Moment Estimation (Adam) optimizer [32], an extensively adopted optimizer that inherits the advantages of both Root Mean Square Propagation (RMSProp) and Stochastic Gradient Descent (SGD) with momentum (i.e., SGD where each update is a linear combination of the current gradient and the previous updates). From RMSProp it inherits the use of squared gradients to scale the learning rate, while from SGD with momentum it inherits the moving average of the gradients.
An empirical analysis conducted in [32] shows that Adam outperforms the other optimizers considered there, thus working better in practice. As recommended in the original paper (whose algorithm is reported below with our parameters), in our implementation we set the gradient decay factor $\beta_1$ to 0.9, the squared gradient decay factor $\beta_2$ to 0.999, and the denominator offset $\epsilon$ (which avoids divisions by zero) to $10^{-8}$. However, although the original paper recommends an initial learning rate of $10^{-3}$, we empirically found (relying on the grid search hyperparameter tuning technique) that setting this value to $x \cdot 10^{-4}$, $x \in [1, 3]$, provides better results.
Algorithm 1: Adam Optimizer [32]
Require: α: Stepsize
Require: f(θ): Stochastic objective function
Require: θ_0: Initial parameter vector
  m_0 ← 0 (Initialization of the 1st moment vector)
  v_0 ← 0 (Initialization of the 2nd moment vector)
  β_1 ← 0.9 (Initialization of the gradient decay factor)
  β_2 ← 0.999 (Initialization of the squared gradient decay factor)
  t ← 0 (Initialization of the timestep)
  ε ← 10^-8 (Initialization of the denominator offset)
  α ← x · 10^-4, x ∈ [1, 3] (Initialization of the learning rate)
  while θ_t is not converged do
    t ← t + 1
    g_t ← ∇_θ f_t(θ_{t-1})
    m_t ← β_1 · m_{t-1} + (1 − β_1) · g_t
    v_t ← β_2 · v_{t-1} + (1 − β_2) · g_t^2
    m̂_t ← m_t / (1 − β_1^t)
    v̂_t ← v_t / (1 − β_2^t)
    θ_t ← θ_{t-1} − α · m̂_t / (√v̂_t + ε)
  end while
  return θ_t
Number of Epochs. An epoch is defined as a single pass through the training set, i.e., one forward pass and one backward pass for all the training samples. We empirically set the maximum number of epochs to 50, since additional epochs did not bring any further benefit to the learning of our model.
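The update rule of Algorithm 1 can be illustrated with a short Python sketch. This is not the implementation used in this work (which relies on Matlab); the function names `adam_minimize` and `toy_grad`, the toy quadratic objective, and the fixed number of steps that replaces the convergence test are all assumptions made for illustration. The defaults for β_1, β_2, and ε follow the values stated above.

```python
import math


def adam_minimize(grad, theta0, alpha=1e-4, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Apply `steps` Adam updates to a list of parameters.

    `grad` maps a parameter list to the gradient of the objective.
    A fixed step count stands in for the convergence test of Algorithm 1.
    """
    theta = list(theta0)
    m = [0.0] * len(theta)  # 1st moment vector
    v = [0.0] * len(theta)  # 2nd moment vector
    for t in range(1, steps + 1):
        g = grad(theta)
        for i, g_i in enumerate(g):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g_i
            v[i] = beta2 * v[i] + (1.0 - beta2) * g_i * g_i
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias-corrected 1st moment
            v_hat = v[i] / (1.0 - beta2 ** t)  # bias-corrected 2nd moment
            theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def toy_grad(theta):
    """Gradient of the hypothetical objective (t0 - 3)^2 + (t1 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


# A larger step size than the one reported in the text is used here so that
# the toy problem gets close to its minimum within a few thousand updates.
theta_star = adam_minimize(toy_grad, [0.0, 0.0], alpha=1e-2, steps=5000)
```

On the first update the bias-corrected moments reduce to the gradient and its square, so each parameter moves by approximately α in the direction opposite to its gradient, regardless of the gradient magnitude.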
Mini-Batch Size. Using mini-batches consists of processing small subsets of training samples in every iteration, instead of processing them all together. The choice of the mini-batch size (i.e., the number of training samples processed per iteration) does not affect the performance of the model in terms of accuracy, but affects the resources required during the training process. A larger mini-batch size requires more memory and takes more time per iteration, but better exploits vectorization (i.e., processing many samples at once with array operations), while a smaller mini-batch size requires less memory but loses the speed-up given by vectorization. In our model, we set the mini-batch size to 8, to make the best use of the resources of our server.
Shuffle. The "shuffle" option shuffles the order in which training samples are fed to the model, with the goal of reducing variance and, in turn, overfitting. Shuffling the training samples becomes crucial when mini-batches are used, since batches containing highly correlated samples would slow down (or, in many cases, compromise) the training of the model. In our model, we shuffle the training data before each training epoch, as well as the validation data before each validation.
Plot. The "plot" option in Matlab provides several pieces of information to be taken into account during the training process. This information includes, but is not limited to, the mini-batch training loss and accuracy, the smoothed training loss and accuracy (i.e., the result of applying a smoothing algorithm to the training loss and accuracy), the validation loss and accuracy, and the hardware resources.
Validation Data. The validation data, also known as the validation set, is a subset of samples held out from the training set, which the model relies on to evaluate the effectiveness of its training. In our case, following the 80/20 rule, the validation set amounts to 20% of the whole dataset.
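Validation Frequency. The validation frequency represents the number of iterations between evaluations of validation metrics. We empirically set this value to $\frac{|\text{training set}|}{\text{miniBatchSize}}$, i.e., the number of iterations in one epoch, so that the validation metrics are evaluated once per epoch.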
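The interplay between the 80/20 split, the per-epoch shuffle, the mini-batch size, and the validation frequency can be sketched as follows. The helper names, the dataset size of 1000 samples, and the choice of dropping an incomplete final mini-batch are assumptions of this sketch, not details taken from our Matlab setup.

```python
import random


def train_val_split(n_samples, val_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out `val_fraction` of them."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


def epoch_batches(train_indices, batch_size, rng):
    """Reshuffle the training indices and yield full mini-batches only."""
    order = list(train_indices)
    rng.shuffle(order)
    for start in range(0, len(order) - batch_size + 1, batch_size):
        yield order[start:start + batch_size]


def validation_frequency(n_train, batch_size):
    """Iterations between validations: |training set| / miniBatchSize."""
    return n_train // batch_size


# Hypothetical dataset size, chosen only to keep the arithmetic easy to follow.
train_idx, val_idx = train_val_split(1000)
freq = validation_frequency(len(train_idx), batch_size=8)

rng = random.Random(1)
first_epoch = list(epoch_batches(train_idx, 8, rng))
second_epoch = list(epoch_batches(train_idx, 8, rng))
```

With 800 training samples and a mini-batch size of 8, each epoch consists of 100 iterations, so validating every 100 iterations amounts to one validation pass per epoch.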

A. Category of Gun Identification
In this section, we use the neural network previously introduced to infer the category of the gun. Starting from Table II, we divide the dataset into three classes, namely, Pistols, Rifles, and Shotguns, according to the gun models in the dataset. Figure 11 shows the confusion matrix computed as the average of 50 training and validation runs. The accuracy acc can be computed according to Eq. 9:
$$acc = \frac{1}{N} \sum_{i=1}^{N_C} x_{ii} \quad (9)$$
where $N = 722$ is the total number of samples, $N_C = 3$ is the number of classes, and $x_{ii}$ is the i-th diagonal element of the confusion matrix, yielding acc ≈ 0.92. The confusion matrix in Fig. 11 also reports summaries of columns and rows, i.e., predicted and true classes, respectively.
We observe that the classification error spans between 4.6% and 13.9% for the Pistol and Rifle classes, respectively. The class Rifle (an actual gunshot from a rifle) is incorrectly classified as either Pistol (5 times) or Shotgun (22 times) in 13.9% of the cases. The same type of analysis can be performed column-wise, where the prediction error spans between 1.4% and 25.2%. As an example, we observe that a prediction of class Shotgun is wrong in 25.2% of the cases (7 times for Pistol and 22 times for Rifle).
Finally, we observe that while the Pistol class is almost always correctly classified, the vast majority of errors occur between the Rifle and Shotgun classes.
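The overall accuracy of Eq. 9 and the row-wise and column-wise error rates discussed above can all be derived from a confusion matrix, as in the following Python sketch. The matrix `toy_cm` contains hypothetical counts and does not reproduce Fig. 11; rows are assumed to be true classes and columns predicted classes, and the function names are illustrative.

```python
def accuracy(cm):
    """Eq. 9: sum of the diagonal divided by the total number of samples."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def row_errors(cm):
    """Per true class: fraction of its samples assigned to another class."""
    return [1.0 - row[i] / sum(row) for i, row in enumerate(cm)]


def column_errors(cm):
    """Per predicted class: fraction of its predictions that are wrong."""
    errors = []
    for j in range(len(cm)):
        column_total = sum(row[j] for row in cm)
        errors.append(1.0 - cm[j][j] / column_total)
    return errors


# Hypothetical counts, NOT the values of Fig. 11.
# Order of rows and columns: Pistol, Rifle, Shotgun.
toy_cm = [
    [90, 5, 5],
    [4, 80, 16],
    [6, 10, 84],
]
acc = accuracy(toy_cm)
per_true = row_errors(toy_cm)
per_predicted = column_errors(toy_cm)
```

In this toy example the Rifle row has the largest classification error (20 of 100 samples), while the Shotgun column has the largest prediction error (21 of 105 predictions), which mirrors the kind of row-wise versus column-wise reading applied to Fig. 11.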

B. Caliber Identification
In this section, we report the performance of our classification algorithm when considering 7 different calibers from Table II. We group the video chunks based on gun caliber, obtaining 7 different classes, namely, 12, 357M, 44M, 45acp, 556NATO, 762x39, and 9mm. Figure 12 shows the confusion matrix computed as the average of 50 training and validation runs. The overall accuracy computed according to Eq. 9 is acc ≈ 0.9. The best and worst performance are achieved by 9mm and 762x39, respectively. In particular, class 762x39 is wrongly predicted 8 times as class 556NATO. Classes 556NATO and 762x39 are intrinsically similar, since both belong to class Rifle, and are therefore prone to be confused. Nevertheless, we observe that this phenomenon is very limited, since we have 4 cases of 556NATO classified as 762x39, and 8 cases for the opposite configuration. We also observe that the 556NATO and 762x39 classes experience a significant number of misclassifications with classes 12 and 357M. Conversely, class 9mm is the most likely to be correctly classified.
Highlights. By combining Fig. 11 and Fig. 12 we can draw some interesting remarks. The Rifle class is misclassified as the Shotgun class 20 times in Fig. 11 (the opposite happens 12 times), and 6+8=14 times in Fig. 12 (the opposite happens 5+2=7 times). We believe that the error is not due to a specific caliber (either 556NATO or 762x39), but to the similarity between the features of the Shotgun and Rifle classes.
The Pistol class is also misclassified as the Rifle class 15 times. By looking into the details of Fig. 12, we observe that the major source of misclassifications is the 357M class, classified 4 times as 556NATO and 1 time as 762x39. We observe that the 357M is the most powerful among the pistol calibers; hence, it is the closest to the Rifle class in terms of bullet size, pressure, and barrel diameter.
Finally, we observe that our solution is particularly robust in detecting pistols. In particular, one of the most widely adopted calibers worldwide (9mm) is characterized by a very limited number of misclassifications (11 out of 167 in total). The same considerations apply to classes 44M and 45acp.
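The bookkeeping behind sums such as 6+8=14, where caliber-level errors are aggregated into category-level errors, can be sketched as follows. The caliber-to-category mapping follows the grouping used in the text (12 as Shotgun, 556NATO and 762x39 as Rifle, the remaining calibers as Pistol); the counts in `toy_caliber_cm` are hypothetical and do not reproduce Fig. 12, and the function name is illustrative.

```python
CALIBER_TO_CATEGORY = {
    "12": "Shotgun",
    "357M": "Pistol",
    "44M": "Pistol",
    "45acp": "Pistol",
    "556NATO": "Rifle",
    "762x39": "Rifle",
    "9mm": "Pistol",
}


def collapse_confusion(cm, labels, mapping):
    """Sum fine-grained counts into (true category, predicted category)."""
    coarse = {}
    for i, true_label in enumerate(labels):
        for j, predicted_label in enumerate(labels):
            key = (mapping[true_label], mapping[predicted_label])
            coarse[key] = coarse.get(key, 0) + cm[i][j]
    return coarse


labels = ["12", "357M", "44M", "45acp", "556NATO", "762x39", "9mm"]

# Hypothetical counts, NOT the values of Fig. 12 (rows: true, columns: predicted).
toy_caliber_cm = [
    [50, 0, 0, 0, 3, 2, 0],
    [1, 40, 1, 0, 2, 1, 0],
    [0, 1, 45, 1, 0, 0, 0],
    [0, 0, 1, 48, 0, 0, 1],
    [4, 1, 0, 0, 60, 3, 0],
    [5, 1, 0, 0, 6, 55, 0],
    [0, 0, 0, 1, 0, 0, 70],
]
coarse = collapse_confusion(toy_caliber_cm, labels, CALIBER_TO_CATEGORY)
```

In this toy example the Rifle to Shotgun entry is 4+5=9 and the Shotgun to Rifle entry is 3+2=5. Note that a collapsed caliber matrix is not expected to match the category matrix exactly, since the two classifiers are trained on different label sets.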

C. Model Identification
In this section, we consider all of the gun models previously introduced in Table II with the aim of classifying each of them. The total number of classes is 59, which is the number of gun models considered throughout this paper. We report the confusion matrix associated with this classification in Appendix B. The accuracy is acc ≈ 0.90 and the number of misclassifications per model never exceeds 2. We observe that class 38 (Ruger GP100 Match Champion) is never correctly classified. Finally, we highlight that the number of samples available for the validation process is small (20% of the samples of each gun model in Table II). Nevertheless, the diagonal of the matrix in Appendix B collects the vast majority of the samples, confirming the effectiveness of our model. We are confident that a larger dataset can increase the accuracy and effectiveness of gun model detection from gunshot sounds.

D. Testing
To validate our methodology, we tested the model against a new set of audio samples taken from videos different from the ones considered before, with varying conditions, including the background noise and the relative positions of the microphone and the shooter. We consider a total of 115 audio samples constituted by 13 Pistol (Beretta 92 FS), 59 Rifles (Ruger AR, Daniel Defense M4 A1 SOCOM, Maadi AK-47) and 44 Shotguns (Maverick 88, Winchester Model 300 Defender). We observe that Pistol and Rifle classification is characterized by high performance, with only 4 Rifle samples misclassified as Pistol. As for the Shotgun class, we highlight that the two shotguns considered are not in the training set (Table II), because we did not find any valid samples from additional videos of the models in the training set. Although the audio samples come from different shotgun models, our algorithm can still detect the caliber with high probability (only 8 audio samples are misclassified), which supports the effectiveness and correctness of our algorithm. Finally, we observe that the overall accuracy is consistent with the validation process and is about 0.9.

VIII. CONCLUSION
Although scenarios requiring in-depth digital forensics of gunshots are countless, including military operations, mass shootings, and others, current solutions are far from reaching an adequate accuracy under real conditions.
In this paper, we have proposed an effective and efficient methodology to uniquely fingerprint gunshots, enabling the identification of the category, caliber, and model of the gun with an accuracy higher than 90% regardless of the capture conditions. Unlike existing solutions, our technique requires neither an ad-hoc deployment of microphone networks nor a specific sample quality, and is agnostic to the microphone position with respect to the shooter. We have demonstrated that forensic analysis in the time-frequency domain of a single gunshot audio sample recorded by a commercial microphone (44100 samples per second) can be effectively used to infer the gun model (and other related characteristics). The proposed solution may lead to new insights and further developments in the area of weapon classification, considering more samples, different noise levels, and a much larger weapon database.