To validate the ideas proposed in the previous section, an audio dataset has been built, aiming to represent sources that are as heterogeneous as possible. To this aim, the database includes uncompressed audio files belonging to four different categories: Music, royalty-free music tracks covering five different musical styles [16]; Speech, audio files containing dialogues; Outdoor, audio files recorded outdoors; and Commercial, files containing dialogues combined with music, as often happens in advertising. Each category contains about 17 min of audio. Throughout all the experiments, we employed the latest release of the LAME codec (available at http://lame.sourceforge.net), namely version 3.99.5. This choice was motivated by the widespread adoption of this codec and by the fact that it is an open-source project.
4.1 Double compression detection and first compression bit-rate classification
In order to evaluate the performance of our method in detecting double compression, we divided the audio of each category into 250 segments of 4 s each, for a total of 1,000 uncompressed audio files. Each file has been compressed, in dual mono, with a bit-rate BR1 chosen in [64, 96, 128, 160, 192] kbit/s, obtaining 5,000 singly compressed MP3 files. Finally, these files have been compressed again using as BR2 one of the previous bit-rate values (the same value as the first one was also considered), obtaining 25,000 doubly compressed MP3 files. Among these, 10,000 files have a positive difference Δ = BR2 - BR1 between the second and the first bit-rate, taking values in [32, 64, 96, 128] kbit/s; 10,000 files have a negative difference Δ, taking values in [-128, -96, -64, -32] kbit/s; and 5,000 files have Δ = 0. The overall dataset is thus composed of 30,000 MP3 files, including 5,000 singly compressed files.
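As a rough illustration of this generation procedure, the following sketch uses the LAME command-line tool (its -b, --cbr, and --decode options) to produce the singly and doubly compressed versions of one uncompressed segment; file names are hypothetical, and the dual-mono channel option is omitted for brevity.

```python
import subprocess

BITRATES = [64, 96, 128, 160, 192]  # kbit/s, candidate values for BR1 and BR2

def encode(src_wav, dst_mp3, bitrate):
    # Constant bit-rate MP3 compression with the LAME command-line tool.
    subprocess.run(["lame", "--cbr", "-b", str(bitrate), src_wav, dst_mp3], check=True)

def decode(src_mp3, dst_wav):
    # Decode an MP3 file back to PCM so that it can be compressed again.
    subprocess.run(["lame", "--decode", src_mp3, dst_wav], check=True)

for br1 in BITRATES:
    # Singly compressed version of one uncompressed 4-s segment.
    encode("segment.wav", f"single_{br1}.mp3", br1)
    decode(f"single_{br1}.mp3", f"single_{br1}.wav")
    for br2 in BITRATES:
        # Doubly compressed versions, Delta = BR2 - BR1 (Delta = 0 included).
        encode(f"single_{br1}.wav", f"double_{br1}_{br2}.mp3", br2)
```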
As a first experiment, we computed the values assumed by the proposed feature for all the 30,000 files belonging to the test dataset. The chi-square distance has been calculated by evaluating the MDCT histograms on 2,000 bins, with step size equal to one. In Figure 7, the chi-square distances are visualized: singly compressed files and doubly compressed files with negative or zero Δ show a distance D near zero, whereas the other files have D considerably larger than zero. By comparing D with a threshold τ, it is thus possible to discriminate doubly compressed files with Δ > 0 from all the other files.
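For clarity, a minimal sketch of how such a distance can be computed from two sets of quantized MDCT coefficients is given below; the histogram layout (2,000 unit-width bins) follows the text, while the exact chi-square formulation, the use of absolute coefficient values, and the small regularization constant are assumptions about details not spelled out here.

```python
import numpy as np

def chi_square_distance(observed_mdct, simulated_mdct, n_bins=2000):
    """Chi-square distance between the histograms of two sets of quantized
    MDCT coefficients, using n_bins bins with unit step size."""
    edges = np.arange(n_bins + 1)             # bin edges 0, 1, ..., n_bins
    h_obs, _ = np.histogram(np.abs(observed_mdct), bins=edges, density=True)
    h_sim, _ = np.histogram(np.abs(simulated_mdct), bins=edges, density=True)
    eps = 1e-12                               # guard against empty bins
    return float(np.sum((h_obs - h_sim) ** 2 / (h_obs + h_sim + eps)))
```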
By adopting a variable threshold τ, we then computed a receiver operating characteristic (ROC) curve, representing the capability of the detector to separate singly compressed from doubly compressed MP3 files (both Δ > 0 and Δ ≤ 0). The resulting ROC curve is shown in Figure 8 (left): it reflects the bimodal distribution of the distances of doubly compressed files (blue and green in Figure 7) and highlights that the detector is able to distinguish only one of the two components (the one with Δ > 0). If we separate the previous ROC into one relating to files doubly compressed with positive Δ and one relating to files doubly compressed with negative or zero Δ, as shown in Figure 8 (right), we obtain an almost perfect classifier when doubly compressed files with positive Δ are considered, while for the cases with negative or zero Δ, the detector is close to a random classifier.
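Such a curve is obtained simply by sweeping τ over the observed distances; a sketch using scikit-learn is given below. The distance values here are synthetic stand-ins generated for illustration only, not the measured data.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
# Illustrative stand-ins for the measured distances: singly compressed files
# concentrate near zero, doubly compressed ones spread towards larger values.
d_single = np.clip(rng.normal(0.004, 0.002, 1000), 0, None)
d_double = np.clip(rng.normal(0.015, 0.008, 5000), 0, None)

distances = np.concatenate([d_single, d_double])
labels = np.concatenate([np.zeros(len(d_single)), np.ones(len(d_double))])

# Sweeping the threshold tau over the observed distances traces the ROC curve.
fpr, tpr, _ = roc_curve(labels, distances)
print("AUC =", auc(fpr, tpr))
```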
As anticipated in Section 2, the distance representing the proposed feature assumes a large range of values, as can be clearly observed in Figure 7. In order to highlight the relationship between the values taken by the feature and the compression parameters (i.e., BR2 and Δ), we examined in more detail the 10,000 doubly compressed files with positive Δ, plotting their chi-square distances in Figure 9. The values are plotted according to the different Δ: in particular, there are 4,000 files with Δ = 32 (yellow), 3,000 files with Δ = 64 (violet), 2,000 files with Δ = 96 (sky-blue), and finally 1,000 files with Δ = 128 (black), grouped by BR2: [96, 128, 160, 192]. Figure 9 shows that the values of the chi-square distance tend to cluster for different Δ factors and different BR2 values (see points plotted with the same color). On the one hand, increasing Δ values correspond to increasing chi-square distances between the observed and simulated distributions. On the other hand, for a given value of Δ, slightly different distance values are obtained for different bit-rates of the second compression. This suggests the possibility of optimizing the detector by considering a specific threshold for each bit-rate of the second compression (a parameter observable from the bitstream). Moreover, the different values of the feature for different Δ factors can be used to classify the bit-rate of the first compression, as detailed in the following sections.
Regarding detection, we performed a set of experiments in order to compare the detection accuracy of the proposed scheme with that of the methods proposed in [3, 4] by Liu et al. (LSQ method) and in [2] by Yang et al. (YSH method). The corresponding results are shown in Tables 1, 2, and 3, respectively. The classifiers have been trained on 80% of the dataset and tested on the remaining 20%. Results have been averaged over 20 independent trials. In our method, different thresholds have been employed for different BR2 values, while for the other methods, different SVMs have been trained for different BR2 values. The proposed method achieves nearly optimal performance for all combinations with Δ > 0. The other methods also generally achieve good performance for Δ > 0, especially the LSQ method. Conversely, for Δ < 0, the proposed method is not able to reliably detect double MP3 compression, whereas the other methods perform better.
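A possible way to select one detection threshold per BR2 from the training portion of the data is sketched below; the exhaustive search over candidate thresholds is an assumption (the text does not prescribe a specific selection rule), and D and y are hypothetical containers holding the training distances and labels per BR2.

```python
import numpy as np

def best_threshold(train_distances, train_labels):
    """Threshold tau maximizing training accuracy for one BR2 value
    (labels: 1 = doubly compressed, 0 = singly compressed)."""
    candidates = np.unique(train_distances)
    accuracies = [np.mean((train_distances > t) == train_labels) for t in candidates]
    return candidates[int(np.argmax(accuracies))]

# One threshold per second-compression bit-rate (observable from the bitstream):
# thresholds = {br2: best_threshold(D[br2], y[br2]) for br2 in [64, 96, 128, 160, 192]}
```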
Table 1
Detection accuracy of the proposed method for different bit-rates
Table 2
Detection accuracy of LSQ method for different bit-rates
Table 3
Detection accuracy of YSH method for different bit-rates
All the results shown in Tables 1, 2, and 3 have been achieved considering audio tracks 4 s long. We evaluated the degradation of the performance of the three detectors when the duration of the audio segments is reduced from 4 s to [2, 1, 1/2, 1/4, 1/8, 1/16] s. The reason behind this experiment is that analyzing very small portions of audio potentially opens the door to fine-resolution splicing localization (i.e., detecting whether part of an audio file has been tampered with). In practice, instead of taking all the MDCT coefficients of the 4-s segment, only the coefficients belonging to a subpart of the segment are retained, where the subpart is one half, one fourth, and so on. For BR2 = 192 kbit/s and BR2 = 128 kbit/s, the detection accuracies (averaged with respect to BR1) are plotted against the audio file duration in Figure 10 for our method, the LSQ method, and the YSH method.
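In code, this reduction amounts to keeping only a prefix of the per-segment coefficients proportional to the desired duration; a minimal sketch, assuming the coefficients are stored one granule per row in temporal order, is:

```python
def truncate_coefficients(mdct_coeffs, fraction):
    """Keep only the MDCT coefficients of the first `fraction` of a segment,
    e.g. fraction = 1/8 simulates a 0.5-s window on a 4-s segment.
    Assumes the rows of `mdct_coeffs` are granules in temporal order."""
    n_rows = max(1, int(len(mdct_coeffs) * fraction))
    return mdct_coeffs[:n_rows]
```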
The proposed method and LSQ achieve a nearly constant detection performance down to 1/8-s audio segments, whereas the performance of the YSH method drops for audio segments shorter than 2 s. Our method achieves very good performance in the case of high-quality MP3 files: indeed, for BR2 equal to 192 kbit/s, our method achieves an almost perfect classification performance irrespective of the audio segment duration. Conversely, for BR2 equal to 128 kbit/s, even if the performance is only slightly affected by the segment duration, the proposed method remains inferior to the other two methods.
By taking into account the feature distribution highlighted in Figure 9, we then considered the capability of the proposed feature to classify a doubly compressed file according to the first compression bit-rate, as BR1 = BR2 - Δ. A nearest neighbour classifier has been adopted for each different BR2, and the corresponding classification accuracy results are shown in Table 4 for BR2 = 192 kbit/s and Table 5 for BR2 = 128 kbit/s. The rows of the tables represent the actual bit-rate of the first compression and the columns the values assigned by the classifier. A comment about this experiment is in order: as shown in the previous section, the proposed method can hardly detect double compression for negative or zero Δ. Similarly, the output of the classifier on doubly encoded files with negative or zero Δ is not reliable. In particular, since a singly compressed file can be regarded as having undergone a (virtual) first compression at infinite quality, the classifier cannot distinguish between singly encoded files and doubly encoded files with negative Δ.
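A minimal sketch of this classification step is given below; using the mean training distance per (BR2, Δ) pair as the nearest neighbour prototypes is only one plausible choice, introduced here for illustration.

```python
def classify_br1(distance, br2, prototypes):
    """Nearest neighbour classification of the first compression bit-rate.
    `prototypes` maps (br2, delta) -> mean chi-square distance observed on
    training files doubly compressed with that (BR2, Delta) combination."""
    deltas = [d for (b, d) in prototypes if b == br2]
    best_delta = min(deltas, key=lambda d: abs(distance - prototypes[(br2, d)]))
    return br2 - best_delta  # BR1 = BR2 - Delta
```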
Table 4
Classification accuracy of the proposed method for BR2 = 192
Table 5
Classification accuracy of the proposed method for BR2 = 128
For comparison, only the LSQ method is considered, since the YSH method was not proposed for compression classification. The corresponding results obtained on 4-s audio segments for BR2 = 192 and 128 kbit/s are shown in Tables 6 and 7, respectively. As in the case of detection, we examined how the classification performance varies with the audio file duration. The average classification accuracy for different BR2 values (i.e., [192, 160, 128, 96, 64] kbit/s) and decreasing audio file durations (i.e., [4, 2, 1, 1/2, 1/4, 1/8, 1/16] s) is shown in Figure 11 for both our method (top) and the LSQ method (bottom). The accuracy is averaged over every possible BR1 in the dataset, thus providing a fair overall index of classification accuracy, which can be used to compare different methods and shows a clear performance trend for different audio segment lengths.
Table 6
Classification accuracy of LSQ method for BR2 = 192
Table 7
Classification accuracy of LSQ method for BR2 = 128
In this scenario, LSQ appears to have a better classification performance, even if the proposed method performs reasonably well at the higher bit-rates. It is also worth noting that the performance of both methods suffers only a slight degradation for the shorter audio segments, which suggests good localization capabilities.
4.2 Tampering localization
As explained in Section 3, the proposed method for double compression analysis can be used as a tool for audio forgery localization. Although this possibility was not considered in the respective papers, the previous experiments suggest that the YSH and LSQ methods can also be used for such a task, by analyzing the track with small windows. We thus designed a set of experiments to test the applicability of each of the considered algorithms to forgery localization: we divided the uncompressed files mentioned at the beginning of this section into segments of 10 s each, obtaining 100 files. Among these, 60 were chosen at random to build a training set, while the rest were used as the test set. Since the goal is to localize tampered segments within a file, test files were created as follows:
1. The file was MP3 compressed at a bit-rate BR1 ∈ {96, 128, 160} kbit/s.
2. The file was decoded and a portion of 1 s, located at the center of the track, was replaced with the same-positioned samples coming from the uncompressed file.
3. The resulting track was re-compressed at a bit-rate BR1 + Δ, with Δ taking values in {-32, 0, 32} kbit/s.
In such a way, we created a cut-and-paste tampering that is virtually undetectable by a human listener (the pasted content is the same, just without the first compression); this also avoids helping the detector by introducing abrupt changes in the audio content.
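A sketch of this construction for a single test file is shown below; it relies on the soundfile package and the LAME tool, the file names are hypothetical, and the encoder delay introduced by compression (which a real implementation would have to compensate for) is ignored.

```python
import subprocess
import soundfile as sf

def lame(*args):
    subprocess.run(["lame", *args], check=True)

def make_tampered(orig_wav, br1, br2, out_mp3):
    # 1. First compression at BR1, then decoding back to PCM.
    lame("-b", str(br1), orig_wav, "first.mp3")
    lame("--decode", "first.mp3", "first.wav")
    # 2. Replace the central 1-s portion with the same-positioned samples
    #    taken from the uncompressed file.
    x, sr = sf.read("first.wav")
    orig, _ = sf.read(orig_wav)
    mid = min(len(x), len(orig)) // 2
    x[mid - sr // 2: mid + sr // 2] = orig[mid - sr // 2: mid + sr // 2]
    sf.write("tampered.wav", x, sr)
    # 3. Second compression at BR2 = BR1 + Delta.
    lame("-b", str(br2), "tampered.wav", out_mp3)
```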
The above procedure creates files where 1/10 of the track is tampered and the rest is untouched. While this scenario is reasonable for testing localization capabilities, it is not well suited to training a classifier (needed for the YSH and LSQ methods), for which a more balanced distribution of positive and negative examples is preferable. Keeping in mind that each file will be analyzed using small windows, and that each window will be classified as singly or doubly encoded, we generated the training dataset as follows:
1. The file was MP3 compressed at a bit-rate BR1 ∈ {96, 128, 160} kbit/s.
2. The file was decoded and the part of the track between second 5 and second 6 was cut and appended at the end of the file.
3. The resulting track was re-compressed at a bit-rate BR1 + Δ, with Δ taking values in {-32, 0, 32} kbit/s.
In such a way, samples in the first half of the track will show traces of double encoding, while samples in the second half will not, due to the induced misalignment of the quantization pattern.
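The corresponding manipulation on the decoded samples can be sketched as follows (again ignoring encoder delay compensation):

```python
import numpy as np

def cut_and_append(samples, sample_rate):
    """Move the portion between second 5 and second 6 to the end of the track,
    so that every sample after second 5 loses its alignment with the
    quantization grid of the first compression."""
    start, stop = 5 * sample_rate, 6 * sample_rate
    cut = samples[start:stop]
    rest = np.concatenate([samples[:start], samples[stop:]])
    return np.concatenate([rest, cut])
```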
Similarly to what we did in the previous experiments, it is of interest to evaluate the localization performance of each algorithm for different sizes of the analysis window: a smaller size yields noisier measurements but higher temporal resolution, allowing the analyst to detect subtle modifications, like turning a word ‘yes’ into a ‘no’ in a voice recording. The set {1/4, 1/8, 1/16} s was then chosen as the possible values for the size of the window. When analyzing a file, the analyst knows both the size of the window he wants to employ and the bit-rate of the last compression undergone by the file. In light of this, the training procedure for the LSQ and YSH algorithms can be carried out separately, creating an SVM for each BR2 (in our case, BR2 ∈ {64, 96, 128, 160, 192}) and for each size of the analysis window. We chose RBF kernels and used fivefold cross validation to determine the best values for C ∈ {2^3, 2^4, …, 2^12} and γ ∈ {2^-4, 2^-3, …, 2^5}. Concerning the proposed algorithm, we used the training samples to get a good initialization point for the expectation-maximization algorithm: specifically, we computed the average value of the χ2 distance obtained for singly compressed and doubly compressed sequences available in the training set, resulting in μ1 = 0.004 and μ2 = 0.015, respectively. The algorithm stops when either the log likelihood stabilizes (difference between two iterations lower than 10^-15) or a maximum of 500 iterations is reached.
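For the SVM-based baselines, the cross-validated grid search described above can be realized with standard tools; a sketch using scikit-learn is given below, where X and y are hypothetical per-window training feature matrices and labels.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [2.0 ** k for k in range(3, 13)],      # 2^3 ... 2^12
    "gamma": [2.0 ** k for k in range(-4, 6)],  # 2^-4 ... 2^5
}
# One SVM per (BR2, window size) pair; X, y would hold the per-window
# training features and the singly/doubly compressed labels.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X, y)
```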
The performance of the YSH and LSQ algorithms was evaluated as follows: given a test file, the proper SVM was selected based on the observed bit-rate and the chosen window size; then the file was decoded and its coefficients were classified by moving the analysis window. After repeating the same approach for all the files, we computed the probability of correctly classifying a window as tampered, denoted by Pr(T), and the probability of correctly classifying a window as original, denoted by Pr(O). Since by using probabilities we normalize for the different size of each class, the final accuracy can simply be calculated as

Acc = (Pr(T) + Pr(O)) / 2.
A similar approach was used to evaluate the performance of the proposed algorithm. After running the EM algorithm, if a mixture of two Gaussians was found, we labelled as tampered those windows belonging to the component with the lower mean; if only one Gaussian component was found, all the windows were classified as untouched. Finally, the localization accuracy of the algorithm was computed with the same formula described above.
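A sketch of this decision stage, based on scikit-learn's Gaussian mixture implementation, is given below. Initializing the means with the training values reported above and choosing between one and two components via BIC are assumptions about implementation details not fully specified in the text.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def localize(window_distances, mu1=0.004, mu2=0.015):
    """Label each analysis window as tampered (True) or untouched (False)."""
    d = np.asarray(window_distances, dtype=float).reshape(-1, 1)
    gm2 = GaussianMixture(n_components=2, means_init=[[mu1], [mu2]],
                          tol=1e-15, max_iter=500).fit(d)
    gm1 = GaussianMixture(n_components=1, tol=1e-15, max_iter=500).fit(d)
    if gm1.bic(d) <= gm2.bic(d):
        # A single Gaussian explains the data best: no tampering detected.
        return np.zeros(len(d), dtype=bool)
    low = int(np.argmin(gm2.means_.ravel()))   # component with the lower mean
    return gm2.predict(d) == low               # tampered = low-distance windows
```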
Results are reported in Table 8 for different sizes of the analysis window. Results are separated according to the difference between the first and second compression bit-rates (main rows), while the value of BR2 is given in the columns (this being the only information available to the analyst). Since the possible values for BR1 were limited to {96, 128, 160} kbit/s, some combinations of BR2 and Δ are not explored. As can be seen, the proposed algorithm yields performance comparable to the YSH method and outperforms LSQ when BR2 is higher than BR1. For zero or negative Δ, consistently with the previous results, the proposed method is not able to discriminate forged regions.
Table 8
Tampering localization accuracy obtained by each algorithm for different window sizes