Normality of compression algorithms
The definition of a normal compressor deals with asymptotic behavior, allowing for an \(O(\log (n))\) discrepancy in the axioms of idempotence, monotonicity, symmetry, and distributivity. Thus, in theory, experimental validation (or refutation) of these axioms is not truly feasible – perhaps the behavior changes when the file size is beyond that of the largest file in our experiment. Nonetheless, we endeavor to experimentally explore these axioms more extensively than has been done in prior work.
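For reference, the four axioms can be written out as follows. This restates the standard definition of a normal compressor in the size notation used in this section. Each relation is understood to hold up to an additive \(O(\log (n))\) term, where n is taken here to be the length of the longest string involved.

\[
\begin{aligned}
\text{Idempotence:} \quad & |C(XX)| = |C(X)|, \quad |C(\lambda)| = 0 \ (\lambda \text{ the empty string}) \\
\text{Monotonicity:} \quad & |C(XY)| \ge |C(X)| \\
\text{Symmetry:} \quad & |C(XY)| = |C(YX)| \\
\text{Distributivity:} \quad & |C(XY)| + |C(Z)| \le |C(XZ)| + |C(YZ)|
\end{aligned}
\]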
Data We combined the traditional Calgary Corpus with the Large and Standard Canterbury Corpora, as well as the Silesia Corpus (see footnote 1). The latter contains files ranging in size from 6 MB to 51 MB, greatly expanding the size distribution beyond that of the corpus explored in [3].
Idempotence Figures 1 and 2 show the difference between the sizes of C(X) and C(XX), alongside \(\log (|XX|)\), for a representative subset of files X in the dataset, with C ranging over the compression algorithms bzip2 [17], lzma [16], PPMZ [2], and zlib [10]. bzip2 (bz2 below) and zlib clearly fail the idempotence axiom: |C(XX)| grows much faster than |C(X)|, and a term of \(O(\log (|XX|))\) cannot account for the difference. PPMZ and lzma appear significantly better for smaller file sizes, but for them too the difference grows much faster than \(\log (|XX|)\), as is apparent in Fig. 2. lzma makes a large jump around 8 MB, but even before that its growth far exceeds that of the \(\log \) function.
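The idempotence measurement described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the figures: it assumes the standard-library `bz2`, `lzma`, and `zlib` modules with their default settings, omits PPMZ (which has no standard-library binding), and takes the logarithm to base 2, which the text does not specify. The helper names are hypothetical.

```python
import bz2
import lzma
import math
import os
import zlib

# Standard-library compressors with default settings (an assumption);
# PPMZ is omitted because Python ships no binding for it.
COMPRESSORS = {
    "bz2": bz2.compress,
    "lzma": lzma.compress,
    "zlib": zlib.compress,
}


def csize(compress, data: bytes) -> int:
    """Return |C(data)|, the compressed length in bytes."""
    return len(compress(data))


def idempotence_gap(compress, x: bytes) -> int:
    """Return |C(XX)| - |C(X)|; close to zero for an idempotent compressor."""
    return csize(compress, x + x) - csize(compress, x)


def log_slack(x: bytes) -> float:
    """Return log2(|XX|), the scale of the permitted discrepancy (base assumed)."""
    return math.log2(2 * len(x))


if __name__ == "__main__":
    x = os.urandom(1 << 16)  # 64 KiB of incompressible stand-in data
    for name, compress in COMPRESSORS.items():
        print(name, idempotence_gap(compress, x), round(log_slack(x), 1))
```

Running the sketch on files of increasing size, and comparing the gap against the logarithmic slack, would reproduce the kind of comparison plotted in Figs. 1 and 2.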
Symmetry Figure 3 shows the magnitude of the difference between |C(XY)| and |C(YX)|. In most cases at this scale, the difference was bounded by \(\log (|XY|)\) (and in all cases by a small constant factor thereof), but the asymptotic behavior is unclear, as values for all four compressors spike wildly. This is likely because the degree of symmetry depends on the compressibility, similarity, and/or size disparity of the two files involved. zlib and lzma look quite promising for symmetry, while the asymptotic behavior of PPMZ and bz2 is not discernible.
Distributivity The difference between \(|C(XY)| + |C(Z)|\) and \(|C(XZ)| + |C(YZ)|\) is shown in Figs. 4 and 5. As required by the distributivity property, these values are consistently non-negative for lzma and zlib. While bz2 and PPMZ go significantly negative in one or two cases, their asymptotic behavior is unclear.
Monotonicity As shown in Fig. 6, all four compressors solidly satisfy the monotonicity property, with \(|C(XY)| - |C(X)| > 0\) in all cases.
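The symmetry, distributivity, and monotonicity quantities discussed above can be computed in the same way. The sketch below is illustrative only: the function names are hypothetical, zlib is used as a default stand-in compressor, and the sign of the distributivity margin is chosen so that a non-negative value means the axiom holds, which is an assumption about how the figures are plotted.

```python
import zlib
from typing import Callable

Compressor = Callable[[bytes], bytes]


def csize(data: bytes, compress: Compressor = zlib.compress) -> int:
    """Return |C(data)|, the compressed length in bytes."""
    return len(compress(data))


def symmetry_gap(x: bytes, y: bytes, compress: Compressor = zlib.compress) -> int:
    """Return the magnitude of |C(XY)| - |C(YX)|."""
    return abs(csize(x + y, compress) - csize(y + x, compress))


def distributivity_margin(
    x: bytes, y: bytes, z: bytes, compress: Compressor = zlib.compress
) -> int:
    """Return (|C(XZ)| + |C(YZ)|) - (|C(XY)| + |C(Z)|).

    Non-negative when distributivity holds (sign convention assumed).
    """
    return (
        csize(x + z, compress)
        + csize(y + z, compress)
        - csize(x + y, compress)
        - csize(z, compress)
    )


def monotonicity_margin(x: bytes, y: bytes, compress: Compressor = zlib.compress) -> int:
    """Return |C(XY)| - |C(X)|; non-negative when monotonicity holds."""
    return csize(x + y, compress) - csize(x, compress)
```

Passing `bz2.compress` or `lzma.compress` as the `compress` argument applies the same checks to the other standard-library compressors.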
Our experiments have shown serious violation of the idempotence axiom that has been used to prove theoretical properties of NCD, leaving a potential gap between theory and practice. The next section explores the extent to which NCD can be useful in spite of this gap.
Classification using NCD with abnormal compressors
We have demonstrated that none of the compression algorithms we explored satisfy the requirements for normal compression. The question remains whether this contraindicates their use with NCD. As mentioned above, much previous work has demonstrated NCD’s utility with some of these compression algorithms in applications with small file sizes. However, the compressors’ deviation from normality grows with file size. Do they remain useful with larger files?
To address this question, we explored the accuracy of NCD in identifying the malware family of APK files from the Android Malware Genome Project dataset [24, 25]. In particular, we took a subset of 500 samples from the Geinimi, DroidKungFu3, DroidKungFu4, and GoldDream families (see footnote 2). We used the complete raw APK files, without modification, as our samples. Geinimi samples in this dataset have sizes up to 14.1 MB, DroidKungFu3 up to 15.4 MB, DroidKungFu4 up to 11.2 MB, and GoldDream up to 6.4 MB.
We evaluated NCD with the same four compression algorithms as above, using a nearest neighbor classifier [8] with a single (randomly selected) instance of each malware family in the reference set (see footnote 3). Note that we intentionally restricted the reference set to make the classification problem difficult, in order to explore the limitations of the compression algorithms when used with NCD. Results are shown in Fig. 7. In spite of clearly violating the idempotence property, both lzma and PPMZ performed significantly better than random guessing. In line with their relative normality, lzma performed best at 59.7 %, with PPMZ next at 44.4 %. Although bz2 is slightly closer to satisfying the idempotence property than zlib, zlib slightly outperformed bz2, with accuracies of 33.3 % and 29.8 %, respectively; neither performed much better than random guessing.
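A nearest neighbor classifier of this kind can be sketched as below. The sketch assumes the usual definition of NCD, \((|C(XY)| - \min (|C(X)|, |C(Y)|)) / \max (|C(X)|, |C(Y)|)\), and uses Python's `lzma` module as a stand-in compressor. The function names and the dictionary-based reference set are hypothetical, and this is not the implementation used in the experiments.

```python
import lzma
from typing import Callable, Dict

Compressor = Callable[[bytes], bytes]


def ncd(x: bytes, y: bytes, compress: Compressor = lzma.compress) -> float:
    """Normalized compression distance between x and y (standard formula assumed)."""
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(
    sample: bytes,
    references: Dict[str, bytes],
    compress: Compressor = lzma.compress,
) -> str:
    """Return the family whose single reference file is nearest to the sample."""
    return min(references, key=lambda family: ncd(sample, references[family], compress))
```

With one reference file per family, `references` would map each of the four family names to the raw bytes of its reference APK, and `classify` would be called once per remaining sample.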
To demonstrate the relevance of file size, we performed the same test with one slight change, this time using only reference samples smaller than 200 KB. We saw drastic improvement with bz2 (now 75.4 %), lzma (82.5 %), and PPMZ (66.7 %), while zlib’s performance actually got worse (29.2 %).
Finally, looking only at files smaller than 200 KB yielded improved performance by bz2 (89.7 %), zlib (37.9 %), and PPMZ (75.9 %), but lzma actually performed slightly worse (75.9 %). The latter suggests that file size is not the only factor that can inhibit the performance of a compression algorithm with NCD. Notably, bz2 outperformed lzma on these files. These results are shown in Fig. 8.
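The two size-restricted variants of the experiment can be sketched as follows. This is a hedged illustration under several assumptions: 200 KB is taken to mean 200 × 1024 bytes, zlib stands in for the compressor, and the reference samples are not removed from the test set. The text does not specify any of these details, and all names are hypothetical.

```python
import random
import zlib
from typing import Dict, List, Optional, Tuple

Sample = Tuple[str, bytes]  # (family label, raw file bytes)
SIZE_LIMIT = 200 * 1024  # "200 KB" interpreted as 204800 bytes (assumption)


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance using zlib as a stand-in compressor."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def pick_references(
    samples: List[Sample], max_size: Optional[int], rng: random.Random
) -> Dict[str, bytes]:
    """Pick one random reference per family, optionally only among small files."""
    pools: Dict[str, List[bytes]] = {}
    for family, data in samples:
        if max_size is None or len(data) < max_size:
            pools.setdefault(family, []).append(data)
    return {family: rng.choice(pool) for family, pool in pools.items()}


def accuracy(
    samples: List[Sample], references: Dict[str, bytes], max_size: Optional[int] = None
) -> float:
    """Fraction of (optionally size-filtered) samples assigned to their own family."""
    tested = [(f, d) for f, d in samples if max_size is None or len(d) < max_size]
    if not tested:
        raise ValueError("no samples satisfy the size limit")
    correct = sum(
        1
        for family, data in tested
        if min(references, key=lambda r: ncd(data, references[r])) == family
    )
    return correct / len(tested)
```

Calling `pick_references(samples, SIZE_LIMIT, rng)` with `accuracy(samples, references)` corresponds to restricting only the reference samples, while additionally passing `max_size=SIZE_LIMIT` to `accuracy` corresponds to restricting every file.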