On normalized compression distance and large malware
Normalized Compression Distance (NCD) is a popular tool that uses compression algorithms to cluster and classify data in a wide range of applications. Existing discussions of NCD’s theoretical merit rely on certain theoretical properties of compression algorithms. However, we demonstrate that many popular compression algorithms do not seem to satisfy these theoretical properties. We explore the relationship between some of these properties and file size, demonstrate that this theoretical problem is actually a practical problem for classifying malware with large file sizes, and propose some variants of NCD that mitigate this problem.
In the era of big data, techniques that allow for data understanding without domain expertise enable more rapid knowledge discovery in the sciences and beyond. One technique that holds such promise is the Normalized Compression Distance (NCD), which is a similarity measure that operates on generic file objects, without regard to their format, structure, or semantics.
NCD approximates the Normalized Information Distance, which is universal for a broad class of similarity measures. Specifically, the NCD measures the distance between two files via the extent to which one can be compressed given the other, and can be calculated using standard compression algorithms.
NCD and its open source implementation, CompLearn, have been widely applied for clustering, genealogy, and classification in a wide range of application areas. Its creators originally demonstrated its application in genomics, virology, languages, literature, music, character recognition, and astronomy. Subsequent work has applied it to plagiarism detection, image distinguishability, machine translation evaluation, database entity identification, detection of internet worms, malware phylogeny, and malware classification, to name a few.
Assuming some simple properties of the compression algorithm used, the NCD has been shown to be, in fact, a similarity metric. However, it remains to be seen whether real-world compression algorithms actually satisfy these properties, particularly in the domain of large files. As data storage has become more affordable, large files have become more common, and the ability to analyze them efficiently has become imperative. Music recommendation systems work with MP3s, which are typically several megabytes in size; medical images may be 30 MB or more; and computer programs are often more than 100 MB in size.
This paper explores the relationship between file size and the behavior of NCD, and proposes modifications to NCD to improve its performance on large files; these improvements are demonstrated on two malware classification problems.
Section 2 provides an introduction to NCD and the compression algorithm axioms that have been used for proving it to be a similarity metric. Section 3 explores the extent to which several popular (and not-so-popular) compression algorithms satisfy these axioms and investigates the impact of file size on the effectiveness of NCD for malware classification. Finally, Sect. 4 proposes two possible adaptations of the NCD definition, for the purpose of improving its performance on large files, and demonstrates significant performance improvement with several compressors on two malware classification problems.
2 NCD Background
The motivating idea behind the Normalized Compression Distance (NCD) is that the similarity of two objects can be measured by the ease with which one can be transformed into the other. This notion is captured formally by the information distance, E(X, Y), between two strings, X, Y, which is the length of the shortest program that can compute Y from X or X from Y in some fixed programming language. The information distance generalizes the notion of Kolmogorov complexity, where K(X) is the length of the shortest program that computes X, and intuitively captures a very general notion of what it means for two objects to be similar.
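Because Kolmogorov complexity is not computable, NCD substitutes the compressed length \(|C(\cdot)|\) produced by a real compressor. The following is a minimal sketch, assuming the standard definition from the NCD literature, \(NCD(X, Y) = (|C(XY)| - \min(|C(X)|, |C(Y)|)) / \max(|C(X)|, |C(Y)|)\), and using Python's standard zlib, bz2, and lzma modules. The helper names `clen`, `ncd`, and `pseudo_random_bytes` are illustrative assumptions and are not taken from CompLearn.

```python
import bz2
import lzma
import random
import zlib
from typing import Callable

Compressor = Callable[[bytes], bytes]


def clen(data: bytes, compress: Compressor = zlib.compress) -> int:
    """Return |C(data)|, the compressed length in bytes."""
    return len(compress(data))


def ncd(x: bytes, y: bytes, compress: Compressor = zlib.compress) -> float:
    """Normalized Compression Distance between byte strings x and y."""
    cx, cy = clen(x, compress), clen(y, compress)
    cxy = clen(x + y, compress)  # XY denotes plain concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def pseudo_random_bytes(n: int, seed: int) -> bytes:
    """Deterministic, essentially incompressible toy data (not real files)."""
    return random.Random(seed).getrandbits(8 * n).to_bytes(n, "little")


if __name__ == "__main__":
    a = pseudo_random_bytes(4096, seed=1)
    b = a[:3584] + pseudo_random_bytes(512, seed=2)  # shares most of a
    c = pseudo_random_bytes(4096, seed=3)            # unrelated to a
    for name, comp in [("zlib", zlib.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)]:
        print(name, round(ncd(a, b, comp), 3), round(ncd(a, c, comp), 3))
```

Values near 0 indicate that one input is largely redundant given the other, and values near 1 indicate little shared structure. Real compressors add headers and other overhead, so an NCD of exactly 0 should not be expected even for identical inputs.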
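A compressor C is said to be normal if it satisfies the following four axioms, up to an additive \(O(\log n)\) term:
Idempotence: \(|C(XX)| = |C(X)|\) and \(|C(\lambda )| = 0\), where \(\lambda \) is the empty string.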
Monotonicity: \(|C(XY)| \ge |C(X)|\).
Symmetry: \(|C(XY)| = |C(YX)|\).
Distributivity: \(|C(XY)| + |C(Z)| \le |C(XZ)| + |C(YZ)|\).
While using these properties to prove that NCD is a similarity metric is well beyond the scope of this paper, it may be worthwhile to offer some intuition about the role they play. Symmetry and distributivity correspond closely to the properties that comprise the definition of a metric. Simplest is the property of symmetry: it makes little sense to talk about the distance between two objects if that distance changes depending on which object one starts with. Somewhat less intuitively, the distributivity property is related to the triangle inequality, which essentially says that the shortest distance between two objects is a straight line. The monotonicity property provides for consistent behavior of compression, assuring that if you add more data, the compressed size doesn’t decrease. Finally, the idempotence property says simply that if an object comprises a simple duplication of a smaller object, the compression algorithm should be able to take advantage of that, and come close to compressing it to the size to which it can compress the smaller object. (E.g., if a file contains the string “abcdefg.abcdefg.”, one should be able to compress it in the spirit of “2*abcdefg.”.) While this seems intuitive enough, we will see that idempotence is not so simple to achieve in practice.
The question remains whether existing compression algorithms satisfy these axioms, particularly in the domain of large files. While NCD has apparently been quite successful in practice, the majority of applications (see Sect. 1) have been on relatively small files. Notably, music applications [6, 7] used MIDI files rather than the more common, and much larger, MP3 format.
Previous work explored the NCD from a file to itself (which is closely related to the idempotence axiom) for bzip, zlib, and PPMZ on the Calgary Corpus, comprising 14 files, the largest of which is under 1 MB. The following section explores these axioms on a larger and more representative dataset and investigates the practical impact of deviations from normality.
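To make the four axioms concrete, the sketch below measures how far a given compressor deviates from each of them on one triple of strings. It is an illustration under stated assumptions rather than the procedure used in this paper: the function name `axiom_gaps`, the dictionary keys, and the random toy inputs are all hypothetical, and zlib is used only because it ships with Python.

```python
import random
import zlib
from typing import Callable, Dict

Compressor = Callable[[bytes], bytes]


def axiom_gaps(x: bytes, y: bytes, z: bytes,
               compress: Compressor = zlib.compress) -> Dict[str, int]:
    """Deviations, in bytes, from the four normal-compressor axioms."""

    def c(s: bytes) -> int:
        return len(compress(s))

    return {
        # |C(XX)| - |C(X)|: zero for a perfectly idempotent compressor
        "idempotence": c(x + x) - c(x),
        # |C(XY)| - |C(X)|: non-negative when monotonicity holds
        "monotonicity": c(x + y) - c(x),
        # ||C(XY)| - |C(YX)||: zero when symmetry holds
        "symmetry": abs(c(x + y) - c(y + x)),
        # |C(XY)| + |C(Z)| - |C(XZ)| - |C(YZ)|: non-positive when
        # distributivity holds
        "distributivity": c(x + y) + c(z) - c(x + z) - c(y + z),
    }


def toy_bytes(n: int, seed: int) -> bytes:
    """Deterministic pseudo-random toy input (an assumption, not real data)."""
    return random.Random(seed).getrandbits(8 * n).to_bytes(n, "little")


if __name__ == "__main__":
    x, y, z = (toy_bytes(1024, s) for s in (1, 2, 3))
    for axiom, gap in axiom_gaps(x, y, z).items():
        print(f"{axiom:15s} {gap:+d} bytes")
```

Since the axioms are only required to hold up to an additive \(O(\log n)\) term, small nonzero gaps on short inputs are not in themselves violations; what matters is how the gaps grow with input size.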
3 Application of NCD to large files
3.1 Normality of compression algorithms
The definition of a normal compressor deals with asymptotic behavior, allowing for an \(O(\log (n))\) discrepancy in the axioms of idempotence, monotonicity, symmetry, and distributivity. Thus, in theory, experimental validation (or refutation) of these axioms is not truly feasible – perhaps the behavior changes when the file size is beyond that of the largest file in our experiment. Nonetheless, we endeavor to experimentally explore these axioms more extensively than has been done in prior work.
Idempotence. Figures 1 and 2 show the difference in the sizes of C(X) and C(XX), alongside \(\log (|XX|)\), for a representative subset of files X in the dataset, with C ranging over the compression algorithms bzip2, lzma, PPMZ, and zlib. Indeed, bz2 and zlib quite apparently fail the idempotence axiom: |C(XX)| grows much faster than |C(X)|, and a term of \(O(\log (|XX|))\) is unable to put a dent in the difference. While PPMZ and lzma appear significantly better for smaller file sizes, the difference still grows much faster than \(\log (|XX|)\), as is apparent in Fig. 2. We see that lzma makes a large jump around 8 MB (but even before that, its growth is much larger than the \(\log \) function).
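The kind of measurement described above can be approximated with a short script. This is a hedged sketch, not the experiment behind Figs. 1 and 2: it uses pseudo-random toy data instead of the paper's dataset, omits PPMZ (which has no Python standard-library binding), takes the logarithm to base 2 as an assumption, and introduces the hypothetical names `idempotence_gap` and `sweep`.

```python
import bz2
import lzma
import math
import random
import zlib
from typing import Callable, Dict, List, Tuple

COMPRESSORS: Dict[str, Callable[[bytes], bytes]] = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def idempotence_gap(x: bytes, compress: Callable[[bytes], bytes]) -> int:
    """|C(XX)| - |C(X)| in bytes."""
    return len(compress(x + x)) - len(compress(x))


def sweep(sizes: List[int], seed: int = 0) -> List[Tuple[int, float, Dict[str, int]]]:
    """For each size n, return (n, log2(|XX|), {compressor: gap})."""
    rows = []
    for n in sizes:
        x = random.Random(seed).getrandbits(8 * n).to_bytes(n, "little")
        gaps = {name: idempotence_gap(x, comp)
                for name, comp in COMPRESSORS.items()}
        rows.append((n, math.log2(2 * n), gaps))
    return rows


if __name__ == "__main__":
    for n, log_term, gaps in sweep([1 << 10, 1 << 14, 1 << 16, 1 << 18]):
        print(f"|X|={n:>7d}  log2|XX|={log_term:5.1f}  {gaps}")
```

On incompressible input, a gap close to |X| means the compressor gained essentially nothing from the second copy, which is the failure mode the idempotence axiom rules out.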
Our experiments have shown serious violation of the idempotence axiom that has been used to prove theoretical properties of NCD, leaving a potential gap between theory and practice. The next section explores the extent to which NCD can be useful in spite of this gap.
3.2 Classification using NCD with abnormal compressors
We have demonstrated that none of the compression algorithms we explored satisfy the requirements for normal compression. The question remains whether this contraindicates their use with NCD. As mentioned above, much previous work has demonstrated NCD’s utility with some of these compression algorithms in applications with small file sizes. However, the compressors’ deviation from normality grows with file size. Do they remain useful with larger files?
To address this question, we explored the accuracy of NCD in identifying the malware family of APK files from the Android Malware Genome Project dataset [24, 25]. In particular, we took a subset of 500 samples from the Geinimi, DroidKungFu3, DroidKungFu4, and GoldDream families.\(^2\) We used the complete raw APK files, without modification, as our samples. Geinimi samples in this dataset have size up to 14.1 MB, DroidKungFu3 up to 15.4 MB, DroidKungFu4 up to 11.2 MB, and GoldDream up to 6.4 MB.
To demonstrate the relevance of file size, we performed the same test with one slight change, this time using only reference samples smaller than 200 KB. We saw drastic improvement with bz2 (now 75.4 %), lzma (82.5 %), and PPMZ (66.7 %), while zlib’s performance actually got worse (29.2 %).
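The nearest-neighbor classification used in these tests can be sketched as follows. This is a minimal illustration assuming the standard NCD definition and zlib as the compressor; the names `classify_1nn`, `rand_bytes`, and `mutate`, and the synthetic "families" built from random bytes, are hypothetical stand-ins and not the APK samples or code used in the experiments.

```python
import random
import zlib
from typing import Callable, Sequence, Tuple

Compressor = Callable[[bytes], bytes]


def ncd(x: bytes, y: bytes, compress: Compressor = zlib.compress) -> float:
    """Standard NCD computed from compressed lengths."""
    cx, cy = len(compress(x)), len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify_1nn(sample: bytes,
                 references: Sequence[Tuple[bytes, str]],
                 compress: Compressor = zlib.compress) -> str:
    """Return the family label of the reference nearest to sample under NCD."""
    # min() raises ValueError if references is empty
    _, label = min(references, key=lambda ref: ncd(sample, ref[0], compress))
    return label


def rand_bytes(n: int, seed: int) -> bytes:
    """Deterministic pseudo-random toy data."""
    return random.Random(seed).getrandbits(8 * n).to_bytes(n, "little")


def mutate(base: bytes, seed: int, span: int = 64) -> bytes:
    """Toy 'variant': overwrite one span of base with fresh random bytes."""
    rng = random.Random(seed)
    pos = rng.randrange(0, len(base) - span)
    patch = rng.getrandbits(8 * span).to_bytes(span, "little")
    return base[:pos] + patch + base[pos + span:]


if __name__ == "__main__":
    family_a, family_b = rand_bytes(2048, 1), rand_bytes(2048, 2)
    refs = [(mutate(family_a, 10), "A"), (mutate(family_b, 11), "B")]
    print(classify_1nn(mutate(family_a, 12), refs))
```

Restricting the reference set by size, as in the second test, amounts to filtering `references` before the call, for example keeping only entries whose byte length is below a chosen threshold.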
4 Adapting NCD to handle large files
We saw in Sect. 3.2 that NCD has widely varying performance on large files, depending on the compression algorithm used. The memory limitations of the algorithm are key here. The major hurdle is to effectively use information from string X for the compression of string Y in computing C(XY). Algorithms like bz2 and zlib have an explicit block size as a limiting factor; if \(|X| > \) block_size, then there is no hope of benefiting from any similarity between X and Y. In contrast, lzma doesn’t have a block size limitation, but instead has a finite dictionary size; as it processes its input, the dictionary grows. Once the dictionary is full, it is erased and the algorithm starts with an empty dictionary at whatever point it has reached in its input. Again, if this occurs before reaching the start of Y, hope of detecting any similarity between X and Y is lost. Likewise, even if X is small, but Y is large, with the portion of Y that is similar to X appearing well into Y, the similarity can’t be detected.
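The effect of a bounded window, and the general idea of replacing plain concatenation with a different combining function, can be illustrated with zlib, whose DEFLATE format limits back-references to 32 KB. The sketch below is an assumption-laden illustration: the `interleave` function, which alternates fixed-size blocks of X and Y, is a hypothetical combining function chosen for simplicity and is not claimed to be either of the variants evaluated in this paper; `ncd_with`, `concat`, and `rand_bytes` are likewise illustrative names.

```python
import random
import zlib
from typing import Callable

Compressor = Callable[[bytes], bytes]
Combiner = Callable[[bytes, bytes], bytes]


def concat(x: bytes, y: bytes) -> bytes:
    """Plain concatenation XY, as in the standard NCD."""
    return x + y


def interleave(x: bytes, y: bytes, block_size: int) -> bytes:
    """Hypothetical combiner: alternate block_size-byte blocks of x and y."""
    out = bytearray()
    for i in range(0, max(len(x), len(y)), block_size):
        out += x[i:i + block_size]
        out += y[i:i + block_size]
    return bytes(out)


def ncd_with(combine: Combiner, x: bytes, y: bytes,
             compress: Compressor = zlib.compress) -> float:
    """NCD in which |C(XY)| is replaced by |C(combine(X, Y))|."""
    cx, cy = len(compress(x)), len(compress(y))
    cxy = len(compress(combine(x, y)))
    return (cxy - min(cx, cy)) / max(cx, cy)


def rand_bytes(n: int, seed: int) -> bytes:
    """Deterministic pseudo-random toy data."""
    return random.Random(seed).getrandbits(8 * n).to_bytes(n, "little")


if __name__ == "__main__":
    x = rand_bytes(100_000, seed=0)  # larger than zlib's 32 KB window
    print("concatenated:", round(ncd_with(concat, x, x), 3))
    print("interleaved :", round(
        ncd_with(lambda a, b: interleave(a, b, 8192), x, x), 3))
```

With concatenation, the second copy of X lies entirely outside the window when it is reached, so identical inputs look unrelated; with 8 KB interleaving, each block of Y sits directly after the matching block of X and the redundancy becomes visible. Whether a given block size helps on real files depends on where the shared content sits within each file.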
Comparison of performance of different combining functions with NCD in a 1-NN classifier for Android malware family identification, with varying block sizes (block sizes in thousands of KB)
Comparison of performance of different combining functions with NCD in a 1-NN classifier for Windows malware family identification, with varying block sizes (block sizes in thousands of KB)
4.1 NCD adaptation results in malware classification
We repeated this experiment with 500 samples from the Lollipop, Kelihos_ver3, and Gatak Windows malware families, from Microsoft’s Kaggle BIG 2015 Malware Classification Challenge dataset.\(^4\) These samples consisted of Windows binaries with their headers removed. (We did not use the disassembly files that were also included in the data set.) Note that these files, all under 4 MB, are not as large as the Android malware files. As shown in Fig. 10 and Table 2, our techniques boosted zlib from 50.5 % accuracy to 83.9 %, PPMZ from 89.1 to 89.5 %, and lzma from 90.7 to 90.9 %, but offered no improvement with bz2. (Note that the y-axis in Fig. 10 has been truncated in an attempt to allow small differences to be visible.) With the smaller size of these files, and with all but zlib doing reasonably well with the standard NCD to begin with, it is not surprising that these improvements are less dramatic than the results with the larger Android malware files.
5 Conclusion and future directions
We have demonstrated that several compression algorithms, lzma, bz2, zlib, and PPMZ, apparently fail to satisfy the properties of a normal compressor, and explored the implications of this on their capabilities for classifying malware with NCD. More generally, we have shown that file size is a factor that hampers the performance of NCD with these compression algorithms. Specifically, we found that lzma performs best on this classification task when files are large (at least in the range we explored), but that bz2 performs best when files are sufficiently small. We have also found zlib to generally not be useful for this task. PPMZ, in spite of being the top performer in terms of idempotence, did not come close to the most accurate compressor in any case. Finally, we introduced two simple file combination techniques that improve the performance of NCD on large files with each of these compression algorithms.
However, the challenges of choosing the optimal compression algorithm and the optimal combination technique (and parameters therefor) remain. For supervised classification applications, it is easy enough to use a test set to aid in the selection of the technique and block size parameter for the relevant domain. However, for clustering or genealogy tasks, the burden remains to study several resulting clusterings or hierarchies to determine which is most appropriate.
It remains for future work to better understand what properties of a data set make it more or less amenable to the different compression algorithms and different combination techniques and parameters.
Nonetheless, these techniques offer enhanced NCD performance in malware classification (as well as other tasks) with large files, and suggest that further research in this direction is worth pursuing.
1. These are standard corpora for the evaluation of compression algorithms and are available at http://www.data-compression.info/Corpora/.
2. We selected these families due to their containing enough samples to allow for a meaningful test, and containing large enough files to challenge the compressors.
3. For readers unfamiliar with nearest neighbor classification, specifically we classified a “test” sample by looking at the distance between it and each of the “reference” samples, and selecting the family of the nearest (i.e. most similar) reference sample.
4. Although use of Kaggle datasets is normally restricted to the corresponding competition, Microsoft has granted permission for this dataset to be used for academic work.
The author thanks her colleagues at CyberPoint Labs, Mark Raugas, Mike West, Charlie Cabot, James Ulrich, David Ritch, Elizabeth Hughes, and Ian Blumenfeld, for their enthusiastic support and helpful input at various stages of this work.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.