Computational analysis of the SARS-CoV-2 and other viruses based on the Kolmogorov’s complexity and Shannon’s information theories


This paper tackles the information of 133 RNA viruses available in public databases under the light of several mathematical and computational tools. First, the formal concepts of distance metrics, Kolmogorov complexity and Shannon information are recalled. Second, the computational tools available presently for tackling and visualizing patterns embedded in datasets, such as the hierarchical clustering and the multidimensional scaling, are discussed. The synergies of the common application of the mathematical and computational resources are then used for exploring the RNA data, cross-evaluating the normalized compression distance, entropy and Jensen–Shannon divergence, versus representations in two and three dimensions. The results of these different perspectives give extra light in what concerns the relations between the distinct RNA viruses.


In December 2019, a mysterious pneumonia with unknown etiology was reported in the city of Wuhan, Province of Hubei, China [1]. The International Committee on Taxonomy of Virus (ICTV) named the virus as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [2]. Globally, to the 30th of July 2020, according to the World Health Organization Coronavirus Disease 2019 (COVID-19) Situation Report, 10,185, 374 confirmed cases of COVID-19 are reported, resulting in 503,862 deaths.

The scientific community reacted as never before, and many researchers focused on this urgent topic [3,4,5,6,7,8,9,10]. The mathematical and computer science communities are also studying this challenging problem, and we testimony the recent emergence of new models and algorithmic approaches. This common multidisciplinary research will allow the human kind to implement a robust and fast response to a problem that is causing a severe global socioeconomic disruption, and most probably will lead to the largest global recession since the Great Depression.

The present paper follows this trend by studying the genetic information by means of mathematical and computational tools. The starting point is the information encoded in the ribonucleic acid (RNA), a polymeric molecule essential in various biological roles.

The evolutionary origin and divergence of eucaryotes is mostly recoverable from their genetic relationships. The phylogeny of core genes, such as those for ribosomal proteins, provides a reasonable representation of many billions of years [11]. Unfortunately, the diversity of viruses prevents such a reconstruction of virus evolutionary histories as they lack any equivalent set of universally conserved genes on which to construct a phylogeny. Viral diversity is far greater than that of other organisms, with significant differences in their genetic material, RNA or deoxyribonucleic acid (DNA), and configurations (double or single stranded), as well as the orientation of their encoded gene. The smallest virus genomes [12] contain over 2,500 genes [13]. The RNA is a nucleic acid that is essential to all forms of life, and it is found in nature generally as single-stranded (ss) rather than a paired double-stranded (ds) like DNA. In DNA, there are four bases: adenine (A) complementary to thymine (T) and cytosine (C) complementary to guanine (G). In the RNA, uracil (U) is used instead of thymine.

Like DNA, RNA can carry genetic information. An RNA virus is a virus that has RNA as its genetic material encoding a number of proteins. This nucleic acid is usually single-stranded RNA (ssRNA), but there are double-stranded RNA (dsRNA) viruses. These viruses exploit the presence of RNA-dependent RNA polymerases in eucaryotes cells for replication of their genomes. Many human diseases are caused by such RNA viruses such as influenza, severe acute respiratory syndrome (SARS), COVID-19, Ebola disease virus, chikungunya, Zika virus, influenza B and Lassa virus [14].

The RNA is usually sequenced indirectly by copying it into complementary DNA (cDNA), which is often amplified and subsequently analyzed using a variety of DNA sequencing methods. Therefore, the genomic sequences of the RNA viruses are published presenting the four bases, namely adenine (A), cytosine (C), guanine (G) and thymine (T).

This genetic information is analyzed by means of the Kolmogorov’s complexity and Shannon’s information theories. In the first case, the so-called normalized information distance (NID) is adopted. In the second, a statistical approach is considered, by constructing histograms for the relative frequency of three consecutive bases (triplets). The histograms are interpreted under the light of entropy, cumulative residual entropy and Jensen–Shannon divergence. The results obtained for each theory, that is, the values assessing the virus genetic code under the light of the Kolmogorov’s complexity and the Shannon’s information, are further processed by means of advanced computational representation techniques. The final visualization is obtained using the hierarchical clustering (HC) [15,16,17,18,19,20] and multidimensional scaling (MDS) techniques [21,22,23,24,25,26,27]. Three alternative representations, namely dendrograms, hierarchical trees and three-dimensional loci, are considered.

According to the ICTV, human coronavirus belongs to the Betacoronavirus genus, a member of the Coronaviridae family, categorized in the order Nidovirales [28]. It has been categorized into several genera, based on phylogenetic analyses and antigenic criteria, namely: (i) Alphacoronavirus, responsible for gastrointestinal disorders in humans, dogs, pigs and cats; (ii) Betacoronavirus, including the bat coronavirus, the human severe acute respiratory syndrome (SARS) virus, the Middle Eastern respiratory syndrome (MERS) virus and now the SARS-CoV-2; (iii) Gammacoronavirus, which infects avian species; and iv) Deltacoronavirus [2, 29].

Four coronaviruses broadly distributed among humans (229E, OC43, NL63 and, HKU1) frequently cause only common cold symptoms [2, 12]. The two other strains of coronaviruses are linked with deadly diseases and zoonotic in origin, i.e., the SARS-CoV and MERS-CoV [30]. In 2002–2003, there was an outbreak of SARS beginning in Guangdong Province in China and affecting 27 countries subsequently [31]. It was considered the first pandemic event of the twenty-first century, due to the SARS-CoV infecting 8098 individuals with 774 deaths [31]. In 2012, in the Middle East, MERS-CoV caused a severe respiratory disease that affected 2494 individuals and 858 deaths [2]. In both epidemics, bats were the original host of these two coronaviruses [2].

Coronaviruses contain a positive-sense, single-stranded RNA genome. The genome size for coronaviruses ranges from 26.4 to 31.7 kilobases, and it is one of the largest among RNA viruses.

The rapid spread of SARS-CoV-2 raises intriguing questions such as whether its evolution is driven by mutations, recombination or genetic variation [32]. This information is now being applied to the development of diagnostic tools, effective antiviral therapies and in the understanding of viral diseases. Although numerous studies have been done from the biological and medical perspective, to help further understand SARS-CoV-2 and trace its origin, this paper reports the use of multidimensional scale techniques for the finding of the similarities and the relationships among COVID-19 strains themselves and between other described viruses.

For that, we have collected a set of 37 complete genome sequences of SARS-CoV-2 virus obtained in several countries from patients with COVID-19. To help verify whether it is possible to trace the original or intermediate host of SARS-CoV-2, we obtained the genomic sequences of SARS-CoV-2 virus in other hosts, including bats (16 genomic sequences), pangolins (8 genomic sequences) and environment (market of Wuhan) [13]. We have also collected 23 genomic sequences of the coronavirus that cause mild symptoms related to common cold in man (229E, OC43, NL63 and, HKU1). The genomic sequences of SARS-CoV (10 genomic sequences) and a MERS-CoV (13 genomic sequences) were also gathered. For comparison and control, we also obtained sequences of other deadly pathogenic RNA viruses such as Lassa (6 genomic sequences), Ebola (6 genomic sequences), dengue (7 genomic sequences), chikungunya (1 genomic sequence) and influenza B (2 genomic sequences).

The diagram of Fig. 1 summarizes the main historical flow of coronavirus in what concerns the twenty-first-century epidemics.

Fig. 1

Coronavirus related to twenty-first-century epidemics. ICST: The International Committee on Taxonomy of Virus—Coronavirus Study Group, WHO: World Health Organization

To the authors’ best knowledge, this paper analyzes for the first time a large number of viruses associated with the combination of several distinct mathematical and computational tools. In short, we have the cross-comparison of:

  • The genomic sequences of 133 viruses.

  • The data treatment by means of Kolmogorov’s complexity and Shannon’s information theories, using normalized compression distance, entropy, cumulative residual entropy and Jensen–Shannon divergence.

  • Two clustering computational techniques, namely the hierarchical clustering and multidimensional scaling.

  • The visualization of the results by means of dendrograms, trees and point loci.

The results indicate clearly the superior performance of the approaches based on the Kolmogorov’s complexity measure and the MDS three-dimensional visualization. Moreover, the characteristics of the coronaviruses within the large set of tested cases are highlighted. We conclude that:

  • The association between the Kolmogorov perspective and the three-dimensional MDS representation leads to be the best results.

  • The clusters are easily distinguishable and we observe the relation between the new SARS-CoV-2 virus and some CoV found in bats and in the pangolin.

Bearing these ideas in mind, the paper is organized as follows. Section 2 introduces the mathematical and computational tools adopted in the follow-up. Section 3 analyzes the data by means of the complexity and information theories accompanied by the HC and MDS computational visualization resources. Finally, Sect. 4 summarizes the main conclusions.

Fundamental tools

Distance metrics

Evaluating the similarity degree between several objects, each having a number of features requires the definition of a distance. The calculation of a function d of two objects \(x_A\) and \(x_B\) can be interpreted as a distance if \(d(x_A,x_B)\ge 0\) and satisfies the following axioms [33]:

$$\begin{aligned} \begin{aligned}&\hbox {identity}: \, d(x_A,x_B)= 0 \hbox { if } x_A\ = x_B\\&\hbox {symmetry}: \, d(x_A,x_B)= d(x_B,x_A) \\&\hbox {triangle inequality}: \, d(x_A,x_B)\le d(x_A,x_C)\\&\quad + d(x_C,x_A). \end{aligned} \end{aligned}$$

We find in the literature a variety of different functions that can tackle datasets and shed light to distinct characteristics [34]. For a set of numerical vectors \(\left( x_{1},\ldots ,x_{N}\right) ^{T}\in \mathbb {R}^{n}\), the Minkowski norm \(L_n: \left( \sum _{k=1}^{N}\left| x_{k}\right| ^n\right) ^{\frac{1}{n}}\), and in particular the Manhattan and Euclidean cases, \(L_1\) and \(L_2\) for \(n=1\) and \(n=2\), are often used [35]. In the case of DNA analysis, these norms support different algorithms [36,37,38] that allow the comparison of data sequences. We can also mention metrics such as the Chi-square [39], Hamming [40] and edit [41] distances. Nonetheless, the selection of the optimal distances for a specific application poses relevant challenges [42,43,44,45]. In fact, the adoption of a given metric often depends on the user experience that performs several numerical experiments before selecting one or more distances. Due to these issues, assessing the similarity of several objects is not a straightforward process and we can adopt both non-probabilistic and probabilistic information measures to obtain distinct perspectives between object similarities.

Kolmogorov complexity theory

The Kolmogorov complexity, or algorithmic entropy, addresses the information measurement without relying on probabilistic assumptions. The information measurement focuses an individual finite object described by a string and takes into consideration that ‘regular’ strings are compressible [46]. The Kolmogorov complexity of a string x, denoted as K(x), can be defined as the length of the shortest binary program that, given an empty string \(\psi \) at its input, can compute x on its output in a universal computer, and then halts. Therefore, K(x) can be interpreted as the length of the ultimate compressed version of x.

The information distance of two strings (or files) \(\lbrace x_A,x_B \rbrace \in \varSigma \), where \(\varSigma \) denotes the space of the objects, can also be computed by means of the conditional Kolmogorov complexity \(K(x_A|x_B)\) [47, 48]. This concept can be read as the length of the shortest program to obtain \(x_A\), if \(x_B\) is provided for input. In heuristic terms, if the two strings are more/less similar, then the calculation is less/more difficult and, consequently, the size of the program is smaller/larger. Therefore, the following inequality always holds

$$\begin{aligned} K(x_A|x_B) \le K(x_A). \end{aligned}$$

Under the light of these concepts, the universal distance metric [47] denoted as normalized information distance (NID):

$$\begin{aligned}&NID(x_A,x_B)\nonumber \\&\quad = \frac{\max \lbrace K(x_A|x_B),K(x_B|x_A)\rbrace }{\max \lbrace K(x_A),K(x_B)\rbrace }, \ \ \lbrace x_A,x_B \rbrace \in \varSigma , \end{aligned}$$

was proposed.

From equation (2), we have that the NID may take values in the range [0, 1]. Moreover, we have \(NID(x_A,x_A)\approx 0\) and \(NID(x_A,\psi )\approx 1\), where \(\psi \) is an empty object that has no similarity to \(x_A\). It is shown [47] that the NID is a distance because it satisfies the axioms defined in (1), up to some additive precision, but is non-computable [47]. In spite of this limitation, the NID gives the basis for the so-called normalized compression distance (NCD), which is a computable distance [45]. The computability comes with the cost of using a good approximation of the Kolmogorov complexity by a standard compressor \(C(\cdot )\) (interested readers for the discussion between the equivalence of the NID and the NCD can find further details in [33]).

The NCD is given by:

$$\begin{aligned} NCD(x_A,x_B) = \frac{C(x_Ax_B)-\min \lbrace C(x_A),C(x_B)\rbrace }{\max \lbrace C(x_A),C(x_B) \rbrace }. \end{aligned}$$

The NCD has values in the range \(0<NCD(x,y)<1+ \epsilon \), assessing the distance between the files \(x_A\) and \(x_B\). The parameter \(\epsilon > 0\) reflects ‘imperfections’ in the compression algorithm. The values of \(C(x_A)\) and \(C(x_B)\) are the sizes of each of the compressed files \(x_A\) and \(x_B\), respectively, and \(C(x_Ax_B)\) is the compressed size of the two concatenated files considered by the compressor as a single file.

Let us consider, for example, that \(C(x_B)\ge C(x_A)\). Expression (4) says that the distance \(NCD(x_A,x_B)\) assesses the improvement due to compressing \(x_B\) using data from the previously compressed \(x_A\) and compressing \(x_B\) from scratch, expressed as the ratio between their compressed sizes.

Obviously, the approximation of the NID by means of the NCD poses operational obstacles. Due to the non-computability of the Kolmogorov complexity, we cannot predict how close is the NCD to the real value of the NID, and the approximation may yield arithmetic problems, particularly in the case of small strings where numerical indeterminate forms may arise [33, 49]. Moreover, the compressor (as an approximation of the Kolmogorov complexity) must be ‘normal’ in the sense that given the object \(x_Ax_A\) the compressor C should produce an object with almost to an identical size to the compressed version of \(x_A\). This is a limitation for the universality of the NCD since in specific applications the best performing loss-less algorithms (e.g., JBIG, JPEG2000 and JPEG-LS in image compression) do not satisfy such propriety [50]. Nevertheless, key results were already obtained using the NCD in image distinguishability [51], image OCR [52], malware recognition [53] and genomic analysis [54, 55].

Shannon information theory

Information theory [56] has been applied in a variety of scientific areas. The most fundamental piece of the theory is the information content I of a given event \(x_i\) having probability of occurrence \(P\left( X=x_{i}\right) \):

$$\begin{aligned} I \left( X=x_{i}\right) =-\ln \left( P\left( X=x_{i}\right) \right) , \end{aligned}$$

where X is a discrete random variable.

The expected value of the information, the so-called Shannon entropy [57, 58], is given by:

$$\begin{aligned} H\left( X\right)= & {} E\left( -\ln \left( X\right) \right) \nonumber \\= & {} { -\sum _{i} P\left( X=x_{i}\right) \ln \left( P\left( X=x_{i}\right) \right) } , \end{aligned}$$

where \(E\left( \cdot \right) \) denotes the expected value operator.

Expression (6) obeys the four Khinchin axioms [59, 60]. Several generalizations of Shannon entropy have been proposed, relaxing some of those axioms, and we can mention the Rényi, Tsallis and generalized entropy [61,62,63], just to name a few.

A recent and interesting concept is the cumulative residual entropies \(\varepsilon \) given by [64, 65]:

$$\begin{aligned} \varepsilon \left( X\right) ={ -\sum _{i}P\left( X> x_{i}\right) \log \left( P\left( X > x_{i}\right) \right) }. \end{aligned}$$

Within the scope of information theory, we can formulate also the concept of distance discussed in Sect. 2.1. The Kullback–Leibler divergence between the probability distributions \(X_1\) and \(X_2\) is defined as [34, 66,67,68,69]:

$$\begin{aligned} D_{KL}\left( X_1\parallel X_2 \right) ={ \sum _{i}P\left( X_1 = x_{i}\right) \ln \left( \frac{P\left( X_1=x_{i}\right) }{P\left( X_2=x_{i}\right) } \right) }. \end{aligned}$$

From this, we obtain the Jensen–Shannon divergence, \(JSD\left( X_1 \parallel X_2 \right) \), or distance, given by:

$$\begin{aligned}&JSD\left( X_1\parallel X_2\right) \nonumber \\&\quad =\frac{1}{2}\left[ D_{KL}\left( X_1\parallel X_{12} \right) +D_{KL}\left( X_2\parallel X_{12} \right) \right] , \end{aligned}$$

where \(X_{12}=\frac{1}{2}\left( X_1+X_2 \right) \).

Alternatively, the \(JSD\left( X_1\parallel X_2\right) \) can be calculated as:

$$\begin{aligned} JSD\left( X_1\parallel X_2\right)= & {} \frac{1}{2} \sum _{i} P\left( X_1 = x_{i}\right) \ln \left( P\left( X_1=x_{i}\right) \right) \nonumber \\&+\frac{1}{2} \sum _{i} P\left( X_2 = x_{i}\right) \ln \left( P\left( X_2=x_{i}\right) \right) \nonumber \\&- \sum _{i} P\left( X_{12} = x_{i}\right) \ln \left( P\left( X_{12}=x_{i}\right) \right) . \end{aligned}$$

Hierarchical clustering, multidimensional scaling and visualization

The HC is a computational technique that assesses a group of N objects \(X_i\), \(i =1, \cdots , N\), in a q-dimensional space, and tries to rearrange them in a visual structure with objects \(Y_i\) highlighting the main similarities between them in the sense of some predefined metric [70].

Let us consider N objects defined in a q-dimensional real-valued space and a distance (or dissimilarity) measure \(\delta _{ij}\) between pairs of objects. The first step is to calculate a \(N \times N\)-dimensional matrix, \(\varDelta =[\delta _{ij}]\), where \(\delta _{ij}\in \mathbb {R^{+}}\) for \(i\ne j\) and \(\delta _{ii}=0\), \((i,j)=1, \cdots , N\), stands for the object to object distances [71]. The HC uses the input information in matrix \(\varDelta \) and produces a graphical representation consisting in a dendrogram or a hierarchical tree.

The so-called agglomerative or divisive clustering iterative techniques are usually adopted for processing the information. In the first approach, each object starts in its own cluster and the computational iterations merge the most similar items until having just one cluster. In the second, all objects start their own cluster and the computational iterations separate items, until each object has his own cluster. For both approaches, the numerical iterations follow a linkage criterion, based on the distances between pairs, for quantifying the dissimilarity between clusters. The maximum, minimum and average linkages are possible criteria. The distance \(d\left( x_{A},x_{B}\right) \) between two objects \(x_{A} \in A\) and \(x_{B} \in B\) can be assessed by means of several metrics such as the average linkage given by [72]:

$$\begin{aligned} d_{av}\left( A,B\right) =\frac{1}{\left\| A\right\| \left\| B\right\| }{ \sum _{x_{A}\in A,x_{B}\in B}d\left( x_{A},x_{B}\right) }. \end{aligned}$$

The clustering quality can be assessed by means of the cophenetic correlation [73]:

$$\begin{aligned} c =\frac{{ \sum _{i<j}\left( x(i,j)-\bar{x}\right) \left( y(i,j)-\bar{y}\right) }}{\sqrt{\left[ { \sum _{i<j}\left( x(i,j)-\bar{x}\right) ^{2}}\right] \left[ { \sum _{i<j}\left( y(i,j)-\bar{y}\right) ^{2}}\right] }}, \end{aligned}$$

where x(ij) and y(ij) stand for the distances between the pairs \(X_i\) and \(X_j\), in the initial measurements, and \(Y_i\) and \(Y_j\), in the HC chart, respectively, and \(\bar{x}\) denotes the average of x.

Values of c close to 1 (to 0) indicate a good (weak) cluster representation of the original data. In MATLAB, c is computed by means of the command cophenet.

In this paper, we adopt the agglomerative clustering and the average linkage [74, 75] for tackling the matrix of distances based on the JSD metric (10).

The MDS is also a computational technique for clustering and visualizing multidimensional data [76]. Similarly to what was said for the HC, the input of the MDS is the matrix \(\varDelta =[\delta _{ij}]\), \((i,j)=1, \cdots , N\). The MDS is to adopt points for representing the objects in a d-dimensional space, with \(d < q\), that try to reproduce the original distances, \(\delta _{ij}\). The MDS performs a series of numerical iterations rearranging the points in order to optimize a given cost function called stress S. We have, for example, the residual sum of squares and the Sammon criteria:

$$\begin{aligned} S= & {} \left[ \sum _{i<j}\left( \phi _{ij}-\delta _{ij}\right) ^{2}\right] ^{\frac{1}{2}},\end{aligned}$$
$$\begin{aligned} S= & {} \left[ \frac{ \sum _{i<j}\left( \phi _{ij}-\delta _{ij}\right) ^{2} }{ \sum _{i<j}\phi _{ij}^{2}} \right] ^{\frac{1}{2}}. \end{aligned}$$

The resulting MDS points have coordinates that produce a symmetric matrix \(\varPhi =[\phi _{ij}]\) of distances that approximate the original one \(\varDelta =[\delta _{ij}]\). In MATLAB, the commands cmdscale and Sammon can be adopted for the classical MDS and the Sammon stress criterion.

The interpretation of the MDS locus is based on the patterns of points [77, 78]. Therefore, objects with strong (weak) similarities are represented by fairly close (distant) points. The MDS locus is interpreted on the relative positions of the point coordinates. So, the absolute position of the points or the shape of the clusters has usually a special meaning, and we can magnify, translate and rotate the locus for achieving a good visualization. In the same line of reasoning, the axes of the plot do not have units or physical meaning. The quality of the produced MDS locus can be evaluated using the stress plot and the Shepard diagram. The stress plot shows S versus d and decreases monotonically. Usually, low values of d are adopted, and present computational resources allow a direct three-dimensional representation, but often some rotations and magnification are required to achieve the best visual perspective. The Shepard diagram plots \(\phi _{ij}\) against \(\delta _{ij}\) for a given value of d, and a narrow (large) scatter of points indicates a good (weak) fit between the original and the reproduced distances.

The dataset

The information of 133 publicly released genomic sequences was collected in the Global Initiative on Sharing Avian Influenza Data (GISAID) and the GenBank of the National Center for Biotechnology Information (NCBI) databases (, The information regarding the sequences and serial numbers is given in Table of “Appendix.”

The mathematical tools of Kolmogorov and Shannon theories are used to compare and extract relationships among the data and to identify viral genomic patterns. The viral genomes are analyzed from the perspective of dynamical systems using HC and MDS. Dendrograms and trees are generated by HC algorithms, and a three-dimensional visualization through MDS visualization. Several clusters with medical and epidemiological interest were visualized. The MDS loci provide a key visualization of the relation of SARS-CoV-2 with the other known coronavirus affecting humans.

The several non-coronavirus pathogenic viruses analyzed are very well delimited in several independent and easily delineated clusters (Zika, Chikungunya, Dengue, Lassa, Ebola, Influenza B).

The several types of coronaviruses that cause mild disease (common flu) are very well separated from the other coronavirus clusters. Interestingly, the types, 229E, HkU1, NL63 and OC43, form separated mini-clusters within a big well-defined cluster

There are two well-defined CoV-19 or SARS-CoV clusters. The two environmental SARS-CoV-2 were aggregated with the human SARS-CoV-2. To differentiate the human MERS-CoV, a zoom was made to isolate it forming a separate cluster. The MERS-CoV obtained from camels was included in this cluster. SARS-CoV formed another independent and well-segregated cluster. The CoV from bats and pangolin is dispersed among these several clusters of coronavirus forming sometimes clusters of few elements. Note that the bat CoV RaTG13 is near to one of the SARS-CoV clusters. This coronavirus is the closest known relative of SARS-CoV-2 [79].

For processing the RNA information consisting of ASCII files with the four nitrogenous bases, we consider two approaches. The first one follows the Kolmogorov complexity theory described in Sect. 2.2. Therefore, we consider the compressors zlib and bz2 (see and followed by the NCD distance (4). Nonetheless, the two algorithms give almost identical results, and therefore, only the zlib is considered in the follow-up.

The application of the NCD produces a symmetrical matrix \(\varDelta \), \(133 \times 133\) dimensional that is tackled and visualized by means of a dendrogram and a HC tree obtained by the program Phylip and MDS using MATLAB The corresponding charts are represented in Figs. 23 and 4.

Fig. 2

Dendrogram for the set of 133 viruses using the NCD and zlib based on the Kolmogorov complexity theory

Fig. 3

HC tree for the set of 133 viruses using the NCD and zlib based on the Kolmogorov complexity theory

Fig. 4

MDS three-dimensional locus for the set of 133 viruses using the NCD and zlib based on the Kolmogorov complexity theory with the cluster of SARS-CoV-2 connected by a line

Fig. 5

MDS three-dimensional locus for the set of 133 viruses using the NCD and zlib based on the Kolmogorov complexity theory, without point labels and the cluster of SARS-CoV-2 connected by a line

It is known that the three-dimensional plots require some rotation in the computer screen. Therefore, Fig. 5 shows the MDS locus in a different perspective, without point labels and the cluster of SARS-CoV-2 connected by a line.

The second approach follows the Shannon information theory described in Sect. 2.3. Therefore, we start by considering non-overlapping (codon or anticodon) triplets of bases and the histograms of relative frequency for each virus. In a second phase, after having the histogram for the complete set of virus, we process them using entropy concepts. Before continuing, we must clarify that we tested the adoption of n-tuples, with \(n=1,2,3,4\), in the DNA information analysis of a large set of superior species such as mammals, where the genetic information is considerably larger and the construction of histograms for large values of n does not pose problems of statistical significance (since the number of histograms bins grows as \(4^n\)). It was observed that \(n=3\) is a ‘good value,’ since \(n=1\) leads to poor results, \(n=2\) improves things but is still insufficient, while \(n=4\) gives almost identical results to \(n=3\).

In a first experiment, we consider the Shannon entropy H vs fractional cumulative residual entropy \(\varepsilon \) as possible descriptors of the 133 histograms. Figure 6 shows the resulting two-dimensional plot, that is, the Shannon entropy H vs the fractional cumulative residual entropy \(\varepsilon \), for the set of 133 viruses. We observe some separation between virus, but we have some difficulties due to the high density of points in some areas.

Fig. 6

Shannon entropy H vs fractional cumulative residual entropy \(\varepsilon \) for the set of 133 viruses, with the cluster of SARS-CoV-2 connected by a line

We now try the other computational techniques introduced in Sect. 2. In the follow-up, we consider identical input information, that is, the JDS and the corresponding matrix \(\varDelta \) for the set of 133 viruses, and we compare the distinct computational clustering and visualization techniques.

Figures 78 and 9 show the dendrogram, the HC tree and the three-dimensional MDS locus, respectively. The dendrogram is the simplest representation and it is straightforward to interpret. However, this technique does not take full advantage of the space. The HC tree uses more efficiently the two-dimensional space, but we now have more difficulties in reading the most dense clusters. The three-dimensional MDS takes advantage of the computational representation, but requires some rotation/shift/magnification in the computer screen.

Successive magnifications can be done and are necessary to achieve a more distinct visualization. Therefore, we can say that all representations have their own pros and cons, although the three-dimensional MDS is, a priori, the most efficient method.

Fig. 7

Dendrogram for the set of 133 viruses using the JSD based on the Shannon information theory

Fig. 8

HC tree for the set of 133 viruses using the JSD based on the Shannon information theory

Fig. 9

MDS three-dimensional locus for the set of 133 viruses using the JSD based on the Shannon information theory, with the cluster of SARS-CoV-2 connected by a line

Fig. 10

MDS three-dimensional locus for the set of 133 viruses using the JSD based on the Shannon information theory, without point labels and the cluster of SARS-CoV-2 connected by a line

Since the three-dimensional plot requires some rotation in the computer screen, Fig. 10 shows the MDS locus in a different perspective, without point labels and the cluster of SARS-CoV-2 connected by a line.

The MDS 3D following the Kolmogorov method isolated the several groups of virus in several extended groups. The cluster of SARS-CoV-2 is very extended. Note that the SARS-CoV-2 virus found in the market of Wuhan is very near of the SARS-CoV-2 cluster. The CoV found in the pangolin also forms a cluster mixing with the upper part of the SARS-CoV-2 cluster. Bat CoV virus forms also a diffuse cluster lacking some contiguity. The bat CoV Yunnan RaTG13 is very near the SARS-CoV-2 cluster. There is a big cluster formed by the different CoV that causes a mild respiratory disease (common cool).

Other pathogenic RNA virus clusters, i.e., Lassa virus, Zika, influenza B and Ebola, can be easily separated from the CoV clusters. However, with this methodology, the cluster formed by the dengue virus is not very far, in some views, from a part of the cluster formed by the CoV that causes mild respiratory disease.

The SARS CoV virus forms another independent cluster. MERS-CoV from humans forms another extended cluster and in the vicinity of the CoV virus from camel.


This paper explored the information of 133 RNA viruses available in public databases. For that purpose, several mathematical and computational tools were applied. On the one hand, the concepts of distance and the theories developed by Kolmogorov and Shannon for assessing complexity and information were recalled. The first involved the normalized compression distance and the zlib compressor for the DNA. The second understood the use of histograms of triplets of bases and their assessment through entropy, cumulative residual entropy and Jensen–Shannon divergence. On the other hand, the advantages of hierarchical clustering and multidimensional scaling algorithmic techniques were also discussed. Representations in two and three dimensions, namely by dendrograms and trees, and loci of points or points and lines, respectively, were compared. The results revealed their pros and cons for the specific application of the set of viruses under comparison.

The MDS 3D in the Kolmogorov perspective leads to be the best visualization method. We have not only the clusters easily distinguishable, but we find also the relation between the new SARS-CoV-2 virus and some CoV found in bats (the primary host of the virus) and in the pangolin, the likely intermediate host. The SARS-CoV-2 found in the environment, namely the Market of Wuhan where the epidemic probably started, is indistinguishable from the SARS-CoV-2 found in humans.

This type of methodology may help to study how an animal virus jumped the boundaries of species to infect humans, and pinpoint its origin, as this knowledge can help to prevent future zoonotic events [80]. The statistical and computational techniques allow different perspectives over viral diseases that may be used to grasp the dynamics of the emerging COVID-19 disease. These methodologies [79] may help interpreting future viral outbreaks and to provide additional information concerning these infectious agents and understand the dynamics of viral diseases.


  1. 1.

    Zhu, N., Zhang, D., Wang, W., Li, X., Yang, B., Song, J., Zhao, X., Huang, B., Shi, W., Lu, R., Niu, P., Zhan, F., Ma, X., Wang, D., Xu, W., Wu, G., Gao, G.F., Tan, W.: A novel coronavirus from patients with pneumonia in China, 2019. N. Engl. J. Med. 382(8), 727–733 (2020).

    Article  Google Scholar 

  2. 2.

    ur Rehman, S., Shafique, L., Ihsan, A., Liu, Q.: Evolutionary trajectory for the emergence of novel coronavirus SARS-CoV-2. Pathogens 9(3), 240 (2020).

    Article  Google Scholar 

  3. 3.

    Kandeil, A., Shehata, M.M., Shesheny, R.E., Gomaa, M.R., Ali, M.A., Kayali, G.: Complete genome sequence of middle east respiratory syndrome coronavirus isolated from a dromedary camel in Egypt. Genome Announc. (2016).

    Article  Google Scholar 

  4. 4.

    Kucharski, A.J., Russell, T.W., Diamond, C., Liu, Y., Edmunds, J., Funk, S., Eggo, R.M., Sun, F., Jit, M., Munday, J.D., Davies, N., Gimma, A., van Zandvoort, K., Gibbs, H., Hellewell, J., Jarvis, C.I., Clifford, S., Quilty, B.J., Bosse, N.I., Abbott, S., Klepac, P., Flasche, S.: Early dynamics of transmission and control of COVID-19: a mathematical modelling study. Lancet Infect. Dis. (2020).

    Article  Google Scholar 

  5. 5.

    Lam, T.T.Y., Shum, M.H.H., Zhu, H.C., Tong, Y.G., Ni, X.B., Liao, Y.S., Wei, W., Cheung, W.Y.M., Li, W.J., Li, L.F., Leung, G.M., Holmes, E.C., Hu, Y.L., Guan, Y.: Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins. Nature (2020).

    Article  Google Scholar 

  6. 6.

    Kissler, S.M., Tedijanto, C., Goldstein, E., Grad, Y.H., Lipsitch, M.: Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period. Science (2020).

    Article  Google Scholar 

  7. 7.

    Li, C., Yang, Y., Ren, L.: Genetic evolution analysis of 2019 novel coronavirus and coronavirus from other species. Infect. Genet. Evol. 82, 104285 (2020).

    Article  Google Scholar 

  8. 8.

    Peng, L., Yang, W., Zhang, D., Zhuge, C., Hong, L.: Epidemic analysis of COVID-19 in china by dynamical modeling. BMJ (2020).

    Article  Google Scholar 

  9. 9.

    Qiang, X.L., Xu, P., Fang, G., Liu, W.B., Kou, Z.: Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus. Infect. Dis. Poverty (2020).

    Article  Google Scholar 

  10. 10.

    Liu, Y., Liu, B., Cui, J., Wang, Z., Shen, Y., Xu, Y., Yao, K., Guan, Y.: COVID-19 evolves in human hosts (2020).

  11. 11.

    Segata, N., Huttenhower, C.: Toward an efficient method of identifying core genes for evolutionary and functional microbial phylogenies. PLoS ONE 6(9), e24704 (2011).

    Article  Google Scholar 

  12. 12.

    Al-Khannaq, M.N., Ng, K.T., Oong, X.Y., Pang, Y.K., Takebe, Y., Chook, J.B., Hanafi, N.S., Kamarulzaman, A., Tee, K.K.: Molecular epidemiology and evolutionary histories of human coronavirus OC43 and HKU1 among patients with upper respiratory tract infections in Kuala Lumpur, Malaysia. Virol. J. (2016).

    Article  Google Scholar 

  13. 13.

    Abergel, C., Legendre, M., Claverie, J.M.: The rapidly expanding universe of giant viruses: mimivirus, pandoravirus, pithovirus and mollivirus. FEMS Microbiol. Rev. 39(6), 779–796 (2015).

    Article  Google Scholar 

  14. 14.

    Acheson, N.H.: Fundamentals of Molecular Virology. Wiley, New York (2011)

    Google Scholar 

  15. 15.

    Defays, D.: An efficient algorithm for a complete link method. Comput. J. 20(4), 364–366 (1977).

    MathSciNet  Article  MATH  Google Scholar 

  16. 16.

    Székely, G.J., Rizzo, M.L.: Hierarchical clustering via joint between-within distances: extending Ward’s minimum variance method. J. Classif. 22(2), 151–183 (2005).

    MathSciNet  Article  MATH  Google Scholar 

  17. 17.

    Fernández, A., Gómez, S.: Solving non-uniqueness in agglomerative hierarchical clustering using multidendrograms. J. Classif. 25(1), 43–65 (2008).

    MathSciNet  Article  MATH  Google Scholar 

  18. 18.

    Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, 2nd edn. Springer, New York (2009)

    Book  Google Scholar 

  19. 19.

    Lopes, A.M., Machado, J.A.T.: Tidal analysis using time-frequency signal processing and information clustering. Entropy 19(8), 390 (2017).

    Article  Google Scholar 

  20. 20.

    Machado, J.A.T., Lopes, A.: Rare and extreme events: the case of COVID-19 pandemic. Nonlinear Dyn. (2020).

    Article  Google Scholar 

  21. 21.

    Torgerson, W.: Theory and Methods of Scaling. Wiley, New York (1958)

    Google Scholar 

  22. 22.

    Shepard, R.N.: The analysis of proximities: multidimensional scaling with an unknown distance function. Psychometrika 27(I and II), 219–246 and 219–246 (1962)

  23. 23.

    Kruskal, J.: Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29(1), 1–27 (1964)

    MathSciNet  Article  Google Scholar 

  24. 24.

    Kruskal, J.B., Wish, M.: Multidimensional Scaling. Sage Publications, Newbury Park (1978)

    Book  Google Scholar 

  25. 25.

    Borg, I., Groenen, P.J.: Modern Multidimensional Scaling-Theory and Applications. Springer, New York (2005)

    MATH  Google Scholar 

  26. 26.

    Ionescu, C., Machado, J.T., Keyser, R.D.: Is multidimensional scaling suitable for mapping the input respiratory impedance in subjects and patients? Comput. Methods Programs Biomed. 104(3), e189–e200 (2011)

    Article  Google Scholar 

  27. 27.

    Machado, J.A.T., Dinç, E., Baleanu, D.: Analysis of UV spectral bands using multidimensional scaling. SIViP 9(3), 573–580 (2013).

    Article  Google Scholar 

  28. 28.

    Lai, M.M., Cavanagh, D.: The molecular biology of coronaviruses. In: Kielian, M., Mettenleiter, T., Roossinck, M. (eds.) Advances in Virus Research, pp. 1–100. Elsevier, Amsterdam (1997).

    Chapter  Google Scholar 

  29. 29.

    Schoeman, D., Fielding, B.C.: Coronavirus envelope protein: current knowledge. Virol. J. (2019).

    Article  Google Scholar 

  30. 30.

    Cui, J., Li, F., Shi, Z.L.: Origin and evolution of pathogenic coronaviruses. Nat. Rev. Microbiol. 17(3), 181–192 (2018).

    Article  Google Scholar 

  31. 31.

    Lau, S.K.P., Woo, P.C.Y., Li, K.S.M., Huang, Y., Tsoi, H.W., Wong, B.H.L., Wong, S.S.Y., Leung, S.Y., Chan, K.H., Yuen, K.Y.: Severe acute respiratory syndrome coronavirus-like virus in chinese horseshoe bats. Proc. Nat. Acad. Sci. 102(39), 14040–14045 (2005).

    Article  Google Scholar 

  32. 32.

    Phan, T.: Genetic diversity and evolution of SARS-CoV-2. Infect. Genet. Evol. 81, 104260 (2020).

    Article  Google Scholar 

  33. 33.

    Cilibrasi, R., Vitany, P.M.B.: Clustering by compression. IEEE Trans. Inf. Theory 51(4), 1523–1545 (2005).

    MathSciNet  Article  MATH  Google Scholar 

  34. 34.

    Deza, M.M., Deza, E.: Encyclopedia of Distances. Springer, Berlin (2009)

    Book  Google Scholar 

  35. 35.

    Cha, S.: Taxonomy of nominal type histogram distance measures. In: Proceedings of the American Conference on Applied Mathematics, pp. 325–330. Harvard, Massachusetts, USA (2008)

  36. 36.

    Yin, C., Chen, Y., Yau, S.S.-T.: A measure of DNA sequence similarity by Fourier transform with applications on hierarchical clustering complexity for DNA sequences. J. Theor. Biol. 359, 18–28 (2014).

    Article  MATH  Google Scholar 

  37. 37.

    Kubicova, V., Provaznik, I.: Relationship of bacteria using comparison of whole genome sequences in frequency domain. Inf. Technol. Biomed. 3, 397–408 (2014).

    Article  Google Scholar 

  38. 38.

    Glunčić, M., Paar, V.: Direct mapping of symbolic DNA sequence into frequency domain in global repeat map algorithm. Nucleic Acids Res. (2013).

    Article  Google Scholar 

  39. 39.

    Hu, L.Y., Huang, M.W., Ke, S.W., Tsai, C.F.: The distance function effect on k-nearest neighbor classification for medical datasets. Springer Plus (2016).

    Article  Google Scholar 

  40. 40.

    Hautamaki, V., Pollanen, A., Kinnunen, T., Aik, K., Haizhou, L., Franti, L.: A Comparison of Categorical Attribute Data Clustering Methods, pp. 53–62. Springer, New York (2014).

    Book  Google Scholar 

  41. 41.

    Aziz, M., Alhadidi, D., Mohammed, N.: Secure approximation of edit distance on genomic data. BMC Med. Genomics (2017).

    Article  Google Scholar 

  42. 42.

    Yianilos, P.N.: Normalized forms of two common metrics. Technical Report 91-082-9027-1, NEC Research Institute (1991)

  43. 43.

    Yu, J., Amores, J., Sebe, N., Tian, Q.: A new study on distance metrics as similarity measurement. In: IEEE International Conference on Multimedia and Expo, pp. 533–536 (2006).

  44. 44.

    Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.A. (eds.): Feature Extraction: Foundations and Applications. Springer, New York (2008)

    Google Scholar 

  45. 45.

    Russel, R., Sinha, P.: Perceptually based comparison of image similarity metrics. Perception 40, 1269–1281 (2011).

    Article  Google Scholar 

  46. 46.

    Kolmogorov, A.: Three approaches to the quantitative definition of information. Int. J. Comput. Math. 2(1–4), 157–168 (1968)

    MathSciNet  Article  Google Scholar 

  47. 47.

    Bennett, C.H., Gács, P., Li, M., Vitányi, P., Zurek, W.H.: Information distance. IEEE Trans. Inf. Theory 44(4), 1407–1423 (1998)

    MathSciNet  Article  Google Scholar 

  48. 48.

    Fortnow, L., Lee, T., Vereshchagin, N.: Kolmogorov complexity with error. In: Durand, B., Thomas, W. (eds.) STACS 2006–23rd Annual Symposium on Theoretical Aspects of Computer Science, Marseille, France, February 23–25, 2006. Lecture Notes in Computer Science, pp. 137–148. Springer, Berlin (2006)

  49. 49.

    Cebrián, M., Alfonseca, M., Ortega, A.: Common pitfalls using the normalized compression distance: what to watch out for in a compressor. Commun. Inf. Syst. 5(4), 367–384 (2005).

    MathSciNet  Article  MATH  Google Scholar 

  50. 50.

    Pinho, A., Ferreira, P.: Image similarity using the normalized compression distance based on finite context models. In: Proceedings of IEEE International Conference on Image Processing (2011).

  51. 51.

    Vázquez, P.P., Marco, J.: Using normalized compression distance for image similarity measurement: an experimental study. J. Comput. Virol. Hacking Tech. 28(11), 1063–1084 (2012).

    Article  Google Scholar 

  52. 52.

    Cohen, A.R., Vitányi, P.M.B.: Normalized compression distance of multisets with applications. IEEE Trans. Pattern Anal. Mach. Intell. 37(8), 1602–1614 (2015).

    Article  Google Scholar 

  53. 53.

    Borbely, R.S.: On normalized compression distance and large malware. J. Comput. Virol. Hacking Tech. 12(4), 235–242 (2016).

    Article  Google Scholar 

  54. 54.

    On the Approximation of the Kolmogorov Complexity for DNA Sequences (2017).

  55. 55.

    Antão, R., Mota, A., Machado, J.A.T.: Kolmogorov complexity as a data similarity metric: application in mitochondrial DNA. Nonlinear Dyn. 93(3), 1059–1071 (2018).

    Article  Google Scholar 

  56. 56.

    Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27(3), 379–423, 623–656 (1948)

  57. 57.

    Gray, R.M.: Entropy and Information Theory. Springer, New York (2011)

    Book  Google Scholar 

  58. 58.

    Beck, C.: Generalised information and entropy measures in physics. Contemp. Phys. 50(4), 495–510 (2009).

    Article  Google Scholar 

  59. 59.

    Khinchin, A.I.: Mathematical Foundations of Information Theory. Dover, New York (1957)

    MATH  Google Scholar 

  60. 60.

    Jaynes, E.T.: Information theory and statistical mechanics. Phys. Rev. 106(6), 620–630 (1957)

    MathSciNet  Article  Google Scholar 

  61. 61.

    Rényi, A.: On measures of information and entropy. In: Proceedings of the fourth Berkeley Symposium on Mathematics, Statistics and Probability, pp. 547–561. Berkeley, California (1960).

  62. 62.

    Tsallis, C.: Possible generalization of Boltzmann–Gibbs statistics. J. Stat. Phys. 52(52), 479–487 (1988).

    MathSciNet  Article  MATH  Google Scholar 

  63. 63.

    Machado, J.A.T.: Fractional order generalized information. Entropy 16(4), 2350–2361 (2014).

    Article  Google Scholar 

  64. 64.

    Wang, Vemuri, Rao, Chen: Cumulative residual entropy, a new measure of information & its application to image alignment. In: Proceedings Ninth IEEE International Conference on Computer Vision. IEEE (2003).

  65. 65.

    Xiong, H., Shang, P., Zhang, Y.: Fractional cumulative residual entropy. Commun. Nonlinear Sci. Numer. Simul. 78, 104879 (2019).

    MathSciNet  Article  Google Scholar 

  66. 66.

    Sibson, R.: Information radius. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 14(2), 149–160 (1969)

    MathSciNet  Article  Google Scholar 

  67. 67.

    Taneja, I., Pardo, L., Morales, D., Ménandez, L.: Generalized information measures and their applications: a brief survey. Qüestiió 13(1–3), 47–73 (1989)

    MathSciNet  Google Scholar 

  68. 68.

    Lin, J.: Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 37(1), 145–151 (1991).

    MathSciNet  Article  MATH  Google Scholar 

  69. 69.

    Cha, S.H.: Measures between probability density functions. Int. J. Math. Models Methods Appl. Sci. 1(4), 300–307 (2007)

    MathSciNet  Google Scholar 

  70. 70.

    Hartigan, J.A.: Clustering Algorithms. Wiley, New York (1975)

    MATH  Google Scholar 

  71. 71.

    Tenreiro, J.A., Machado, A.M.L., Galhano, A.M.: Multidimensional scaling visualization using parametric similarity indices. Entropy 17(4), 1775–1794 (2015).

    Article  MATH  Google Scholar 

  72. 72.

    Aggarwal, C.C., Hinneburg, A., Keim, D.A.: On the Surprising Behavior of Distance Metrics in High Dimensional Space. Springer, New York (2001)

    Book  Google Scholar 

  73. 73.

    Sokal, R.R., Rohlf, F.J.: The comparison of dendrograms by objective methods. Taxon 10, 33–40 (1962).

    Article  Google Scholar 

  74. 74.

    Felsenstein, J.: PHYLIP (phylogeny inference package), version 3.5 c. Joseph Felsenstein (1993)

  75. 75.

    Tuimala, J.: A Primer to Phylogenetic Analysis Using the PHYLIP Package. CSC—Scientific Computing Ltd., Espoo (2006)

    Google Scholar 

  76. 76.

    Saeed, N., Haewoon, I.M., Saqib, D.B.M.: A survey on multidimensional scaling. ACM Comput. Surv. CSUR 51(3), 47 (2018).

    Article  Google Scholar 

  77. 77.

    Machado, J.A.T.: Relativistic time effects in financial dynamics. Nonlinear Dyn. 75(4), 735–744 (2014).

    MathSciNet  Article  Google Scholar 

  78. 78.

    Lopes, A.M., Andrade, J.P., Machado, J.T.: Multidimensional scaling analysis of virus diseases. Comput. Methods Programs Biomed. 131, 97–110 (2016).

    Article  Google Scholar 

  79. 79.

    Cyranoski, D.: Profile of a killer: the complex biology powering the coronavirus pandemic. Nature 581(7806), 22–26 (2020).

    Article  Google Scholar 

  80. 80.

    Andersen, K.G., Rambaut, A., Lipkin, W.I., Holmes, E.C., Garry, R.F.: The proximal origin of SARS-CoV-2. Nat. Med. 26(4), 450–452 (2020).

    Article  Google Scholar 

Download references


The authors thank all those who have contributed and shared sequences to the GISAID database ( The authors also thank those who have contributed to the GenBank of the National Center for Biotechnology Information (NCBI) databases ( The authors also thank Rómulo Antão for the help in handling the information with the compressors zlib and bz2.

Author information



Corresponding author

Correspondence to J. A. Tenreiro Machado.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.




Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Machado, J.A.T., Rocha-Neves, J.M. & Andrade, J.P. Computational analysis of the SARS-CoV-2 and other viruses based on the Kolmogorov’s complexity and Shannon’s information theories. Nonlinear Dyn 101, 1731–1750 (2020).

Download citation


  • COVID-19
  • Kolmogorov complexity theory
  • Shannon information theory
  • Hierarchical clustering
  • Multidimensional scaling