1 Introduction

Presently, reliable data about many real-world phenomena are available for computer processing. One example is clinical information about viral diseases. Viral infections are an important cause of mortality and morbidity. More than 2000 viruses have been identified, and many can infect humans or animals [1]. In general, viral diseases have very diverse characteristics and complexity, and computational methods for data mining and feature extraction are relevant strategies to adopt. As usually occurs with real-world data, the information is scattered and exhibits multiple characteristics with distinct levels of relevance. Therefore, it is important to explore reliable algorithms for highlighting the main details, and to take advantage of modern computational resources to visualize the relations embedded within the data.

Herein, we adopt the multidimensional scaling (MDS) technique to compare the relationships among several viruses responsible for human diseases. New schemes for improving the visualization of the MDS charts are proposed. Regarding the selection of the “objects” under study, most were chosen for their impact on people and visibility in communication media (e.g., subtype H5N1 of Influenza A virus, Ebola, Chikungunya and Zika), others for historical reasons (e.g., Rabies, Poliomyelitis, and Smallpox), and some because of their incidence and prevalence in humans (e.g., Influenza, Rhinovirus, and Norovirus). The viruses are compared by means of their characteristics and the symptoms of the diseases that they may cause in humans.

MDS can lead to a new perspective in the study of human pathologies. MDS is a statistical technique for analyzing similarities in information that generates geometric representations for complex objects [2]. MDS appeared in the context of behavioral sciences, for understanding judgments of individuals about features in a set of objects [3, 4]. Presently, MDS is used with real-world data in areas such as biological taxonomy [5], finance [6], marketing [7], sociology [8], physics [9], geophysics [10–12], communication networks [13], biology and biomedicine [14], among others [15].

The paper is organized as follows. Section 2 introduces the MDS technique. Section 3 studies and compares data characterizing the clinical effects of 21 viruses. Finally, Sect. 4 draws the conclusions.

2 Multidimensional Scaling

We consider s objects defined in an m-dim space, \(\mathcal {M}\), and a proximity measure, \(\delta _{ij}\), between objects i and j. The first step consists of calculating the matrix \(\mathbf {C}=[\delta _{ij}]\) (\(\dim(\mathbf {C})=s \times s\)) of item-to-item dissimilarities. The MDS produces a configuration \(\mathbf {X}\) (\(\dim(\mathbf {X})=s \times q\)), where the dimension \(q < m\) is chosen by the user. Thus, \(\mathbf {X}\) attempts to replicate in a low-dimensional space, \(\mathcal {Q}\), the proximities between the s elements in \(\mathcal {M}\). In general, the MDS unveils the embedded data patterns, and differs from other techniques [16, 17] not only because it requires no a priori assumptions for each dimension, but also due to its good convergence [18, 19].

To arrive at configuration \(\mathbf {X}\), MDS evaluates alternative configurations to minimize some fitness function, such as [20] the raw stress, \(\sigma ^2\):

$$\begin{aligned} \sigma ^2=\displaystyle \sum _{i=2}^s \sum _{j=1}^{i-1} z_{ij}\left( \delta _{ij}-d_{ij}\right) ^2, \end{aligned}$$
(1)

where \(z_{ij}>0\) is a weight and \(d_{ij}\) measures the dissimilarity between items i and j in the embedding space \(\mathcal {Q}\), for which a distance measure is often adopted [21].
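As an illustration of expression (1), the following minimal sketch evaluates the raw stress of a trial configuration. It assumes that \(d_{ij}\) is the Euclidean distance and that all weights are \(z_{ij}=1\) unless a weight matrix is supplied; the function name `raw_stress` and the numerical values are hypothetical and are not taken from the data set studied here.

```python
import math

def raw_stress(delta, X, z=None):
    """Raw stress of Eq. (1): sum over i > j of z_ij * (delta_ij - d_ij)^2."""
    s = len(X)
    total = 0.0
    for i in range(1, s):
        for j in range(i):
            d_ij = math.dist(X[i], X[j])          # Euclidean distance (assumption)
            w = 1.0 if z is None else z[i][j]     # unit weights unless z is given
            total += w * (delta[i][j] - d_ij) ** 2
    return total

# Hypothetical dissimilarities between three objects
delta = [[0.0, 3.0, 4.0],
         [3.0, 0.0, 5.0],
         [4.0, 5.0, 0.0]]

X_exact = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]    # reproduces all dissimilarities
X_line = [(0.0, 0.0), (3.0, 0.0), (4.0, 0.0)]     # objects forced onto a line

stress_exact = raw_stress(delta, X_exact)
stress_line = raw_stress(delta, X_line)
```

The first configuration reproduces every dissimilarity and yields zero stress, whereas the second one misplaces the pair (2, 3) and is penalized accordingly.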

Besides (1), there are several other stress measures [22], namely, the normalized raw stress, Kruskal’s stress-1 and stress-2, and the S-stress.

To assess the quality of the MDS solutions, the Shepard diagram is used, which represents the pairs \((d_{ij}, \, \delta _{ij})\). The Shepard diagram displays the outliers and residuals resulting from the MDS. A narrow scatter following the 45\(^{\circ }\) line corresponds to a good fit between \(d_{ij}\) and \(\delta _{ij}\).

Another test of the MDS quality is the stress plot, which represents \(\sigma ^2\) versus q. The curve \(\sigma ^2(q)\) is monotonically decreasing, and the user chooses q as a compromise between reducing \(\sigma ^2\) and keeping q small.

The MDS interpretation focuses on the emerging clusters and considers the distances between points in the produced chart. Therefore, the user can rotate, shift, or zoom the chart, while the distances remain invariant. Usually, \(q=2\), or \(q=3\), is adopted, since they allow a direct graphical representation.

3 Data Analysis and Visualization

We analyze data for \(s = 21\) viruses responsible for infectious diseases. These are \(\{\)Bird Flu, Chicken Pox, Chikungunya, Dengue Fever, Ebola, Hepatitis B, HIV, Marburg disease, Measles, MERS, Mumps, Norovirus, Polio, Rabies, Rhinovirus, Rotavirus, Rubella, SARS, Seasonal Flu, Smallpox, and Zika virus infection\(\}\), with acronyms \(\{\)BFlu, CPox, Chi, Den, Ebo, HepB, HIV, Mar, Mea, MERS, Mum, Nor, Pol, Rab, Rhi, Rot, Rub, SARS, SFlu, Sma, and ZIKV\(\}\).

For the ith virus, \(i = 1,\ldots , \, s\), we associate \(m=5\) quantitative attributes, namely, (i) the fatality rate, (ii) the average basic reproductive number, (iii) the average serial interval, (iv) the incubation period, and (v) the virus survival time outside a host. Table 1 lists the data, where the numerical values correspond to the matrix \(\mathbf {\tilde{U}} = [\tilde{u}_{ik}]\), \(i = 1,\ldots , \, s\), \(k = 1,\ldots , \, m\).

Table 1 Attributes of the viruses

For constructing Table 1, data were obtained from several distinct sources: Influenza A virus, subtype H5N1 (or “Bird Flu”) [23–25]; Chicken Pox (varicella-zoster infection) [26–28]; Chikungunya [29–31]; Dengue Fever [32, 33]; Ebola [34–36]; Hepatitis B [37–39]; Human Immunodeficiency Virus (HIV) [40–42]; Marburg hemorrhagic fever [36, 43]; Measles [44–47]; Middle East Respiratory Syndrome (MERS) [48–50]; Mumps [51, 52]; Norovirus [53, 54]; Poliomyelitis [55–57]; Rabies [58–60]; Rhinovirus [61–63]; Rotavirus [64–67]; Rubella [46, 68]; Severe Acute Respiratory Syndrome (SARS) [49, 69]; Seasonal flu [25, 70, 71]; Smallpox [72, 73]; Zika virus disease [74, 75].

3.1 MDS Analysis using the Arc-cosine Distance

Prior to applying MDS, the data are “normalized” to avoid saturation effects of the numerical values. Therefore, the elements of each column of matrix \(\mathbf {\tilde{U}}\) are converted to the interval [0, 1], producing the data matrix \({\mathbf {U}}\). The vectors of features for item-to-item comparison correspond to the rows of \({\mathbf {U}}\) and will be denoted by \(\mathbf {u}_i\).
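The text does not state which rescaling is used to map each column to [0, 1]. The sketch below assumes a min-max rescaling, \((u - \min )/(\max - \min )\), applied column by column; dividing by the column maximum would be another admissible choice. The function name `normalize_columns` and the sample values are hypothetical.

```python
def normalize_columns(U_raw):
    """Rescale each column to [0, 1] with (u - min) / (max - min) (assumed scheme)."""
    columns = list(zip(*U_raw))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # a constant column carries no information and is mapped to zero
        [(u - lo) / (hi - lo) if hi > lo else 0.0
         for u, lo, hi in zip(row, lows, highs)]
        for row in U_raw
    ]

# Hypothetical raw attributes: 3 items (rows) by 3 features (columns)
U_raw = [[60.0, 1.5, 10.0],
         [0.1, 12.0, 2.0],
         [30.05, 6.75, 6.0]]

U = normalize_columns(U_raw)
```

Each row of `U` then plays the role of a feature vector \(\mathbf {u}_i\).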

Various distance measures were tested for constructing the matrix \(\mathbf {C}\). Here, we present results for the arc-cosine distance, \(\delta _{ij}\), since it leads to charts that are easy to interpret. Other distances are possible and have also been used in distinct applications [6, 12], but several numerical tests confirmed that the arc-cosine leads to reliable results. Therefore, for items i and j \((i, \, j = 1,\ldots , \, s)\), we have

$$\begin{aligned} \delta _{ij}=\arccos \left( \frac{\displaystyle \sum _{k=1}^{m} \alpha _k^2u_{ik} u_{jk}}{\sqrt{\displaystyle \sum _{k=1}^{m} \alpha _k^2 u_{ik}^2 \cdot \displaystyle \sum _{k=1}^{m} \alpha _k^2 u_{jk}^2}}\right) , \end{aligned}$$
(2)

where \(\alpha _k>0\), \(k=1,\ldots , \, m\), represent weights specified by the user. Given expression (2), the matrix \(\mathbf {C} = [\delta _{ij}]\) can be computed for feeding the MDS.
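Expression (2) can be sketched in a few lines. The example below is only illustrative: the function names and the three feature vectors are hypothetical, and the ratio is clamped to \([-1, 1]\) to protect the arc-cosine against rounding errors, a safeguard that is not part of (2).

```python
import math

def arccos_distance(u_i, u_j, alpha):
    """Weighted arc-cosine distance of Eq. (2) between two feature vectors."""
    num = sum(a * a * x * y for a, x, y in zip(alpha, u_i, u_j))
    den = math.sqrt(sum(a * a * x * x for a, x in zip(alpha, u_i))
                    * sum(a * a * y * y for a, y in zip(alpha, u_j)))
    # den is zero for an all-zero feature vector; such rows are not handled here
    ratio = max(-1.0, min(1.0, num / den))
    return math.acos(ratio)

def dissimilarity_matrix(U, alpha):
    """Matrix C = [delta_ij] for the rows of U."""
    s = len(U)
    return [[0.0 if i == j else arccos_distance(U[i], U[j], alpha)
             for j in range(s)] for i in range(s)]

# Hypothetical normalized vectors with m = 5 features
U = [[1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0, 0.0]]

C_weighted = dissimilarity_matrix(U, [5, 2, 1, 1, 1])
C_uniform = dissimilarity_matrix(U, [1, 1, 1, 1, 1])
```

In this toy case, increasing the weight of the first feature reduces the angle between the first and third items from \(\pi /4\) to \(\arctan (2/5)\), which shows how \(\alpha _k\) emphasizes agreement in the heavily weighted features.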

Figure 1 represents the 2D and 3D charts (\(q=2\) and \(q=3\)) resulting from the MDS using the weights \(\alpha _k = \{5,\,2,\,1,\,1,\,1\}\), where the points represent viruses. The relationships between the items are inferred from the coordinates of the points. Objects that are similar (dissimilar) appear closer (farther) to each other in space.

With alternative distances, we capture different characteristics of the phenomena, which yield distinct plots but, in general, lead to identical conclusions. A “good” distance is one that produces an MDS chart reflecting the real-world phenomenon in a direct and clear visualization.

Fig. 1
MDS charts resulting from the arc-cosine distance \(\delta _{ij}\), \(q=2\) and \(q=3\)

Figure 2 depicts the Shepard diagrams for \(q=1,\ldots , \, 5\) and the stress plot. The Shepard diagrams show a narrow scatter of points around the 45\(^{\circ }\) line for \(q \ge 3\), demonstrating a good fit between the distances and the dissimilarities. The curvature of the stress plot is often adopted for deciding the value of q. In this case, we observe that \(q =2\) is insufficient, \(q=3\) seems to be a good choice, and \(q> 3\) leads to limited improvements. However, if we adopt \(q=3\), the question remains of how to visualize the MDS information efficiently, since for 3D representations we often have to zoom, shift, and rotate the MDS graph to perceive clearly the real location of the objects in space. This question will be further discussed in Sect. 3.3.2.

Fig. 2
Quality of the MDS solution for the arc-cosine distance \(\delta _{ij}\) assessed by the Shepard diagram for \(q=1,\ldots , \, 5\) and the stress plot

Before continuing, two numerical aspects need to be clarified: the weights \(\alpha _k\) used and the missing data in Table 1. The weights were tuned to highlight the features recognized as more harmful from the medical point of view: first, the fatality rate and, second, the average basic reproductive number. However, the question remains of how to choose \(\alpha _k\). To address it, we performed several experiments varying the weights. Figure 3 depicts the results obtained with four distinct sets of values, namely, \(\alpha _k = \{1,\,1,\,1,\,1,\,1\}\), \(\alpha _k = \{2.5,\,1.5,\,1,\,1,\,1\}\), \(\alpha _k = \{5,\,2,\,1,\,1,\,1\}\), and \(\alpha _k = \{7.5,\,2.5,\,1,\,1,\,1\}\). For each set \(\alpha _k\), we generate one MDS chart, and afterwards, the charts are combined using Procrustes analysis [76]. Procrustes analysis applies translation, reflection, orthogonal rotation, and scaling to the points of a given matrix so that they best conform to the points of a reference matrix.

In our case, we (i) choose the first chart for initial reference, (ii) use Procrustes to superimpose the next MDS chart into the current reference, (iii) make the current set of superimposed charts the new reference, and (iv) continue to step (ii) until all charts have been conformed. The results obtained reveal identical patterns, meaning that the method is robust to distinct values of \(\alpha _k\).
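The superposition steps can be sketched for 2D charts as follows. This is a minimal illustration under stated assumptions, not the implementation used for Fig. 3: the fit is the standard least-squares Procrustes solution written in closed form for two dimensions, and the “current reference” of step (iii) is taken here as the pointwise mean of the charts already aligned, which is only one possible reading of the text. All function names and values are hypothetical.

```python
import math

def center(P):
    """Return the points of P shifted to zero mean, together with the centroid."""
    n = len(P)
    cx = sum(p[0] for p in P) / n
    cy = sum(p[1] for p in P) / n
    return [(p[0] - cx, p[1] - cy) for p in P], (cx, cy)

def procrustes_2d(ref, Y):
    """Fit Y onto ref by translation, scaling, rotation and optional reflection.
    Returns the fitted points and the residual sum of squares."""
    X0, (cx, cy) = center(ref)
    Y0, _ = center(Y)
    best = None
    for flip in (1.0, -1.0):                      # flip = -1 mirrors the second axis
        Yf = [(y1, flip * y2) for y1, y2 in Y0]
        A = sum(x1 * y1 + x2 * y2 for (x1, x2), (y1, y2) in zip(X0, Yf))
        B = sum(x2 * y1 - x1 * y2 for (x1, x2), (y1, y2) in zip(X0, Yf))
        gain = math.hypot(A, B)                   # best attainable correlation
        if best is None or gain > best[0]:
            best = (gain, math.atan2(B, A), Yf)
    gain, theta, Yf = best
    scale = gain / sum(y1 * y1 + y2 * y2 for y1, y2 in Yf)
    c, s = math.cos(theta), math.sin(theta)
    fitted = [(scale * (c * y1 - s * y2) + cx, scale * (s * y1 + c * y2) + cy)
              for y1, y2 in Yf]
    rss = sum((fx - rx) ** 2 + (fy - ry) ** 2
              for (fx, fy), (rx, ry) in zip(fitted, ref))
    return fitted, rss

def superimpose(charts):
    """Steps (i)-(iv): align every chart to the mean of those already aligned."""
    aligned = [list(charts[0])]
    for chart in charts[1:]:
        n = len(aligned)
        reference = [(sum(a[i][0] for a in aligned) / n,
                      sum(a[i][1] for a in aligned) / n)
                     for i in range(len(chart))]
        aligned.append(procrustes_2d(reference, chart)[0])
    return aligned

# Hypothetical chart and a mirrored, rotated, scaled and shifted copy of it
ref = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
c, s = math.cos(0.7), math.sin(0.7)
moved = [(2.5 * (c * x + s * y) + 4.0, 2.5 * (s * x - c * y) - 1.0) for x, y in ref]

fitted, rss = procrustes_2d(ref, moved)
aligned = superimpose([ref, moved])
```

Because `moved` differs from `ref` only by operations that Procrustes analysis removes, the residual is zero up to rounding and the fitted points coincide with the reference.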

Fig. 3
MDS global chart for \(q=2\) and the arc-cosine distance \(\delta _{ij}\), obtained by Procrustes with four different sets of weights \(\alpha _k\)

In Fig. 1, the unknown data, denoted by ‘-’ in Table 1, are set to zero. Therefore, these values do not contribute to the distance used for comparing items. Moreover, as the missing data occur only in four values of the less weighted features, their influence is not as significant as that of the rest of the information. In addition, as will be shown in Sect. 3.2, the results reveal small sensitivity to possible noise in the data, which includes the uncertainty in the unknown values that were set to zero. Nonetheless, a different criterion for dealing with that problem could be adopted. Experiments with the missing data replaced not only by zero, but also by the minimum, average, and maximum values in the third and fifth columns of Table 1 led to qualitatively similar results, as depicted in Fig. 4, supporting the criterion adopted.

Fig. 4
MDS global chart for \(q=2\) and arc-cosine distance \(\delta _{ij}\), obtained by Procrustes with missing data replaced by zero, minimum, average, and maximum values of the third and fifth features

3.2 Sensitivity Analysis

The 21 viruses were compared from the perspective of quantitative features. However, the data diverge slightly, depending on factors such as the time of the study or the operational conditions, namely, environmental conditions, geographic region, development level, or medical assistance. Therefore, we analyze here the sensitivity of the results with respect to the input data.

We start by adding random noise to the quantitative features, \(k=1,\ldots , \, 5\), with amplitude \(\pm 10\%\) of the values in Table 1. Moreover, any negative values are avoided by limiting numbers to zero. A sample of 50 experiments, each yielding one MDS chart, is performed and the charts are combined using the Procrustes scheme.
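The perturbation step can be sketched as follows. The text specifies the amplitude (\(\pm 10\%\)) and the clipping at zero, but not the noise distribution; a uniform multiplicative factor is assumed here. The function name `perturb`, the seed, and the sample values are hypothetical.

```python
import random

def perturb(U_raw, amplitude=0.10, rng=random):
    """Scale every entry by a factor drawn uniformly from
    [1 - amplitude, 1 + amplitude] (assumed distribution), clipping at zero."""
    return [[max(0.0, u * (1.0 + rng.uniform(-amplitude, amplitude))) for u in row]
            for row in U_raw]

rng = random.Random(0)                      # fixed seed for reproducibility
U_raw = [[60.0, 1.5, 10.0],                 # hypothetical attribute values
         [0.1, 12.0, 2.0]]

# 50 noisy copies of the data; each one would feed one MDS run
samples = [perturb(U_raw, 0.10, rng) for _ in range(50)]
```

Each noisy copy is then normalized, converted into a matrix \(\mathbf {C}\), and passed to the MDS, and the resulting charts are superimposed with the Procrustes scheme.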

Figure 5 illustrates the MDS chart for \(q=2\) produced by the Procrustes algorithm. We verify that the method has low sensitivity to variations in the quantitative features, since the location of the points reveals minor variations.

Fig. 5
MDS global chart for \(q=2\) and arc-cosine distance, \(\delta _{ij}\), generated by Procrustes with random variations added to the values of the five features

3.3 Data Clustering and Visualization

The MDS interpretation focuses on the distances between points in the produced charts. For identifying clusters, we can adopt some kind of ad hoc strategy based on the direct visualization of the MDS plots, or we can implement an algorithm for obtaining an automatic clustering. In addition, the configuration, \(\mathbf {X}\), produced by the MDS tries to replicate, in the low-dimensional space, \(\mathcal {Q}\), the original proximities between pairwise elements. For \(q=2\), this leads to a direct visualization, but neglects the information described in the higher dimensional components of \(\mathbf {X}\). In this line of thought, in the next subsections, we introduce the non-hierarchical clustering algorithm K-means for automatic cluster identification and we propose a technique for an improved visualization of MDS information in the 2D space by embedding information of the extra dimensions.

3.3.1 The K-Means Clustering

Clustering is a technique that groups objects similar to each other in some sense. The K-means is a non-hierarchical clustering algorithm [77] that starts with a set of s objects, where each one is represented by a point in a q-dim space, and a certain number of clusters, K, specified in advance. The K-means groups the s objects into \(K \le s\) clusters, to minimize the sum of distances between the points and the centers of their clusters. The K-means produces a solution where objects in a cluster are close to each other and far from objects in other clusters.
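A minimal sketch of the procedure is given below, assuming the usual Lloyd iteration with Euclidean distances, for which the cluster centers are the means of their members. The initialization with the first K points is a simplification adopted only to keep the example deterministic; practical implementations use random or more elaborate seeding. The function name and the coordinates are hypothetical.

```python
import math

def kmeans(points, K, max_iter=100):
    """Lloyd's K-means: alternate nearest-center assignment and center update."""
    centers = [tuple(p) for p in points[:K]]        # simplistic initialization
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(K), key=lambda k: math.dist(p, centers[k]))
                      for p in points]
        if new_labels == labels:                    # assignments are stable
            break
        labels = new_labels
        for k in range(K):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:                             # keep old center if cluster is empty
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Hypothetical coordinates of six objects in a q = 3 dim space
points = [(0, 0, 0), (0, 1, 0), (1, 0, 0),
          (10, 10, 10), (10, 11, 10), (11, 10, 10)]

labels, centers = kmeans(points, K=2)
```

Even though both initial centers fall in the first group, the iteration separates the two groups after a few updates.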

An important issue in K-means is to specify K, since the notion of “good clustering” is subjective. Nevertheless, we can adopt different measures for assessing the quality of the solution, such as the Calinski-Harabasz, Davies-Bouldin, and silhouette [78].

Here, we consider the silhouette, S, to assess if an object lies “adequately” within its cluster. The silhouette varies in the interval \(S \in [-1,1]\), so that values close to \(-1\), 0, and 1 correspond to incorrect, neutral, and correct object assignments, respectively.
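For reference, the sketch below computes the silhouette of each object using the standard definition \(S_i = (b_i - a_i)/\max (a_i, b_i)\), where \(a_i\) is the mean distance from object i to the other members of its cluster and \(b_i\) is the smallest mean distance from i to the members of any other cluster. Euclidean distances are assumed, and the function name, coordinates, and labels are hypothetical.

```python
import math

def silhouette_values(points, labels):
    """Silhouette S_i = (b_i - a_i) / max(a_i, b_i) for every object."""
    values = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:       # singleton or single cluster: S = 0
            values.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        values.append((b - a) / max(a, b))
    return values

# Hypothetical coordinates in a q = 3 dim space
points = [(0, 0, 0), (0, 1, 0), (1, 0, 0),
          (10, 10, 10), (10, 11, 10), (11, 10, 10)]

good = silhouette_values(points, [0, 0, 0, 1, 1, 1])   # natural grouping
bad = silhouette_values(points, [0, 1, 0, 1, 0, 1])    # scrambled grouping

mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
```

The mean silhouette of the natural grouping is close to 1, while the scrambled assignment scores clearly lower, which is the behavior exploited when comparing candidate values of K.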

Knowing the coordinates of the \(s=21\) objects produced by the MDS in the \(q=3\) dim space, we assess the quality of the clusters in the interval \(K \in [2,6]\). Figure 6 depicts the corresponding silhouettes and the mean value for each cluster (blue marks). The optimum is obtained for \(K=4\), corresponding to the maximum silhouette mean value \(S_M = 0.77\).

Fig. 6
Silhouettes assessing the quality of the clustering for \(K \in [2,6]\), the arc-cosine distance \(\delta _{ij}\), and \(q=3\). The blue marks depict the mean silhouette value for each cluster

For \(K=4\), the clusters are \(\mathcal {A} = \{\)CPox, Mea, Mum, Nor, Rhi, Rot, Rub, SFlu\(\}\), \(\mathcal {B} = \{\)HepB, HIV, Rab\(\}\), \(\mathcal {C} = \{\)BFlu, Ebo, Mar, MERS, Pol, SARS, Sma\(\}\) and \(\mathcal {D} = \{\)Chi, Den, ZIKV\(\}\). These clusters are further discussed in the next subsection.

3.3.2 Improved Visualization in 2D Space

The geometrical shape of the chart produced by MDS varies significantly with the measure adopted for quantifying the distances between items. However, this characteristic does not preclude using the MDS chart and taking full advantage of all its properties. Consequently, we may interpret the collection of points as “samples” of an abstract locus corresponding to the projection of the m initial dimensions into a lower dimensional (abstract) space.

We adopt a scheme that allows for a direct visualization of the MDS, while including information up to \(q=3\). Therefore, we approximate the dimension \(x_3\) of \(\mathbf {X}\) with a contour generated by means of a linear radial basis function interpolation [79]. Moreover, we improve the identification of patterns by superimposing a tree on the MDS chart. The nodes of the tree are the \(s=21\) points representing items (viruses). In a first phase, we connect the groups of points that are closest in the MDS chart, producing sets, \(\mathcal {P}\), of interconnected points (nodes). In a second phase, the sets \(\mathcal {P}\) are compared through the distances between their constituent nodes. The distance can be calculated taking into account any number \(p<m\) of \(\mathbf {X}\) components. A connection is established in the q-dim chart only between the two closest nodes (i.e., one in \(\mathcal {P}_i\) and the other in \(\mathcal {P}_j\)). This calculation generates a second level of interconnection, and the scheme is repeated iteratively until there is a continuous route between all points. Therefore, the interpretation of the MDS chart is based not only on the relative position of the points, but also on the structure interconnecting them.
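The tree construction can be approximated by a short sketch. One possible reading of the two-phase scheme is that it yields a minimum spanning tree of the points; the code below adopts that reading and builds the tree with Kruskal's algorithm rather than with the iterative merging described above, so it should be taken as an assumption and not as the procedure used for Fig. 7. Distances are Euclidean over all coordinates supplied, so the tree can be computed with more components than the two that are plotted. The function name and the coordinates are hypothetical.

```python
import math

def spanning_tree(points):
    """Minimum spanning tree by Kruskal's algorithm; returns a list of edges (i, j)."""
    n = len(points)
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))                 # union-find forest over the nodes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    tree = []
    for _, i, j in edges:                   # shortest connections first
        ri, rj = find(i), find(j)
        if ri != rj:                        # the edge joins two separate sets
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Hypothetical 2D coordinates of five objects
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 1.5), (20.0, 0.0)]
tree = spanning_tree(points)
```

In this toy case, the nearby pairs (0, 1) and (2, 3) are linked first, then the two groups are joined through their closest nodes, and finally the isolated point is attached.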

Figure 7 depicts a projection of the MDS chart for \(q=2\), the contour that approximates the dimension \(x_3\), and the superimposed interconnections generated by calculating the distances between objects with \(p=5\). We easily observe the four clusters identified in the previous subsection. Moreover, we verify that the proposed methodology leads to a clear visualization and produces a richer chart of the objects.

Fig. 7
MDS chart for \(q=2\) and the arc-cosine distance \(\delta _{ij}\). The contour represents the dimension \(x_3\) and the superimposed tree allows for an easier identification of patterns

In summary, besides the observation based on the relative distances in 2D space, we now verify that ZIKV has a relevant position along the \(x_3\) dimension, somehow strengthening the characteristics revealed by Chikungunya and Dengue.

3.3.3 Discussion of the Results

The clusters \(\{\mathcal {A}, \, \mathcal {B}, \, \mathcal {C}, \, \mathcal {D}\}\) do not follow an epidemiological line of thought, but may be of medical value, since they reflect characteristics measured by health care practice. Cluster \(\mathcal {A}\) includes viruses of Risk Group 2 that, in general, cause neither serious nor life-threatening illness.

In cluster \(\mathcal {B}\), we find the Lentivirus that is responsible for HIV infection and acquired immunodeficiency syndrome (AIDS), a Risk Group 3 agent. We also find Hepatitis B and the Rabies virus, a virus of the Lyssavirus genus and Rhabdoviridae family, of Risk Group 2.

In cluster \(\mathcal {C}\), we can consider two subclusters. The first subcluster includes the Ebola and Marburg viruses, which belong to Risk Group 4. In addition, in this subcluster, the agents responsible for MERS and Bird flu are classified as Risk Group 3. The second subcluster includes viruses of different Risk Groups, namely, the Polio virus and the SARS-associated coronavirus, belonging to Risk Groups 2 and 3, respectively. Smallpox is also present [80].

Cluster \(\mathcal {D}\) includes Chikungunya, considered a Risk Group 3 pathogen. Also included in \(\mathcal {D}\) are the Dengue fever virus, a Risk Group 2 arbovirus pathogen, and ZIKV, recognized as being similar to the Chikungunya and Dengue viruses.

In conclusion, we verified that MDS provides a powerful computational technique for visualizing virus data, and the obtained charts may be of medical interest in the scope of present and future viral outbreaks.

4 Conclusions

This paper discussed the computational analysis of real-world data describing the main quantitative characteristics of viruses. When dealing with complex scattered data, researchers have to choose between comparing all aspects and detecting the main properties. This problem represents a challenge, since some information (or its absence) may lead to incomplete or even incorrect conclusions. Therefore, complex information calls for computational and visualization tools capable of revealing the most relevant issues. The MDS technique was adopted, leading to substantive results that are consistent with present-day scientific knowledge.