Improving the energy efficiency of SMACOF for multidimensional scaling on modern architectures


Dimensionality reduction is of great interest in the context of big data processing. Multidimensional scaling (MDS) methods are techniques for dimensionality reduction in which data from a high-dimensional space are mapped into a lower-dimensional space. Such methods consume considerable computational resources; therefore, intensive research has been devoted to accelerating them. In this work, two efficient parallel versions of the well-known and precise SMACOF algorithm for solving MDS problems have been developed and evaluated on multicore and GPU platforms. To help the user of SMACOF, we provide these parallel versions together with a complementary Python code based on a heuristic approach that explores the optimal configuration of the parallel SMACOF algorithm on the available platforms in terms of energy efficiency (GFLOPs/watt). Three platforms, with 64 CPU-cores, 12 CPU-cores and a GPU device, have been considered for the experimental evaluation.


Introduction

Real-world data, such as speech signals, images, and biomedical, financial and telecommunication data, usually have a high dimensionality, as each data instance (point) is characterized by a set of features. The dimensionality of such data, as well as the amount of data to be processed, is constantly increasing; therefore, processing these data within a reasonable time frame remains an open problem. Dimensionality reduction methods, which aim to map high-dimensional data into a lower-dimensional space, play an extremely important role when exploring large datasets. Among such methods, multidimensional scaling (MDS) remains one of the most popular [2, 8].

One application of dimensionality reduction is the graphical visualization of the structure of high-dimensional data in 2D or 3D space for easier data understanding. Some applications in this line can be found in [12, 18, 22]. Moreover, MDS has proven useful as a technique to evaluate criteria for object classification [14] or to discover criteria which initially had not been taken into account [1], serving as a psychological model that allows human patterns to be discovered [15].

A well-known algorithm for MDS is SMACOF (Scaling by Majorizing a COmplicated Function) [7]. Experimental investigation has demonstrated that SMACOF is the most accurate algorithm compared to others [16]. It should be noted that SMACOF is also the most expensive, as its complexity is \(O(m^2)\), where m is the number of observations. Several approaches have been developed to reduce the computational complexity of MDS techniques. In [23], the complexity was reduced to \(O(m \sqrt{m})\) by developing an iterative MDS spring model. In [32], the authors reduced the complexity to \(O(m \log {m})\) by dividing the original matrix into sub-matrices and then combining the sub-solutions into a final solution. These improved versions of MDS reduce the complexity significantly; however, their optimization accuracy suffers [16]. Consequently, the SMACOF version of MDS is usually chosen, as it ensures sufficient accuracy, which is essential in many cases. In short, MDS techniques remain of high time complexity; therefore, parallel strategies should be considered to accelerate the computation of the MDS procedure [24].

During the last decade, high-performance computing (HPC) has greatly improved and has been widely applied to MDS techniques. In [29], the authors proposed a parallel MDS implementation and explored it under MPI and other libraries. In [11], Fester et al. proposed a CUDA implementation of an MDS algorithm based on high-throughput multidimensional scaling (HiT-MDS). In [28], the authors suggested a new efficient parallel GPU algorithm for MDS based on virtual particle dynamics [9] and experimentally compared it with a multicore CPU version. In [16], the multilevel MDS algorithm Glimmer was developed for GPU by dividing the input data into hierarchical levels and executing the algorithm recursively. It must be noted that Glimmer is currently the most well-known and widely used GPU tool for MDS. Another CUDA-based technique to obtain an MDS approximation is CFMDS [27], which implements both single-level and multilevel approaches.

In [26], the authors proposed a correlation clustering framework which uses MDS for layout and GPU acceleration to speed up visual feedback. In [25], a GPU version of MDS was developed to improve content-based image retrieval (CBIR) systems. Summarizing, research in this HPC field is being carried out actively; it remains relevant as new GPU architectures and heterogeneous platforms constantly appear and should be effectively exploited for solving dimensionality reduction problems of different complexities.

Currently, the targets of HPC include the optimization of energy consumption. The ratio of computational speed to electrical power (GFLOPs/watt) is usually taken as a suitable indicator of energy efficiency [19]. An increase in this ratio means that the system achieves better performance (GFLOPs) with less electrical power (watts) and, as a consequence, consumes less energy. Therefore, for optimal parallel executions of SMACOF, this ratio should be maximized.

In this paper, parallel versions of the SMACOF algorithm for multicore and GPU are developed and evaluated on prototypes of modern architectures. As the parallel SMACOF algorithm can be executed on different alternative platforms, the kind of platform and the resources that optimize the runtime and/or energy efficiency need to be determined. Bearing in mind that parallel performance depends on the problem size, the users of parallel SMACOF need support to configure it. For this purpose, a benchmarking process to find the optimal configurations has been developed. It is based on a heuristic approach which combines two concepts: the analysis of the first iterations of SMACOF as its representative computation, and functional models of performance and power consumption of homogeneous parallel platforms. The benchmarking process has been evaluated using different platforms (multicore and GPU) and various problem sizes. Moreover, the energy efficiency of SMACOF has been experimentally evaluated on two different multicore platforms and a GPU device.

The paper is organized as follows. In Sect. 2, the descriptions of the Multidimensional Scaling and the SMACOF algorithm are provided. Section 3 describes the proposed multicore and GPU parallel implementations of the SMACOF algorithm. In Sect. 4, the algorithm for tuning the energy efficiency of SMACOF is presented. Experimental evaluations of the parallel implementations on three platforms are discussed in Sect. 5. Finally, conclusions are drawn in Sect. 6.

SMACOF algorithm for MDS

Multidimensional scaling is a technique for the analysis of similarity or dissimilarity data on a set of objects (items). It aims at finding points \(Y_1, Y_2, \dots , Y_m \equiv Y\) in the low-dimensional space \(\mathbb {R}^s,\ s<n,\) such that the distances between them are as close as possible to the distances between the original points \(X_1, X_2, \dots , X_m \equiv X\) in the space \(\mathbb {R}^n\). This is achieved by minimizing the stress function:

$$\begin{aligned} E_\mathrm{MDS} = \sum _{i<j}{\Big (\delta _{ij}-d(Y_i,Y_j)\Big )^2} \end{aligned}$$

Here, \(d(Y_i,Y_j)\) is the distance between two points in the low-dimensional space, and \(\delta _{ij}\) is the distance between the corresponding points in the multidimensional space.
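For concreteness, the stress of a candidate configuration can be evaluated directly from \(\varDelta \) and Y. The following minimal NumPy sketch mirrors the formula above; the function and variable names are ours, not taken from the paper's codes:

```python
import numpy as np

def stress(delta, Y):
    """MDS stress: sum over i<j of (delta_ij - d(Y_i, Y_j))^2.

    delta : (m, m) symmetric matrix of distances in the original space.
    Y     : (m, s) coordinates of the points in the low-dimensional space.
    """
    # Pairwise Euclidean distances in the low-dimensional space
    diff = Y[:, None, :] - Y[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2))
    i, j = np.triu_indices(len(Y), k=1)   # pairs with i < j
    return float(((delta[i, j] - d[i, j]) ** 2).sum())
```

A perfect embedding (one that reproduces every \(\delta _{ij}\) exactly) yields zero stress; any distortion contributes quadratically.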

There are many strategies to solve MDS problems [8]. We focus our attention on the well-known SMACOF algorithm, which is based on a particular minimization process for the stress function [7]. The theoretical background of SMACOF is simpler and more powerful than that of other approaches from convex analysis, because it guarantees monotone convergence of the stress [2]. SMACOF has demonstrated better results when optimizing the stress function compared to other proposals in the literature [16]. The main idea is based on the majorization concept, which consists in approximating a complex function by a simpler one. The method iteratively finds a new function which lies above the original function and touches it at the supporting point. At every iteration of the algorithm, the minimum of the new function is closer to the minimum of the complex function, in our case the stress function [2]. SMACOF can be expressed by Algorithm 1, in which the complexity order of the most relevant tasks appears in parentheses.
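For the unweighted stress above, the majorization step reduces to the Guttman transform \(Y^{k+1} = \frac{1}{m} B^k Y^k\). The sketch below is our own illustrative NumPy version of that iteration, not the paper's C/CUDA implementation:

```python
import numpy as np

def smacof(delta, Y0, iters=100):
    """Unweighted SMACOF: repeated Guttman transforms Y <- (1/m) B(Y) Y."""
    m = len(delta)
    Y = Y0.astype(float).copy()
    for _ in range(iters):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=2))       # distances in low-dim space
        with np.errstate(divide="ignore", invalid="ignore"):
            B = np.where(d > 0, -delta / d, 0.0)   # off-diagonal entries of B
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))        # diagonal closes each row sum
        Y = B @ Y / m                              # Guttman transform, O(m^2 s)
    return Y
```

Each iteration recomputes the full distance matrix and the \(B\) matrix, which is where the \(O(m^2)\) per-iteration cost and memory footprint come from.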


Algorithm 1 has a high computational cost and high memory requirements due to the large data structures involved: the input matrix \(\varDelta \) (\(m \times m\)), the output and auxiliary matrices (\(m \times s\)), and three auxiliary matrices (\(m \times m\)) that store the similarities among the objects in the low-dimensional space. Symmetry has not been exploited in the storage of these data structures; however, it has been exploited to update the above-mentioned matrices. Bearing this in mind, the number of floating point operations of Algorithm 1 is \(\frac{3s}{2}m^2+\frac{3s}{2}m\) for the initialization (line 2 of Algorithm 1) and \(\left(\frac{7s}{2}+\frac{3}{2}\right)m^2+\frac{3s+1}{2}m\) for the iterative process.
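As a quick sanity check, these operation counts can be coded directly. The two helpers below are ours and simply evaluate the formulas above:

```python
def flops_init(m, s):
    """Floating point operations of the initialization (line 2 of Algorithm 1)."""
    return (3 * s / 2) * m**2 + (3 * s / 2) * m

def flops_iter(m, s):
    """Floating point operations of the iterative process of Algorithm 1."""
    return (7 * s / 2 + 3 / 2) * m**2 + ((3 * s + 1) / 2) * m
```

Both counts grow as \(m^2\) and only linearly in s, which matches the \(O(s \cdot m^2)\) cost stated in the next section.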

Parallel implementations of the SMACOF algorithm

The SMACOF computational cost is \(O(s \cdot m^2)\) and its memory requirements are \(O(m^2)\). For years, these requirements limited the applicability of SMACOF to large MDS problems. The use of HPC techniques helps to overcome this drawback. In this work, we propose two parallel versions based on the exploitation of large-scale modern multicore and GPU architectures. This section is devoted to describing these parallel implementations.

Both implementations focus on the parallel execution of the computation of the Euclidean distance matrices (lines 2 and 7 of Algorithm 1) and the Guttman transform (lines 5 and 6 of Algorithm 1). Parallel procedures are highlighted in bold in Algorithm 1. To calculate the outputs of these procedures, we have taken into account that we are working with symmetric matrices (\(B^k\), \(D^k\) and \(\varDelta \)). For example, to compute the symmetric matrix \(B^k\) (which defines the Guttman transform), it is only necessary to calculate a triangular sub-matrix of \(L=m (m + 1) / 2\) elements. Thus, \(B^k\) can be managed as a one-dimensional vector of L elements which can be updated in parallel. To distribute this computation among the processing elements, the left part of Algorithm 2 has been transformed into the right one. This way, the two nested loops are collapsed into a single regular loop which computes the triangular matrix of L elements and can be easily parallelized while maintaining load balance. This idea has also been applied to the parallel computation of \(D^k\).
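The loop collapse relies on mapping each linear index \(l \in [0, L)\) back to a pair \((i, j)\) of the triangular matrix, so that a single regular l-loop covers all L entries. The following Python sketch of this index arithmetic is ours (Algorithm 2 itself is written in C/OpenMP):

```python
from math import isqrt

def tri_index(l):
    """Map a linear index l to (i, j), with j <= i, over the lower triangle."""
    # i is the largest integer with i*(i+1)/2 <= l
    i = (isqrt(8 * l + 1) - 1) // 2
    j = l - i * (i + 1) // 2
    return i, j

def pairs(m):
    """Enumerate the L = m*(m+1)/2 entries of a symmetric m x m matrix."""
    L = m * (m + 1) // 2
    return [tri_index(l) for l in range(L)]  # this l-loop is what gets parallelized
```

Because every l maps to exactly one entry, distributing the l-loop in equal chunks gives each processing element the same number of matrix elements, which is the load-balance property exploited in the parallel versions.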

The multicore version has been implemented using C, OpenMP [3] and the MKL library [17]. The parallel computations of \(B^k\) and \(D^k\) consider the symmetry of these matrices. Therefore, Algorithm 2 on the right is taken as the reference for the parallel computation of \(B^k\). The l-loop of this algorithm is distributed among the cores, and a synchronization point after it ensures that the non-diagonal elements of \(B^k\) are computed before the parallel i-loop starts. Moreover, the MKL library (specifically the \(cblas\_dgemm\) routine) is in charge of computing in parallel the matrix-matrix product associated with the Guttman transform (line 6 of Algorithm 1).

In the GPU version, three kernels have been coded using C and CUDA to compute \(D^k\) (lines 2 and 7 of Algorithm 1) and \(B^k\) (line 5 of Algorithm 1) in parallel. The Euclidean distances require one kernel and the Guttman transform requires two, as explained below. To compute the distance matrix, every thread updates two symmetric elements of the \(D^k\) matrix. Moreover, shuffle instructions have been used for the reductions involved in the computation of the \(D^k\) elements. These instructions, available from the Kepler NVIDIA architecture onwards, essentially allow threads in the same warp to share information and can improve reduction processes [6]. In our experiments, shuffle instructions have been shown to improve the performance compared to reductions based on shared memory, and we have observed that their advantage over the shared memory version increases with s. Specifically, we have evaluated the performance for problem sizes from \(m=10{,}000\) to \(m=40{,}000\) with \(s=64\), and the shuffle version has obtained the same or better performance (up to 30\(\%\)) than the shared memory version in the computation of the \(D^k\) matrix (lines 2 and 7 of Algorithm 1).

The CUDA version of Algorithm 2 to compute \(B^k\) on the GPU consists of two kernels. In the first kernel, each thread starts by calculating a non-diagonal element of \(B^k\); next, its symmetric element is copied without requiring any synchronization. When this kernel finishes, the second one computes the diagonal elements from the non-diagonal ones. For \(Y^k\), the cublasDgemm routine from the cuBLAS library [5] has been used to accelerate the matrix-matrix product on the GPU (line 6 of Algorithm 1).

Tuning the energy efficiency of the SMACOF algorithm

In this work, two parallel implementations have been developed to accelerate the SMACOF algorithm. When solving real-world problems, it is reasonable to run the most energy-efficient parallel SMACOF version on a particular subset of the resources of the available computational platforms.

The idea consists of an initial benchmarking that identifies, for every available platform, the optimal selection of resources for a problem size of interest. Then, the user can choose the optimal platform for the subsequent execution of the SMACOF algorithm. According to the developed parallel versions, multicore processors and GPUs are considered as target platforms in this work.

The energy efficiency (EE) is usually defined as the ratio of the computational speed to the electrical power, that is GFLOPs/watt [19]. Therefore, for the optimal parallel executions of SMACOF, the ratio should be maximized.

The optimization of the EE of parallel applications on modern platforms can be viewed as a problem of scheduling parallel machines with costs [30]. The parallel SMACOF versions can be executed on one of the alternative platforms, for example different multicore or GPU architectures. Every platform is denoted by \(\mathcal {F}^k \in \mathcal {F}\), \(k=1,\ldots , f\), where \(\mathcal {F}\) is the set of f available parallel platforms. Every platform \(\mathcal {F}^k\) consists of a set of parallel machines \(\mathcal {M}^k\), \(\mathcal {F}^k=\{ \mathcal {M}^k_i \}_{i=1}^{c_k}\), where \(c_k\) is the number of available machines of platform k. The corresponding energy efficiency depends on the number of machines involved in the computation and on the particular input size.

Then, the solution of the scheduling problem corresponds to the subset of platforms \(\mathcal {F}^{k_o} \subseteq \mathcal {F} \) with their optimal configurations, defined by the number of machines \(r^o_{k_o}\) that optimizes EE (\(r^o_{k_o} \le c_{k_o}\)). We propose a heuristic approach for solving this problem. It is based on a functional model of EE for modern platforms (multicore and GPUs) and on the definition of the significant computation in the SMACOF algorithm.

Functional performance models were introduced by Lastovetsky [4, 33]. The processor performance depends on the problem size and can be empirically estimated by a benchmarking process. In this way, the performance model depends on the combination of the architecture and the application. In similar lines, other authors have focused on benchmarking and have proposed the concepts of application signature and small-scale executions [10, 31]. If the parallel application is iterative, then a subset of iterations can define the significant portion of the application and can be used in the benchmarking [20, 21].

The models to estimate EE have to combine performance and power. Previous works have proposed functional models for EE estimation of iterative applications [13]. If we focus on a particular execution of the application, with F floating point operations, on one homogeneous platform k, and assume a perfect load balance among the \(r_k\) active machines, then the following model of EE as a function of \(r_k\) is reasonable:

$$\begin{aligned} { EE}(r_k)=\frac{F}{\mathcal {T}^k(r_k) {{\mathcal {P}}}^k(r_k)}=\frac{F}{\left( {\mathcal {T}^k(1) \over r_k} + {\mathcal {TC}}^k(r_k)\right) \left( {\mathcal {P}}^k_{idle}+r_k p^k(r_k)\right) } \end{aligned}$$

where \(\mathcal {T}^k(r_k)\) and \({\mathcal {P}}^k(r_k)\) are the runtime and power consumption on \(r_k\) machines, respectively, \({\mathcal {TC}}^k(r_k)\) represents the runtime penalties due to contention among the active machines on platform k, \({\mathcal {P}}^k_{idle}\) represents the idle power consumption when no process is actively using any machine, and \(p^k(r_k)\) is the contribution of every machine to the power.

According to this model, \(\mathcal {T}^k(r_k)\) reaches one minimum at some number of active machines, since \({\mathcal {TC}}^k(r_k)\) is an increasing function, and \({\mathcal {P}}^k(r_k)\) is also increasing in \(r_k\). Therefore, \(EE(r_k)\) achieves a maximum at \(r^o_{k}\) machines.
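The model can be evaluated numerically to locate that maximum. In the sketch below, all parameter values and the simple contention and power functions are hypothetical, chosen only to exercise the model; real values would come from the benchmarking measurements:

```python
def ee(r, F, T1, tc, p_idle, p_core):
    """EE(r) = F / (T(r) * P(r)), with T(r) = T1/r + TC(r), P(r) = P_idle + r*p(r).

    tc(r) and p_core(r) model contention and per-machine power; here they are
    plain increasing functions standing in for measured behavior.
    """
    T = T1 / r + tc(r)
    P = p_idle + r * p_core(r)
    return F / (T * P)

def best_r(c, **kw):
    """Pick r in 1..c maximizing EE (the model has a single maximum)."""
    return max(range(1, c + 1), key=lambda r: ee(r, **kw))
```

With increasing tc and p_core, the product \(\mathcal {T}(r)\mathcal {P}(r)\) first falls (runtime dominates) and then rises (contention and power dominate), so the maximizer is typically interior rather than at \(r = c\).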

Then, from the point of view of SMACOF usage, to optimize EE, \(r^o_{k}\) should be identified on every available platform for the problem sizes of interest, and the platform \(k_o\) which optimizes EE, i.e., achieves \(EE(r^o_{k_o})\), should be chosen. Modern computers provide two different kinds of platform, multicore processors and GPUs, and the number of kinds of platform increases if clusters of heterogeneous nodes with several kinds of multicore and GPU platforms are available. We have defined a heuristic to decide which is the best platform to run particular instances of the parallel SMACOF. Our proposal is organized in two stages: first, the identification of the optimal configuration of every platform; and second, the selection of the optimal platforms and configurations. The previous considerations about the EE model help us define an efficient benchmarking exploration to find the optimal configuration on every platform. Therefore, the selective search described in Algorithm 3 can be used to find the optimal platforms and their configurations in the benchmarking process.

As mentioned above, the benchmarking is usually based on the execution of a significant core of the application. SMACOF consists of an iterative procedure that computes Guttman transforms. The computational cost of every iteration is the same; therefore, a subset of iterations can be considered as the SMACOF significant core to compute the profiling in an efficient way. SMACOF can then be configured using the information provided by this preprocessing, based on the exploration of several resource selections for particular combinations of platforms and data sizes.
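Since Algorithm 3 is not reproduced in this excerpt, the following Python sketch only captures the spirit of the two-stage heuristic: time a few SMACOF iterations per configuration, start from the largest configuration of each platform, and stop descending once EE no longer improves. All names here (tune, run_sample) are ours:

```python
def tune(platforms, run_sample, step=1):
    """Two-stage heuristic in the spirit of Algorithm 3 (the exact sampling
    policy of the paper is not reproduced here).

    platforms  : dict mapping platform name -> number of available machines c_k.
    run_sample : callable (name, r) -> EE measured over a few SMACOF iterations.
    """
    best = {}
    for name, c in platforms.items():
        # Stage 1: start from the largest configuration and walk down while
        # EE keeps improving (the EE model has a single maximum).
        r, e = c, run_sample(name, c)
        while r - step >= 1:
            e2 = run_sample(name, r - step)
            if e2 <= e:
                break
            r, e = r - step, e2
        best[name] = (r, e)
    # Stage 2: pick the platform whose optimal configuration maximizes EE.
    winner = max(best, key=lambda n: best[n][1])
    return winner, best
```

Because the EE curve is unimodal, the descent from the largest configuration stops after very few samples when the optimum sits at or near the maximum resource count, which matches the behavior reported for T11 in Sect. 5.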


Experimental evaluation

In this section, the SMACOF algorithm to solve MDS problems is evaluated in terms of runtime and energy efficiency on three computational architectures:

\(\mathcal {F}_1\): Bullion S8: 4 Intel Xeon E7 8860v3 (16 \(\times \) 4 CPU-cores);

\(\mathcal {F}_2\): Bullx R421-E4 Intel Xeon E5 2620v2 (12 CPU-cores and 64 GB RAM);

\(\mathcal {F}_3\): NVIDIA K80 (composed of two Kepler GK210 GPUs) connected to the host Bullx R421-E4 Intel Xeon E5 2620v2.

\(\mathcal {F}_1\), \(\mathcal {F}_2\) and \(\mathcal {F}_3\) run Ubuntu 16.04 LTS, and \(\mathcal {F}_3\) additionally runs CUDA Toolkit 8. The programs have been compiled using gcc 5.4.0 and nvcc 8.0.44 with the O3 optimization flag, targeting GPU architecture 3.5. Energy measurements have been collected from hardware counters: the Running Average Power Limit (RAPL) interface for Intel and the NVIDIA Management Library (NVML) for NVIDIA.

For the evaluation of SMACOF, test problems of different sizes defined by the values of m, n and s have been considered (see Table 1). For this experimental investigation, randomly generated input data were used. The number of evaluated iterations of the SMACOF algorithm was set to 100.

Table 1 Test problems using several number of items (m), dimensions of multidimensional space (n), and dimensions of low-dimensional space (s)
Fig. 1 Runtime (top), power (middle) and energy efficiency (bottom) of the set of test problems (Table 1) on \(\mathcal {F}_1\) and \(\mathcal {F}_2\) platforms

Figure 1 shows the runtime, power and energy efficiency of the set of test problems on the \(\mathcal {F}_1\) and \(\mathcal {F}_2\) (multicore) platforms, and Table 2 shows the same parameters on the \(\mathcal {F}_3\) (GPU) platform for the same test cases. The execution times of the multicore versions (plotted at the top of Fig. 1) agree with the runtime models described in Sect. 4. The runtime decreases with the values of \(r_1\) and \(r_2\); therefore, the best performance is achieved with the maximum number of cores. The experimental power measurements are plotted in the middle of Fig. 1. It is remarkable that the temporal evolution of the power partially depends on factors that are unpredictable for programmers. To overcome this drawback, it has been necessary to collect the measurements after an activity period on the processor, to minimize their variance due to changes in temperature. This instability can be observed in the power plots for both platforms, but we can conclude that power consumption tends to increase with the number of cores and the size of the problem.

Focusing our attention on the energy efficiency (plotted at the bottom of Fig. 1), it increases with the number of cores. The highest values of \(r_1\) and \(r_2\) optimize the energy efficiency, as shown by the plateau in the plot. Therefore, the optimal value of \(r_k\) on both platforms lies in a wide interval, for instance 32–64 (10–12) for \(\mathcal {F}_1\) (\(\mathcal {F}_2\)).

To choose the optimal platform, we could compare the three platforms in terms of performance. This way, the best option for T11 is \(\mathcal {F}_1\), since the execution times are 46.6, 96.2 and 91.5 s on \(\mathcal {F}_1\) with \(r^o_1=64\), \(\mathcal {F}_2\) with \(r^o_2=12\) and \(\mathcal {F}_3\), respectively. This selection is the same for all test cases. If we focus on the energy efficiency, the best option is the GPU when the problem size is high enough, since it consumes less power than \(\mathcal {F}_1\) and achieves a reasonable performance. Thus, to optimize the energy efficiency, the best option is the GPU platform for the test cases \(T4-T11\). For instance, for T11, the energy efficiencies on the different platforms are 85.5, 176.1 and 196.6 GFLOPs/watt for \(\mathcal {F}_1\), \(\mathcal {F}_2\) and \(\mathcal {F}_3\), respectively. The best platform for \(T1-T3\) is the multicore \(\mathcal {F}_2\), which consumes less power than \(\mathcal {F}_1\).

Table 2 Runtime, power and energy efficiency of the set of test problems (Table 1) on \(\mathcal {F}_3\) (GPU) platform

These results support the benchmarking process explained in Sect. 4, which automatically explores the selection of the optimal parallel platform and its best resource selection. This procedure has been developed in Python. We have chosen \(\hbox {sampling}=3\) to obtain relevant differences between successive experimental evaluations on both platforms. The results support the idea that starting the benchmarking process from the highest number of CPU-cores available on every platform is an efficient way to find the optimal \(r_k\). To illustrate the behavior of the benchmarking (Algorithm 3) on multicore platforms, we focus on the T11 test. Table 3 shows the EE obtained when a set of ten iterations of SMACOF is executed on platforms \(\mathcal {F}_1\) and \(\mathcal {F}_2\). Only two samples of the benchmarking exploration are required for T11, since \(r^o_1=64\) and \(r^o_2=12\) are identified by the preprocessing. We can thus conclude that the proposed benchmarking performs an efficient exploration to optimize the energy efficiency of parallel SMACOF.

Table 3 Sampling of EE for T11 test according to the benchmarking proposed in Algorithm 3 for multicore platforms \(\mathcal {F}_1\) and \(\mathcal {F}_2\)


Conclusions

This work has analyzed an approach to optimize the energy efficiency (GFLOPs/watt) of the SMACOF algorithm, a well-known and precise method for solving MDS problems. Two parallel versions of SMACOF, multicore and GPU, have been developed and evaluated. To help the user of the SMACOF parallel codes, we provide these versions together with a complementary Python code based on a heuristic approach that explores the optimal configuration on the available platforms.

An experimental evaluation has been carried out on three platforms based on architectures with 64 CPU-cores, 12 CPU-cores and a GPU device. The results show that the 64-core processor is the best platform to optimize the runtime of SMACOF; the 12-core processor is the best option to improve the energy efficiency for the smallest test problems; and, for the largest test problems, the optimal energy efficiency is achieved on the GPU.

In currently known parallel versions of SMACOF, only the runtime is considered; neither the energy consumption nor the capability to adapt to the platform and problem size is optimized. Therefore, our SMACOF implementation is of great interest for developing energy-efficiency-aware applications based on MDS problems. Our implemented versions of the SMACOF algorithm are freely available through the following website: As future work, we consider implementing a distributed parallel version of SMACOF and analyzing and developing other methods for solving MDS problems.


References

1. Bilsky W, Borg I, Wetzels P (1994) Assessing conflict tactics in close relationships: a reanalysis of a research instrument. In: Hox JJ, Mellenbergh GJ, Swanborn PG (eds) Facet theory: analysis and design. SETOS, Zeist, pp 39–46

2. Borg I, Groenen PJ (2005) Modern multidimensional scaling: theory and applications. Springer, Berlin

3. Chapman B, Jost G, van der Pas R (2007) Using OpenMP: portable shared memory parallel programming (scientific and engineering computation). The MIT Press, Cambridge

4. Clarke D, Ilic A, Lastovetsky A, Rychkov V, Sousa L, Zhong Z (2014) Design and optimization of scientific applications for highly heterogeneous and hierarchical HPC platforms using functional computation performance models. Wiley, Hoboken, pp 235–260

5. cuBLAS library (2017) Accessed 24 Feb 2018

6. CUDA Pro Tip: Do The Kepler Shuffle (2017) Accessed 24 Feb 2018

7. De Leeuw J (1977) Applications of convex analysis to multidimensional scaling. In: Recent developments in statistics. North Holland Publishing Company, pp 133–145

8. Dzemyda G, Kurasova O, Žilinskas J (2013) Multidimensional data visualization: methods and applications, vol 75. Springer, Berlin

9. Dzwinel W, Blasiak J (1999) Method of particles in visual clustering of multi-dimensional and large data sets. Future Gener Comput Syst 15(3):365–379

10. Escobar R, Boppana RV (2016) Performance prediction of parallel applications based on small-scale executions. In: 2016 IEEE 23rd HiPC, pp 362–371

11. Fester T, Schreiber F, Strickert M (2009) CUDA-based multi-core implementation of MDS-based bioinformatics algorithms. In: Grosse I, Neumann S, Posch S, Schreiber F, Stadler PF (eds) GCB, LNI, vol 157. GI, Bonn, pp 67–79

12. Filatovas E, Podkopaev D, Kurasova O (2015) A visualization technique for accessing solution pool in interactive methods of multiobjective optimization. Int J Comput Commun Control 10:508–519

13. Garzón EM, Moreno JJ, Martínez JA (2017) An approach to optimise the energy efficiency of iterative computation on integrated GPU–CPU systems. J Supercomput 73(1):114–125

14. Goldberger J, Gordon S, Greenspan H (2003) An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures. In: ICCV. IEEE Computer Society, pp 487–493

15. Hout MC, Goldinger SD, Brady KJ (2014) MM-MDS: a multidimensional scaling database with similarity ratings for 240 object categories from the massive memory picture database. PLoS ONE 9(11):1–11

16. Ingram S, Munzner T, Olano M (2009) Glimmer: multilevel MDS on the GPU. IEEE Trans Vis Comput Gr 15(2):249–261

17. Intel Math Kernel Library (documentation) (2017) Accessed 24 Feb 2018

18. Kurasova O, Petkus T, Filatovas E (2013) Visualization of Pareto front points when solving multi-objective optimization problems. Inf Technol Control 42(4):353–361

19. Leng J et al (2013) GPUWattch: enabling energy optimizations in GPGPUs. SIGARCH Comput Archit News 41(3):487–498

20. Martínez JA, Almeida F, Garzón EM, Acosta A, Blanco V (2011) Adaptive load balancing of iterative computation on heterogeneous nondedicated systems. J Supercomput 58(3):385–393

21. Martínez JA, Garzón EM, Plaza A, García I (2011) Automatic tuning of iterative computation on heterogeneous multiprocessors with ADITHE. J Supercomput 58(2):151–159

22. Medvedev V, Kurasova O, Bernatavičienė J, Treigys P, Marcinkevičius V, Dzemyda G (2017) A new web-based solution for modelling data mining processes. Simul Model Pract Theory 76:34–46

23. Morrison A, Ross G, Chalmers M (2003) Fast multidimensional scaling through sampling, springs and interpolation. Inf Vis 2(1):68–77

24. Orts F, Filatovas E, Ortega G, Kurasova O, Garzón EM (2017) HPC tool for multidimensional scaling. In: Vigo-Aguiar J (ed) Proceedings of the 17th International Conference on Computational and Mathematical Methods in Science and Engineering, vol 5, pp 1611–1614

25. Osipyan H, Morton A, Marchand-Maillet S (2014) Fast interactive information retrieval with sampling-based MDS on GPU architectures. In: Information Retrieval Facility Conference. Springer, Cham, pp 96–107

26. Papenhausen E, Wang B, Ha S, Zelenyuk A, Imre D, Mueller K (2013) GPU-accelerated incremental correlation clustering of large data with visual feedback. In: Proceedings of the 2013 IEEE International Conference on Big Data, Santa Clara, CA, USA, pp 63–70

27. Park S, Shin SY, Hwang KB (2012) CFMDS: CUDA-based fast multidimensional scaling for genome-scale data. BMC Bioinform 13(17):S23

28. Pawliczek P, Dzwinel W, Yuen DA (2014) Visual exploration of data by using multidimensional scaling on multicore CPU, GPU, and MPI cluster. Concurr Comput 26(3):662–682

29. Qiu J, Bae SH (2012) Performance of windows multicore systems on threading and MPI. Concurr Comput Pract Exp 24(1):14–28

30. Shmoys DB, Tardos E (1993) An approximation algorithm for the generalized assignment problem. Math Program 62(3):461–474

31. Wong A, Rexachs D, Luque E (2015) Parallel application signature for performance analysis and prediction. IEEE Trans Parallel Distrib Syst 26(7):2009–2019

32. Yang T, Liu J, McMillan L, Wang W (2006) A fast approximation to multidimensional scaling. In: IEEE Workshop on Computation Intensive Methods for Computer Vision

33. Zhong Z, Rychkov V, Lastovetsky A (2014) Data partitioning on multicore and multi-GPU platforms using functional performance models. IEEE Trans Comput 64(9):2506–2518


Correspondence to G. Ortega.

This work has been partially supported by the Spanish Ministry of Science through Projects TIN2015-66680 and the CAPAP-H5 network TIN2014-53522, by J. Andalucía through Projects P12-TIC-301 and P11-TIC7176, by the European Regional Development Fund (ERDF), and by the European COST Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).


Orts, F., Filatovas, E., Ortega, G. et al. Improving the energy efficiency of SMACOF for multidimensional scaling on modern architectures. J Supercomput 75, 1038–1050 (2019).



  • Dimensionality reduction
  • Multidimensional scaling
  • Energy efficiency
  • SMACOF algorithm