## Abstract

Specialized hardware for machine learning allows us to train highly accurate models in hours that would otherwise take days or months of computation time. The success of recent deep learning techniques can largely be explained by the fact that their training and inference rely heavily on fast matrix algebra, which can easily be accelerated via programmable graphics processing units (GPUs). Thus, vendors praise the GPU as *the* hardware for machine learning. However, these accelerators have an energy consumption of several hundred watts. In distributed learning, each node has to meet resource constraints that are far stricter than those of an ordinary workstation, especially when learning is performed at the edge, i.e., close to the data source. There, energy consumption is typically highly restricted, and relying on high-end CPUs and GPUs is thus not a viable option. In this work, we present our new quantum-inspired machine learning hardware accelerator. More precisely, we explain how our hardware approximates the solution to several NP-hard data mining and machine learning problems, including *k*-means clustering, maximum-a-posteriori prediction, and binary support vector machine learning. Our device has a worst-case energy consumption of about 1.5 W and is thus especially well suited for distributed learning at the edge.


## 1 Introduction

Hardware acceleration for machine learning usually refers to GPU implementations that perform fast linear algebra to speed up numerical computations. This, however, rests on two implicit assumptions: (1) the learning problem actually benefits from fast linear algebra, i.e., the most complex parts of learning and inference can be phrased in the language of matrix-vector calculus; and (2) learning is carried out in an environment where energy supply, size, and weight of the system are mostly unrestricted. The latter assumption is violated when learning has to be carried out at the edge, that is, on the device that actually measures the data.

Especially in distributed or federated learning settings, edge devices are subject to strong resource constraints. Communication efficiency [6] and computational burden [3, 7] must be reduced in order to cope with the available hardware. One way to address these issues is to use efficient decentralized learning schemes [5]. However, the resource consumption of state-of-the-art hardware accelerators is often out of reach for edge devices.

We thus present a novel hardware architecture that can be used as a solver at the core of various data mining and machine learning techniques, some of which we will explain in the following sections. Our work is inspired by the so-called *quantum annealer*, a hardware architecture for solving discrete optimization problems by exploiting quantum mechanical effects. In contrast to GPUs and quantum annealers, our device has a highly reduced resource consumption. The power consumption of four machine learning accelerators is shown in Fig. 1. For the CPU and GPU, we provide the thermal design power (TDP), whereas the FPGA’s value is the total on-chip power, calculated using the Vivado Design Suite^{Footnote 1}. We see that the actual peak consumption of CPUs and GPUs exceeds the energy consumption of our device by several orders of magnitude. Moreover, we provide the estimated energy consumption of the D-Wave 2000Q quantum annealer. The annealer optimizes the exact same objective function as our device, but the cooling and magnetic shielding required for its operation lead to an enormous energy consumption, which is very impractical for real applications at its current stage. Hence, its low resource requirements and versatility make our device the ideal hardware accelerator for distributed learning at the edge.

Our approach is different to GPU programming in that our hardware is designed to solve a fixed class \(\mathcal {C}\) of parametric optimization problems. “Programming” our device is then realized by reducing a learning problem to a member of \(\mathcal {C}\) and transferring only the resulting coefficients \(\beta \). The optimization step, e.g. model training, is performed entirely on the board, without any additional communication cost.

In our demo setup (shown in Fig. 2) we will showcase several machine learning tasks in a live setting on multiple devices, accompanied by live visualizations of the learning progress and the results.

The underlying idea of using a non-universal compute architecture for machine learning is indeed not new: State-of-the-art quantum annealers rely on the very same problem formulation. There, optimization problems are encoded as potential energy between *qubits* – the global minimum of a loss function can be interpreted as the quantum state of lowest energy [4]. The fundamentally non-deterministic nature of quantum methods makes the daunting task of traversing an exponentially large solution space feasible. However, their practical implementation is a persisting challenge, and the development of actual quantum hardware is still in its infancy. The latest flagship, the *D-Wave 2000Q*, can handle problems with 64 fully connected bits^{Footnote 2}, which is by far not sufficient for realistic problem sizes.

Nevertheless, the particular class of optimization problems that quantum annealers can solve is well understood which motivates its use for hardware accelerators outside of the quantum world.

## 2 Boolean Optimization

A *pseudo-Boolean function* (PBF) is any function \(f:\mathbb {B}^n\rightarrow \mathbb {R}\) that assigns a real value to a fixed-length binary vector. Every PBF on *n* binary variables can be uniquely expressed as a polynomial of some degree \(d\le n\) with real-valued coefficients [2]. *Quadratic Unconstrained Binary Optimization* (QUBO) is the problem of finding an assignment \(\varvec{x}^*\) of *n* binary variables that is minimal with respect to a second-degree Boolean polynomial:

$$\varvec{x}^* = \mathop {\mathrm {arg\,min}}\limits _{\varvec{x}\in \mathbb {B}^n}\; \sum _{1\le i\le j\le n} \beta _{ij}\, x_i x_j.$$
It has been shown that all higher-degree pseudo-Boolean optimization problems can be reduced to quadratic problems [2]. For this reason, a variety of well-known optimization problems like (Max-)3SAT and prime factorization, but also ML-related problems like clustering, maximum-a-posteriori (MAP) estimation in Markov Random Fields, and binary constrained SVM learning can be reduced to QUBO or its *Ising* variant (where \(\varvec{x}\in \{-1,+1\}^n\)). In our demo, we will explain the impact of different reductions in terms of runtime and quality of different learning tasks.
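For intuition, a QUBO instance is fully specified by an upper-triangular coefficient matrix, and for small *n* a minimizing assignment can be found by exhaustive search. The following sketch (a software illustration only, not the device's solver) evaluates the quadratic objective and brute-forces the minimum:

```python
import itertools
import numpy as np

def qubo_value(Q, x):
    """Evaluate the QUBO objective x^T Q x for a binary vector x."""
    x = np.asarray(x)
    return float(x @ Q @ x)

def qubo_brute_force(Q):
    """Exhaustively find a minimizing assignment (feasible only for small n)."""
    n = Q.shape[0]
    best_x, best_val = None, float("inf")
    for bits in itertools.product((0, 1), repeat=n):
        val = qubo_value(Q, bits)
        if val < best_val:
            best_x, best_val = bits, val
    return best_x, best_val

# Tiny example with coefficients beta_ij in the upper triangle:
# the minimum is attained at x = (1, 0) with value -2.
Q = np.array([[-2.0, 3.0],
              [0.0, -1.0]])
x_star, v = qubo_brute_force(Q)
```

Since the search space grows as \(2^n\), this brute-force approach is hopeless beyond a few dozen variables, which is exactly why heuristic solvers such as the one described in the next section are needed.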

## 3 Evolutionary QUBO Solver

If no specialized algorithm is known for a particular hard combinatorial optimization problem, randomized search heuristics, like simulated annealing or *evolutionary algorithms* (EA), provide a generic way to generate good solutions.

Inspired by biological evolution, EAs employ recombination and mutation on a set of “parent” solutions to produce a set of “offspring” solutions. A loss function, also called *fitness function* in the EA-context, is used to select those solutions which will constitute the next parent generation. This process is repeated until convergence or a pre-specified time-budget is exhausted [8].

Motivated by the inherently parallel nature of digital circuits, we developed a highly customizable \((\mu +\lambda )\)-EA architecture on FPGA hardware, implemented using the VHDL language^{Footnote 3}. Here, “customizable” means that different types of FPGA hardware, from small to large, can be used. This is achieved by allowing the end user to customize the maximal problem dimension *n*, the number of parent solutions \(\mu \), the number of offspring solutions \(\lambda \), and the number of bits per coefficient \(\beta _{ij}\). In the case of low-budget FPGAs, this allows us to allocate more FPGA resources either for parallel computation (\(\mu \) and \(\lambda \)) or for the problem size (*n* and \(\beta \)). We will show how to generate and run chip designs for low-budget as well as high-end devices in our demo.
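The \((\mu +\lambda )\) scheme described above can be sketched in software as follows. This is a minimal Python illustration of the algorithmic idea, not the hardware implementation; the crossover and mutation operators shown (uniform crossover, per-bit flip with rate \(1/n\)) are common textbook choices and are assumptions on our part:

```python
import numpy as np

rng = np.random.default_rng(0)

def evolve_qubo(Q, mu=4, lam=8, generations=200, p_mut=None):
    """Software sketch of a (mu+lambda)-EA minimizing x^T Q x over binary x.

    Offspring are created by uniform crossover of two random parents followed
    by bit-flip mutation; the best mu of parents+offspring survive.
    """
    n = Q.shape[0]
    if p_mut is None:
        p_mut = 1.0 / n  # on average one bit flip per offspring
    fitness = lambda x: x @ Q @ x
    parents = rng.integers(0, 2, size=(mu, n))
    for _ in range(generations):
        offspring = np.empty((lam, n), dtype=parents.dtype)
        for i in range(lam):
            a, b = parents[rng.integers(mu)], parents[rng.integers(mu)]
            mask = rng.integers(0, 2, size=n).astype(bool)  # uniform crossover
            child = np.where(mask, a, b)
            flip = rng.random(n) < p_mut                    # bit-flip mutation
            offspring[i] = np.where(flip, 1 - child, child)
        pool = np.vstack([parents, offspring])
        order = np.argsort([fitness(x) for x in pool])
        parents = pool[order[:mu]]                          # (mu+lambda) selection
    best = parents[0]
    return best, float(fitness(best))
```

On the FPGA, the \(\lambda \) offspring evaluations run in parallel in hardware, which is where the architecture gains its efficiency over a sequential loop like the one above.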

## 4 Exemplary Learning Tasks

“Programming” our device reduces to determining the corresponding coefficients \(\beta \) and uploading them to the FPGA via our Python interface. We will explain how this is done for various machine learning tasks; two of them are described below.

A prototypical data mining problem is *k*-means clustering, which is already NP-hard for \(k=2\). To derive \(\beta \) for a 2-means clustering problem, we use the method devised in [1], where each bit indicates whether the corresponding data point belongs to cluster 1 or cluster 2—the problem dimension is thus \(n=|\mathcal {D}|\). The coefficients are then derived from the centered Gramian \(\varvec{G}\) over the mean-adjusted data. To keep as much precision as possible, we stretch the parameters to use the full range of *b* bits before rounding, so the final formula is \(\beta _{ij}=\lfloor \alpha \varvec{G}_{ij}+0.5\rfloor \text { with }\alpha =(2^{b-1}-1)/\max _{i,j}|\varvec{G}_{ij}|\). Exemplary results on the UCI data sets *Iris* and *Sonar* are shown in Fig. 3 (top).
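The centering and quantization steps above translate directly into code. The following sketch shows only those two steps (computing the centered Gram matrix and stretching it to *b* signed bits); the full QUBO construction of [1] builds on \(\varvec{G}\) in ways not repeated here:

```python
import numpy as np

def quantized_coefficients(X, b=8):
    """Center the data, form the Gram matrix G = X_c X_c^T, and quantize it.

    Implements beta_ij = floor(alpha * G_ij + 0.5) with
    alpha = (2^(b-1) - 1) / max_ij |G_ij|, as in the text.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)          # mean-adjusted data
    G = Xc @ Xc.T                    # centered Gramian; n = |D| rows/cols
    alpha = (2 ** (b - 1) - 1) / np.abs(G).max()
    beta = np.floor(alpha * G + 0.5).astype(int)
    return beta
```

After this step, every coefficient fits into *b* bits, so the matrix can be transferred to the FPGA regardless of the scale of the original data.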

Another typical NP-hard ML problem is to determine the most likely configuration of variables in a Markov Random Field, known as the MAP prediction problem [9]. Similar to efficiency differences between programs for classical universal computers, providing a different QUBO problem encoding has implications for the efficiency of our device. To demonstrate this effect, we will perform a live comparison of different encodings in terms of convergence behavior.

One possible solution for the MRF-MAP problem is to encode the assignments of all \(X_v\) as a concatenation of one-hot encodings

$$\varvec{x} = \big (x_{1=x^1_1},\dots ,x_{1=x^{|\mathcal {X}_1|}_1},\;\dots ,\;x_{m=x^1_m},\dots ,x_{m=x^{|\mathcal {X}_m|}_m}\big ),$$

where \(m=|V|\) and \(x^{i}_k\) is the *i*-th value in \(\mathcal {X}_k\). The weights \(-\theta _{uv=xy}\) are encoded into the quadratic coefficients; if two different bits belong to the same variable encoding, a penalty weight is added between them to maintain a valid one-hot encoding. The negative sign is added to turn MAP into a minimization problem.

For a different possible solution, we may assign bits \(b_{uv=xy}\) to all non-zero weights \(\theta _{uv=xy}\) between specific values *x*, *y* of two variables \(X_u, X_v\), indicating that \(X_u=x\) and \(X_v=y\). Again, to avoid multiple assignments of the same variable we introduce penalty weights between pairs of edges. We can see in Fig. 3 (bottom) that both approaches lead to a different convergence behavior.
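The first (one-hot) encoding can be sketched as follows. This is an assumption-laden illustration, not the exact construction used on the device: in particular, we realize the one-hot constraint via the common penalty term \(P\,(\sum _i x_i - 1)^2\) per variable block, expanded into diagonal and pairwise QUBO coefficients:

```python
import numpy as np

def onehot_map_qubo(domains, theta_unary, theta_pair, P=10.0):
    """Sketch of a one-hot QUBO encoding for MRF-MAP.

    domains     : list with the domain size |X_v| of each variable
    theta_unary : dict (v, x) -> weight theta_{v=x}
    theta_pair  : dict (u, x, v, y) -> weight theta_{uv=xy}, with u < v
    P           : penalty strength enforcing exactly one bit per variable

    Bits are laid out as the concatenation of one-hot blocks; weights enter
    negated so that MAP becomes a minimization problem.
    """
    offset = np.cumsum([0] + list(domains))   # start index of each block
    n = int(offset[-1])
    Q = np.zeros((n, n))
    for (v, x), w in theta_unary.items():
        i = offset[v] + x
        Q[i, i] -= w                          # negated unary weight
    for (u, x, v, y), w in theta_pair.items():
        i, j = offset[u] + x, offset[v] + y
        Q[i, j] -= w                          # negated pairwise weight
    for v, d in enumerate(domains):           # expand P * (sum_i x_i - 1)^2
        for a in range(d):
            i = offset[v] + a
            Q[i, i] -= P                      # linear part of the penalty
            for c in range(a + 1, d):
                Q[i, offset[v] + c] += 2 * P  # pairwise one-hot penalty
    return Q
```

The second encoding from the text, which assigns bits to non-zero pairwise weights instead of variable states, would produce a different (often smaller but denser) matrix for the same MRF, which is precisely the source of the differing convergence behavior shown in Fig. 3 (bottom).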

In addition to *k*-means and MRF-MAP, the demo will include binary SVM learning, binary MRF parameter learning, and others.

A video demonstrating how to use our system by solving a clustering problem can be found here: https://youtu.be/Xj5xx-eO1Mk.

## References

1. Bauckhage, C., Ojeda, C., Sifa, R., Wrobel, S.: Adiabatic quantum computing for kernel k=2 means clustering. In: Proceedings of the LWDA 2018, pp. 21–32 (2018)

2. Boros, E., Hammer, P.L.: Pseudo-Boolean optimization. Discret. Appl. Math. **123**(1–3), 155–225 (2002)

3. Caldas, S., Konecný, J., McMahan, H.B., Talwalkar, A.: Expanding the reach of federated learning by reducing client resource requirements. CoRR abs/1812.07210 (2018). http://arxiv.org/abs/1812.07210

4. Kadowaki, T., Nishimori, H.: Quantum annealing in the transverse Ising model. Phys. Rev. E **58**(5), 5355 (1998)

5. Kamp, M., et al.: Efficient decentralized deep learning by dynamic model averaging. In: Berlingerio, M., Bonchi, F., Gärtner, T., Hurley, N., Ifrim, G. (eds.) ECML PKDD 2018. LNCS (LNAI), vol. 11051, pp. 393–409. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-10925-7_24

6. Konecný, J., McMahan, H.B., Yu, F.X., Richtárik, P., Suresh, A.T., Bacon, D.: Federated learning: strategies for improving communication efficiency. CoRR abs/1610.05492 (2016). http://arxiv.org/abs/1610.05492

7. Piatkowski, N., Lee, S., Morik, K.: Integer undirected graphical models for resource-constrained systems. Neurocomputing **173**, 9–23 (2016)

8. Schwefel, H.-P., Rudolph, G.: Contemporary evolution strategies. In: Morán, F., Moreno, A., Merelo, J.J., Chacón, P. (eds.) ECAL 1995. LNCS, vol. 929, pp. 891–907. Springer, Heidelberg (1995). https://doi.org/10.1007/3-540-59496-5_351

9. Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. **1**(1–2), 1–305 (2008)


## Copyright information

© 2020 Springer Nature Switzerland AG

## About this paper

### Cite this paper

Mücke, S., Piatkowski, N., Morik, K. (2020). Hardware Acceleration of Machine Learning Beyond Linear Algebra. In: Cellier, P., Driessens, K. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Communications in Computer and Information Science, vol 1167. Springer, Cham. https://doi.org/10.1007/978-3-030-43823-4_29


Print ISBN: 978-3-030-43822-7

Online ISBN: 978-3-030-43823-4
