1 Introduction

GENO makes state-of-the-art performance in solving optimization problems easily accessible. Since optimization problems are ubiquitous in science, engineering and economics, it is not surprising that they come in many different flavors. Traditionally, a main distinction is made between discrete and continuous optimization problems. The focus of GENO is on the continuous case. Prominent classes of continuous optimization problems are linear programs (LPs), quadratic programs (QPs), second-order cone programs (SOCPs), and semi-definite programs (SDPs). For these classes, efficient algorithms and well-engineered implementations (solvers) have existed for many years. The solvers are typically called from a programming environment. The optimization problems’ data are passed to the solver through function calls. It is the responsibility of the programmer to provide the data in the right format, that is, compliant with a standard form for the specific problem class. The burden of reformulating the problems in standard form is alleviated by modeling languages that transform a problem specification into standard form. Popular modeling languages are CVX [12, 17] for MATLAB and its Python extension CVXPY [2, 13], Pyomo [19, 20] for Python, and JuMP [14], which is bound to Julia. These languages take an instance of an optimization problem and transform it into some standard form of an LP, QP, SOCP, or SDP, respectively. The transformed problem is then passed to a solver that expects the standard form. However, the transformation can be computationally inefficient, because the representation in standard form can be large in terms of the problem size. Also, the solver is called from within the programming environment only for the given problem instance. The modeling language plus solver approach has been made deployable in the CVXGEN [31], QPgen [16], and OSQP [5] projects. In these projects code is generated for the specified problem class and not just for one problem instance. However, the problem dimensions need to be fixed and the generated code is optimized only for very small or sparse problems. There also exist implementations of the modeling language plus solver approach that are independent of a specific programming environment. Prominent examples are AMPL [15] and GAMS [8], which are popular in the operations research community.

GENO differs from previous work by a much tighter coupling of the language and the solver. GENO does not transform problem instances but whole problem classes, including constrained problems, into a very general standard form. Since the standard form is independent of any specific problem instance it does not grow for larger instances. Hence, the generated solvers can be used like hand-written solvers. They even reach or surpass the efficiency of hand-written solvers for large dense problems. Typically, they are orders of magnitude faster than state-of-the-art modeling language plus solver approaches.

In this article, which is based on the original publications [24, 25, 28, 29], we describe the full GENO software stack. The tight coupling of modeling language and solver is achieved in GENO by computing symbolic gradients that are evaluated by the solver on the given data of the optimization problem. Hence, an important part of GENO’s software stack is a facility for computing derivatives of linear algebra expressions. GENO’s modeling language allows the user to specify whole classes of optimization problems in terms of the objective function and constraints that are given as vectorized linear algebra expressions. Neither the objective function nor the constraints need to be differentiable. Non-differentiable problems are transformed into constrained, differentiable problems. A general purpose solver for constrained, differentiable problems is then instantiated with the objective function, the constraint functions and their respective gradients. Using vectorized linear algebra allows a direct mapping onto optimized implementations of BLAS (Basic Linear Algebra Subprograms) routines. BLAS and its close relative LAPACK [3] are the de facto standard for the language-independent, high-performance evaluation of linear algebra expressions. Almost all major hardware vendors provide their own BLAS implementations for their particular hardware, including CPUs (AMD Blis [43], Intel MKL [10], Arm Performance Libraries [30]) and GPUs (NVIDIA cuBLAS [11], AMD clBLAS [4]). GENO supports different hardware platforms through the autoBLAS precompiler that translates linear algebra expressions into optimized BLAS library calls for the addressed hardware.

The GENO software stack comprises a modeling language (Sect. 2), a generic solver (Sect. 3), a matrix and tensor calculus (Sect. 4), and an automatic mapping to BLAS (Sect. 5). The latter three components of GENO’s software stack are of interest in a broader context than GENO and hence have been made available independently. GENO is available at [27].

2 Modeling Language

GENO’s modeling language uses a MATLAB-like syntax for specifying optimization problems. MATLAB is a platform for numerical computations using matrices. The advantages of using matrix expressions are two-fold: First, they allow the user to phrase an optimization problem without specifying the number of variables or the number of constraints. Hence, the generated solver is not tied to a specific instance but can handle arbitrary-sized problems. Second, they enable direct mappings to BLAS routines that are much more efficient than the corresponding for-loops.

A specification in GENO has four blocks:

  1. Declaration of the problem parameters that can be of type Matrix, Vector, or Scalar,

  2. declaration of the optimization variables that also can be of type Matrix, Vector, or Scalar,

  3. specification of the objective function, and finally

  4. specification of the constraints, also in a MATLAB-like syntax that supports the following operators and functions: +, -, *, /, .*, ./, \(^\wedge \), \(.^\wedge \), \('\), log, exp, sin, cos, tanh, abs, norm1, norm2, sum, tr, det, inv.

See Fig. 1 for some illustrative examples.

Fig. 1. A few optimization problems formulated in the GENO modeling language. The problem on the left is an unconstrained optimization problem that computes the Rayleigh quotient, the problem in the middle is the non-negative least squares problem, and the problem on the right shows an \(\ell _1\)-norm minimization problem from the domain of compressed sensing over the unit simplex.

GENO’s modeling language also allows the specification of non-smooth optimization problems, for instance, problems that employ the norm1 function, that is, the non-smooth \(\ell _1\)-norm. The non-smooth optimization problems that are allowed by GENO can be written as \(\min _x \{\max _i f_i(x)\}\) with smooth functions \(f_i(x)\) [36], which is a fairly flexible class that accommodates many of the commonly-encountered non-smooth objective functions. All problems within this class can be transformed into constrained, smooth problems of the form

$$ \displaystyle \min _{t, x}\,\, t \quad {{\,\mathrm{s.t.}\,}}\quad f_i(x) \le t \,\, \forall i. $$
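As a worked instance of this transformation (an illustration chosen here, not one of the problems in Fig. 1), consider minimizing the \(\ell _\infty \)-norm of a residual \(Ax-b\), where \(a_i^\top \) denotes the i-th row of A. The objective is a maximum over smooth (in fact linear) functions,

$$ \left\| Ax-b\right\| _\infty = \max _i \, \max \{ a_i^\top x - b_i,\; b_i - a_i^\top x \}, $$

so that the constrained, smooth form reads

$$ \displaystyle \min _{t, x}\,\, t \quad {{\,\mathrm{s.t.}\,}}\quad a_i^\top x - b_i \le t \,\text { and }\, b_i - a_i^\top x \le t \,\, \forall i. $$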

The transformed problems are then solved by a solver for constrained, smooth optimization problems. Hence, within the GENO software stack only a solver for constrained, smooth optimization problems is needed. In the next section we describe the solver that is implemented in the GENO software stack.

3 Generic Optimizer

GENO’s generic optimizer employs a solver for unconstrained, smooth optimization problems. This solver is then extended to also handle constraints. The choice of the solver that is implemented within the GENO software stack is motivated by applications in machine learning. Optimization problems in machine learning typically have between a few dozen and a few million variables, and the involved data matrices do not have any special structure and are typically not sparse, that is, at least 10% of the entries are non-zero. These properties exclude second-order optimization algorithms and justify our choice to implement a slightly modified version of the L-BFGS-B algorithm [9, 44] that can handle smooth optimization problems that have no general constraints, except possibly bound constraints on the variables. It provides a good trade-off between the number of iterations and the complexity per iteration. It also does not assume any structure in the problem data and it is numerically quite robust. On quadratic problems it shares the same convergence guarantees [22, 34] as Nesterov’s optimal gradient descent method [35], but in contrast to Nesterov’s method it is parameter-free, i.e., no parameters need to be tuned or known for the specific problem.

3.1 Solver for Bound-Constrained Smooth Problems

The solver for bound-constrained, smooth optimization problems combines a standard limited-memory quasi-Newton method with a projected gradient path approach. In each iteration, the gradient path is projected onto the box constraints and the quadratic function based on the second-order approximation (L-BFGS) of the Hessian is minimized along this path. All variables that are at their boundaries are fixed and only the remaining free variables are optimized using the second-order approximation. Any solution that is not within the bound constraints is projected back onto the feasible set by a simple min/max operation [32]. Only in rare cases does a projected point fail to yield a descent direction. In this case, instead of using the projected point, one picks the best point that is still feasible along the ray towards the solution of the quadratic approximation. Then, a line search is performed to satisfy the strong Wolfe conditions [41, 42]. This ensures convergence also in the non-convex case. The line search also removes the need for a predefined step length parameter. We use the line search proposed in [33], which we enhance by a backtracking line search in case the solver enters a region where the function is not defined.
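The min/max projection onto the box constraints can be illustrated with a minimal NumPy sketch. It uses a plain projected gradient iteration with a fixed step size on a small non-negative least squares problem; this is a deliberate simplification and not the L-BFGS-B variant with line search described above. The function names and the test data are assumptions made for this illustration.

```python
import numpy as np

def project_onto_box(x, lower, upper):
    """Project x onto {z : lower <= z <= upper} by a component-wise min/max."""
    return np.minimum(np.maximum(x, lower), upper)

def projected_gradient_nnls(A, b, iterations=200):
    """Minimize ||A x - b||^2 subject to x >= 0 by plain projected gradient."""
    n = A.shape[1]
    lower = np.zeros(n)
    upper = np.full(n, np.inf)
    # Step 1/L, where L = 2 * sigma_max(A)^2 is the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
    x = np.zeros(n)
    for _ in range(iterations):
        grad = 2.0 * A.T @ (A @ x - b)
        x = project_onto_box(x - step * grad, lower, upper)
    return x

# Hypothetical data: the unconstrained minimizer has a negative second entry,
# so the bound x >= 0 is active at the solution.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, -2.0, 0.0])
x_star = projected_gradient_nnls(A, b)
print(x_star)
```

The fixed step size stands in for the line search; a quasi-Newton model as in L-BFGS-B typically needs far fewer iterations on larger problems.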

3.2 Solver for Constrained Smooth Problems

There are quite a few options for solving smooth, constrained optimization problems. We decided to use the augmented Lagrangian approach [21, 38]. It allows us to (re-)use our solver for smooth, unconstrained problems, it is fairly robust, and it does not require tuning any parameters. The augmented Lagrangian method can be used for solving the following general standard form of an abstract constrained optimization problem

$$\begin{aligned} \displaystyle \min _{x} \,\, f(x) \quad {{\,\mathrm{s.t.}\,}}\quad h(x) = 0 \,\text { and }\, g(x) \le 0, \end{aligned} \tag{1}$$

where \(x\in \mathbb {R}^n\), \(f:\mathbb {R}^n \rightarrow \mathbb {R}\), \(h:\mathbb {R}^n\rightarrow \mathbb {R}^m\), \(g:\mathbb {R}^n\rightarrow \mathbb {R}^p\) are differentiable functions, and the equality and inequality constraints are understood component-wise.

The augmented Lagrangian of Problem (1) is the following function

$$ L_\rho (x, \lambda , \mu ) = f(x) + \frac{\rho }{2} \left\| h(x)+\frac{\lambda }{\rho }\right\| ^2 + \frac{\rho }{2} \left\| \left( g(x) + \frac{\mu }{\rho }\right) _+\right\| ^2, $$

where \(\lambda \in \mathbb {R}^m\) and \(\mu \in \mathbb {R}_{\ge 0}^p\) are the Lagrange multipliers, also known as dual variables, \(\rho >0\) is a constant, \(\left\| \cdot \right\| \) denotes the Euclidean norm, and \((v)_+\) denotes \(\max \{v, 0\}\). The augmented Lagrangian is the standard Lagrangian of Problem (1) augmented by a quadratic penalty term. The quadratic term provides increased stability during the optimization process which can be seen, for example, in the case that Problem (1) is a linear program.

The augmented Lagrangian algorithm (Algorithm 1) runs in iterations. Upon convergence, it returns an approximate solution x to the original problem along with approximate Lagrange multipliers that solve the dual problem. If Problem (1) is convex, then the algorithm returns a globally optimal solution. Otherwise, it returns a local optimum [6]. The update of the penalty parameter \(\rho \) can be omitted and the algorithm still converges [6]. However, in practice it is beneficial to increase it depending on the progress in satisfying the constraints [7]. If the infinity norm of the constraint violation decreases by a factor less than \(\tau =1/2\) in one iteration, then \(\rho \) is multiplied by a factor of two.

Algorithm 1. Augmented Lagrangian algorithm.
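The iteration scheme can be sketched in plain Python on a tiny convex problem. This is a minimal sketch under stated assumptions, not the GENO implementation. The problem (minimize \(x_0^2+x_1^2\) subject to \(x_0+x_1=1\) and \(x_0\ge 0.6\)) is chosen here for illustration. The inner minimization uses fixed-step gradient descent in place of L-BFGS-B. The multiplier updates \(\lambda \leftarrow \lambda + \rho h(x)\) and \(\mu \leftarrow (\mu + \rho g(x))_+\) are the standard ones for this form of the augmented Lagrangian. The penalty rule is read as doubling \(\rho \) unless the violation shrinks to at most \(\tau \) times its previous value.

```python
import math

def h(x):
    """Equality constraint h(x) = 0, here x0 + x1 = 1."""
    return x[0] + x[1] - 1.0

def g(x):
    """Inequality constraint g(x) <= 0, here x0 >= 0.6."""
    return 0.6 - x[0]

def grad_augmented_lagrangian(x, lam, mu, rho):
    """Gradient in x of f + rho/2 (h + lam/rho)^2 + rho/2 ((g + mu/rho)_+)^2
    for the objective f(x) = x0^2 + x1^2."""
    a = lam + rho * h(x)             # rho * (h(x) + lam/rho)
    b = max(mu + rho * g(x), 0.0)    # rho * (g(x) + mu/rho)_+
    # For the constraints above, grad h = (1, 1) and grad g = (-1, 0).
    return [2.0 * x[0] + a - b, 2.0 * x[1] + a]

def augmented_lagrangian(rho=10.0, tau=0.5, tol=1e-9, max_outer=50):
    x, lam, mu = [0.0, 0.0], 0.0, 0.0
    previous_violation = math.inf
    for _outer in range(max_outer):
        # Inner solve: fixed-step gradient descent stands in for L-BFGS-B.
        # 2 + 3 * rho bounds the Lipschitz constant of the gradient for this problem.
        step = 1.0 / (2.0 + 3.0 * rho)
        for _inner in range(2000):
            grad = grad_augmented_lagrangian(x, lam, mu, rho)
            x = [x[0] - step * grad[0], x[1] - step * grad[1]]
        # Multiplier updates.
        lam = lam + rho * h(x)
        mu = max(mu + rho * g(x), 0.0)
        violation = max(abs(h(x)), max(g(x), 0.0))
        if violation < tol:
            break
        # Assumed reading of the penalty rule: double rho unless the violation
        # shrank to at most tau times its previous value.
        if violation > tau * previous_violation:
            rho *= 2.0
        previous_violation = violation
    return x, lam, mu

x, lam, mu = augmented_lagrangian()
print(x, lam, mu)
```

For this problem the KKT conditions give the solution \(x=(0.6, 0.4)\) with multipliers \(\lambda =-0.8\) and \(\mu =0.4\), which the sketch approximates.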

4 Matrix and Tensor Calculus

The solver at the core of GENO’s generic optimizer, an implementation of the L-BFGS-B algorithm for bound-constrained smooth problems, runs in iterations. In each iteration, expressions for the objective function and its gradient are evaluated. Within GENO these expressions, especially the gradients, have to be made available to the solver. Expressions for objective functions are given in GENO’s modeling language that uses a vectorized notation, that is, a notation that avoids explicit indices. The advantage of a vectorized notation is that expressions can be mapped more or less directly to BLAS calls and thus to highly optimized BLAS implementations. For GENO we also want this advantage for the gradients. Hence, we need to compute derivatives of matrix expressions. Although computing derivatives of matrix and tensor expressions is a fundamental and frequent task, surprisingly, no algorithm existed that would solve this problem in the general case. In the following, we describe our approach [24, 28], which for the first time made it possible to compute derivatives of general tensor expressions. It was shown in [24] that evaluating derivatives of non-scalar valued functions computed by this approach is two orders of magnitude faster than previous state-of-the-art approaches when evaluated on the CPU and up to three orders of magnitude faster when evaluated on the GPU. An implementation of our approach is integrated into the GENO software stack. It is also available as a standalone tool at [26].

4.1 Problems with Matrix Notation

Computing derivatives of scalar functions, i.e., \(f:\mathbb {R}\rightarrow \mathbb {R}\), is a straightforward task and is already taught in high school. One just applies the chain rule repeatedly and multiplies the partial derivatives together. For instance, consider the function \(f(x) = \sin (x^2)\). Its derivative is \(f^\prime (x) = \cos (x^2)\cdot 2\cdot x\). However, this no longer works in the matrix and tensor case. In contrast to the scalar case, where only one type of multiplication operator exists, there are several types of multiplication in the matrix and tensor case. It has been shown that 24 different types of multiplication are necessary for representing the derivatives of matrix expressions in the linear case alone [37]. Hence, it is essential to find a good representation of matrix and tensor multiplications.

Furthermore, when computing derivatives of vector and matrix expressions, even matrix notation is not sufficient to express all derivatives. For instance, for a function \(f:\mathbb {R}^n\rightarrow \mathbb {R}^m\), the derivative will be a matrix \(M\in \mathbb {R}^{m\times n}\). But already its second derivative will be \(T\in \mathbb {R}^{m\times n\times n}\), i.e., a third order tensor, which cannot be represented in standard matrix notation. One usually circumvents this by using the \({{\,\textrm{vec}\,}}\)-operator, which maps a matrix to a vector by stacking its columns on top of each other, together with the Kronecker product. This way, one can flatten some dimensions and emulate higher order tensors and their multiplications. However, still not all necessary multiplications can be represented this way and it unnecessarily complicates the representation. And even in the two-dimensional case, i.e., when the derivative is a tensor of order two, it might have no corresponding representation as a matrix. For instance, consider the simple quadratic function \(f(x) = x^\top Ax\), where \(x\in \mathbb {R}^n\) is a vector and \(A\in \mathbb {R}^{n \times n}\) is a matrix. When computing the derivative of f with respect to x using the chain rule, one has to compute the derivative of \(x^\top \) with respect to x, i.e., the derivative of a function that maps x to its transpose. This is not the identity matrix. In fact, it is not even representable as a matrix. In the more powerful Ricci notation [39] it would be written as the tensor \(\delta _{ij}\). Hence, the right representation of tensors and operators on them, especially of multiplications between them, is crucial. In fact, choosing the right representation has led to the first general and coherent matrix and tensor calculus theory [24]. Before, only a number of cases could be treated systematically. While the first theory used Ricci notation to represent tensors and their multiplications, it turned out that using a generalized form of Einstein notation makes the process of computing derivatives even simpler and more coherent [28].
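For the quadratic function \(f(x) = x^\top Ax\) discussed above, writing the function with explicit indices avoids the question of how to represent the derivative of a transpose. The following NumPy sketch (random test data and variable names are assumptions for this illustration) differentiates \(f=\sum _{i,j} x_i A_{ij} x_j\) index by index and compares the result with the familiar matrix form \((A+A^\top )x\) and with central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def f(x):
    """Quadratic form x^T A x."""
    return x @ A @ x

# Index notation: f = sum_{i,j} x[i] A[i,j] x[j], hence
# df/dx[k] = sum_j A[k,j] x[j] + sum_i x[i] A[i,k].
grad_index = np.einsum('kj,j->k', A, x) + np.einsum('i,ik->k', x, A)

# The same gradient in matrix notation.
grad_matrix = (A + A.T) @ x

# Central finite differences as an independent check.
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
print(np.max(np.abs(grad_index - grad_fd)))
```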

4.2 Einstein Notation

In tensor calculus one can distinguish three types of multiplication, namely inner, outer, and element-wise multiplication. Indices are used for distinguishing between these types. For tensors A, B, and C any multiplication of A and B can be written as

$$\begin{aligned} C[s_3] = \sum _{(s_1\cup s_2)\setminus s_3} A[s_1] \cdot B[s_2], \end{aligned} \tag{2}$$

where C is the result tensor and \(s_1, s_2\), and \(s_3\) are the index sets of the left argument, the right argument, and the result tensor, respectively. The summation is over all indices that appear in at least one of the two multiplication’s arguments A and B and are not present in the result tensor C. The index set of the result tensor is always a subset of the union of the index sets of the multiplication’s arguments, that is, \(s_3\subseteq (s_1\cup s_2)\). In the following we denote the generic tensor multiplication as defined in Eq. (2) simply as

$$ C = A *_{(s_1, s_2, s_3)} B. $$

This notation is basically identical to the tensor multiplication einsum in NumPy, TensorFlow, and PyTorch, and to the notation used in the Tensor Comprehension Package [40].

Note that the \(*_{(s_1, s_2, s_3)}\)-notation comes close to standard Einstein notation. In Einstein notation the index set \(s_3\) of the output is omitted and the convention is to sum over all shared indices in \(s_1\) and \(s_2\). However, this restricts the types of multiplications that can be represented. The set of multiplications that can be represented in standard Einstein notation is a proper subset of the multiplications that can be represented by our notation. For instance, standard Einstein notation is not capable of representing element-wise multiplications directly. Still, in the following we refer to the \(*_{(s_1, s_2, s_3)}\)-notation simply as Einstein notation, as is standard practice in many linear algebra packages.
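Since the notation is basically identical to einsum, the three types of multiplication can be tried out directly in NumPy. In the short sketch below (the data are arbitrary), the index string before the arrow lists \(s_1\) and \(s_2\), and the part after the arrow is \(s_3\). The last line omits the output index set, in which case NumPy sums over the shared index, so the element-wise product turns into an inner product.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([1.0, -1.0])
y = np.array([2.0, 5.0])

# s1 = {i,j}, s2 = {j}, s3 = {i}: inner multiplication (matrix-vector product).
inner = np.einsum('ij,j->i', A, x)

# s1 = {i}, s2 = {j}, s3 = {i,j}: outer multiplication.
outer = np.einsum('i,j->ij', x, y)

# s1 = s2 = s3 = {i}: element-wise multiplication; needs the explicit output set.
elementwise = np.einsum('i,i->i', x, y)

# Without an explicit output index set the shared index i is summed over.
implicit = np.einsum('i,i', x, y)
```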

4.3 Tensor Calculus

In the following, let \(\left\| A\right\| =\sqrt{\sum _s A[s]^2}\) denote the norm of a tensor A. For vectors it coincides with the Euclidean norm and for matrices with the Frobenius norm. The following definition generalizes the standard derivative to the multi-dimensional case.

Definition 1 (Fréchet Derivative)

Let \(f:\mathbb {R}^{n_1\times n_2\times \ldots \times n_k}\rightarrow \mathbb {R}^{m_1\times m_2\times \ldots \times m_l}\) be a function that takes an order-k tensor as input and maps it to an order-l tensor as output. Then, \(D\in \mathbb {R}^{m_1\times m_2\times \ldots \times m_l \times n_1\times n_2\times \ldots \times n_k}\) is called the derivative of f at x if and only if

$$ \lim _{h\rightarrow 0} \frac{\left\| f(x+h) - f(x) - D\circ h\right\| }{\left\| h\right\| } = 0, $$

where \(\circ \) is an inner tensor product.

Here, the notation \(D\circ h\) is short for the inner product \(D*_{(s_1s_2, s_2, s_1)}h\), where \(s_1s_2\) is the index set of D and \(s_2\) is the index set of h. For instance, if \(D\in \mathbb {R}^{m_1\times n_1 \times n_2}\) and \(h\in \mathbb {R}^{n_1\times n_2}\), then \(s_1=\{i\}\), \(s_2=\{j,k\}\), and hence \(s_1s_2=\{i,j,k\}\).
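Definition 1 can be checked numerically on a small example. The sketch below assumes the illustrative function \(f(X) = (X\odot X)v\), which maps an order-2 tensor to an order-1 tensor. Its derivative is the order-3 tensor \(D[i,j,k] = \delta _{ij}\cdot 2X[j,k]\cdot v[k]\), and \(D\circ h\) is evaluated with the index sets \(s_1=\{i\}\) and \(s_2=\{j,k\}\). The function, the data, and the variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
X = rng.standard_normal((m, n))
v = rng.standard_normal(n)

def f(X):
    """Maps an order-2 tensor (m x n) to an order-1 tensor (m)."""
    return (X * X) @ v

# Derivative tensor D[i, j, k] = d f_i / d X[j, k] = delta_ij * 2 * X[j, k] * v[k].
D = np.einsum('ij,jk,k->ijk', np.eye(m), 2.0 * X, v)

direction = rng.standard_normal((m, n))
ratios = []
for scale in (1e-1, 1e-2, 1e-3):
    h = scale * direction
    D_h = np.einsum('ijk,jk->i', D, h)   # D o h = D *_(ijk, jk, i) h
    remainder = np.linalg.norm(f(X + h) - f(X) - D_h)
    ratios.append(remainder / np.linalg.norm(h))

# The ratios shrink with the norm of h, as required by the limit in Definition 1.
print(ratios)
```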

With this definition at hand, we can compute derivatives of matrix and tensor expressions in Einstein notation. As noted in the beginning of this section, derivatives are usually computed using the chain rule. There are two major orderings in which we can apply the chain rule: in a forward fashion and in a reverse fashion. These are known as forward mode and reverse mode in the area of algorithmic differentiation (AD, also known as automatic differentiation) [18]. They both result in the same derivative but not necessarily in the same expression for the derivative. The forward mode coincides with what is usually taught in high school and is commonly referred to as symbolic computation of derivatives [23]. Here, we will only describe the reverse mode since this is the mode that is used within the GENO software stack.

Any expression can be represented as a directed acyclic expression graph (expression DAG). Figure 2 shows the expression DAG for the objective function of the logistic regression, i.e.,

$$\begin{aligned} 1^\top (y\odot \log (\exp (Xw)+1)), \end{aligned} \tag{3}$$

where \(\odot \) denotes the element-wise multiplication.

Fig. 2. Expression DAG for Expression (3).

The nodes of the DAG that have no incoming edges represent the variables or constants of the expression and are referred to as input nodes. The nodes of the DAG that have no outgoing edges represent the functions that the DAG computes and are referred to as output nodes. Let the DAG have n input nodes (variables) and m output nodes (functions). Note that the DAG in Fig. 2 has only one output node. We label the input nodes as \(x_0, \ldots , x_{n-1}\), the output nodes as \(y_0, \ldots , y_{m-1}\), and the internal nodes as \(v_0,\ldots , v_{k-1}\). Every internal and every output node represents an operator whose arguments are supplied by the incoming edges.

When evaluating the DAG, i.e., computing the function values that the DAG represents for some given input, one proceeds from the input nodes to the output nodes. In forward mode automatic differentiation one proceeds in the same direction for computing the derivative and in reverse mode in reverse order, i.e., from output to input nodes. Each node \(v_i\) will eventually store the derivative \(\frac{\partial y_j}{\partial v_i}\) which is usually denoted as \(\bar{v}_i\), where \(y_j\) is the function to be differentiated. This partial derivative is often referred to as adjoint. These derivatives are computed as follows: First, the derivatives \(\frac{\partial y_j}{\partial y_i}\) are stored at the output nodes of the DAG. Then, the derivatives that are stored at the remaining nodes, here called z, are iteratively computed by summing over all their outgoing edges as follows

$$\begin{aligned} \bar{z} = \frac{\partial y_j}{\partial z} = \sum _{f\,:\, (z, f)\in E} \frac{\partial y_j}{\partial f}\cdot \frac{\partial f}{\partial z} = \sum _{f\,:\, (z, f)\in E} \bar{f} \cdot \frac{\partial f}{\partial z}, \end{aligned} \tag{4}$$

where the multiplication is again tensorial. The following theorems specify the type of tensor multiplication used in the reverse-mode Eq. (4). Their proofs can be found in [29].

Theorem 1

Let Y be an output node with index set \(s_4\) and let \(C = A *_{(s_1, s_2, s_3)} B\) be a multiplication node of the expression DAG. Then the contribution of C to the adjoint \(\bar{B}\) of B is \(\bar{C}*_{(s_4s_3, s_1, s_4s_2)}A\) and its contribution to the adjoint \(\bar{A}\) of A is \(\bar{C}*_{(s_4s_3, s_2, s_4s_1)}B\).

If the output function Y in Theorem 1 is scalar-valued, then we have \(s_4=\emptyset \) and the adjoint coincides with the function implemented in all modern deep learning frameworks including TensorFlow and PyTorch. Hence, our approach can be seen as a direct generalization of the scalar case.
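Theorem 1 can be tried out with einsum on a matrix-vector product \(C = A *_{(ij, j, i)} x\), that is, \(s_1=\{i,j\}\), \(s_2=\{j\}\), and \(s_3=\{i\}\). The sketch below first uses a scalar output \(y = c^\top C\), so that \(s_4=\emptyset \), and then the vector-valued output \(Y = C\) with \(s_4=\{k\}\), for which the adjoint of C is the identity. The data and the weight vector c are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
c = rng.standard_normal(3)

# Forward pass: C = A *_(ij, j, i) x, followed by the scalar output y = c^T C.
C = np.einsum('ij,j->i', A, x)
y = c @ C

# Scalar output: s4 is empty and the adjoint of C is dy/dC = c.
C_bar = c
x_bar = np.einsum('i,ij->j', C_bar, A)   # C_bar *_(s3, s1, s2) A
A_bar = np.einsum('i,j->ij', C_bar, x)   # C_bar *_(s3, s2, s1) x

# Vector-valued output Y = C with s4 = {k}: the adjoint of C is the identity,
# and the adjoint of x is the full Jacobian dC/dx.
C_bar_full = np.eye(3)
jacobian_x = np.einsum('ki,ij->kj', C_bar_full, A)   # C_bar *_(s4 s3, s1, s4 s2) A
```

In the scalar case the two adjoints are the familiar expressions \(A^\top c\) and \(c\,x^\top \); in the vector-valued case the adjoint of x is the matrix A itself.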

Theorem 2

Let Y be an output function with index set \(s_3\), let f be a general unary function whose domain has index set \(s_1\) and whose range has index set \(s_2\), let A be a node in the expression DAG, and let \(C= f(A)\). The contribution of the node C to the adjoint \(\bar{A}\) is

$$\begin{aligned} \bar{f}*_{(s_3s_2,s_2s_1,s_3s_1)} f'(A), \end{aligned}$$

where \(f^\prime \) is the derivative of f.

If the unary function is applied element-wise to a tensor, Theorem 2 simplifies as follows.

Theorem 3

Let Y be an output function with index set \(s_2\), let f be an elementwise unary function, let A be a node in the expression DAG with index set \(s_1\), and let \(C= f(A)\) where f is applied element-wise. The contribution of the node C to the adjoint \(\bar{A}\) is

$$\begin{aligned} \bar{f}*_{(s_2s_1,s_1,s_2s_1)} f'(A), \end{aligned}$$

where \(f^\prime \) is the derivative of f.

Table 1 shows the individual steps of the reverse mode applied to the expression graph in Fig. 2. Note that reverse mode manages to compute the derivative of the output function with respect to all input variables in one pass. The last column shows the derivatives in matrix notation after a few simplifications have been applied, like removal of zero and identity tensors. From the first two rows we can read off the derivative of f with respect to X and the derivative with respect to w. The values of the intermediate results and common subexpressions \(v_1\) and \(v_2\) can be substituted again to obtain the final expression \(X^\top \cdot (y\odot \exp (Xw) \oslash (\exp (Xw) + 1))\) for the derivative with respect to w. This expression can then be mapped very easily to a NumPy expression. In the next section, we will discuss how to map such expressions also to different hard- and software backends.

Table 1. Individual steps of the reverse mode automatic differentiation of the logistic regression function, i.e., \(1^\top (y\odot \log (\exp (Xw)+1))\) with respect to all input variables.
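The reverse sweep for the derivative with respect to w can be replayed by hand in NumPy. The sketch below is an illustration under stated assumptions: the intermediate names (z, e, s, l, p) are chosen here and need not match the node labels of Fig. 2 or Table 1, and the data are random. It compares the hand-propagated adjoint with the closed-form expression and with central finite differences.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 3))
y = rng.standard_normal(5)
w = rng.standard_normal(3)

def objective(w):
    """1^T (y * log(exp(X w) + 1))."""
    return np.sum(y * np.log(np.exp(X @ w) + 1.0))

# Forward sweep: evaluate the nodes of the expression DAG from inputs to output.
z = X @ w
e = np.exp(z)
s = e + 1.0
l = np.log(s)
p = y * l
out = np.sum(p)

# Reverse sweep: propagate adjoints from the output back to the input w.
p_bar = np.ones_like(p)   # d out / d p
l_bar = p_bar * y         # element-wise product with y
s_bar = l_bar / s         # derivative of log
e_bar = s_bar             # adding a constant leaves the adjoint unchanged
z_bar = e_bar * e         # derivative of exp
w_bar = X.T @ z_bar       # matrix-vector product

# Closed form and finite differences as independent checks.
closed_form = X.T @ (y * np.exp(X @ w) / (np.exp(X @ w) + 1.0))
eps = 1e-6
finite_diff = np.array([(objective(w + eps * d) - objective(w - eps * d)) / (2 * eps)
                        for d in np.eye(3)])
print(np.max(np.abs(w_bar - finite_diff)))
```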

5 autoBLAS

GENO aims at providing state-of-the-art performance on a wide variety of backends that include multicore CPUs and GPUs. Hence, it is necessary to generate efficient code for all these backends. This is the purpose of autoBLAS. GENO does not need to directly compile the specification of an optimization problem into executable code but it can map it to an intermediate representation where linear algebra expressions are given as blocks of autoBLAS code. The autoBLAS precompiler then compiles the intermediate code into standard code for the specified backends. autoBLAS itself features an intuitive syntax for linear algebra expressions that is easy to read and comprehend, and delegates the details of their execution to highly efficient implementations of BLAS routines for the respective backends, such as AMD Blis [43], Intel MKL [10], Arm Performance Libraries [30], NVIDIA cuBLAS [11], and AMD clBLAS [4].

5.1 A Simple autoBLAS Example

To illustrate autoBLAS, we discuss a minimal example, namely, a matrix-vector product. Listing 1.1 shows a snippet of C++ code initializing a set of std::vectors that represent vectors and matrices, followed by a pragma-style declaration of an autoBLAS section. The autoBLAS section first declares two vectors x and y, and a matrix A. Each declaration comes with a set of name-value pairs, like rows=rows, that describe required properties for generating code to evaluate expressions of the associated variables. The set of supported names and the restrictions on values are the responsibility of the selected host-language context. Here, the C-language context has been chosen by setting c=c. The currently supported contexts are the C-language, the Eigen library, the NumPy library, and cuBLAS (CUDA).

Listing 1.1. C++ host code with an embedded autoBLAS section.

The code in Listing 1.1 is, of course, not valid C++ code and cannot be compiled directly with a standard C++ compiler. In order to get host-language code for the expressions that are stated in embedded autoBLAS sections, the autoBLAS precompiler has to be called first. Listing 1.2 shows how to invoke the autoBLAS precompiler on a file. In this simple example, the -b flag is set to select the target routines for the host-language mappings, here, the standard C-binding cblas.

Listing 1.2. Invocation of the autoBLAS precompiler.

In our specific example, the autoBLAS precompiler replaces the autoBLAS section in the host-language file with a call to gemv, which is the BLAS routine that computes matrix-vector products [1]. The generated code, shown in Listing 1.3, is now valid C++ code that can be passed to a conforming compiler like gcc.

Listing 1.3. Generated C++ code for the cblas backend.

If we want to generate code for the CUDA backend, then we just have to invoke the autoBLAS precompiler with autoblas -b cuda<> example.c. The generated code, shown in Listing 1.4, is now valid C++/CUDA code that again can be passed directly to a conforming compiler.

Listing 1.4. Generated C++/CUDA code for the CUDA backend.

5.2 Design

By defining an embedded language of its own, autoBLAS is as intuitive to use as task specific frameworks like MATLAB when it comes to expressing what to compute. Additionally, by not being bound to a particular programming language, autoBLAS can perform any transformation and optimization necessary on the symbolic level at compile time, even beyond the scope of a single statement. Finally, autoBLAS delegates the task of how to evaluate the optimized expressions by generating the corresponding BLAS calls. This allows the user to utilize highly efficient BLAS implementations for the target platform without having to write these calls by hand.

Figure 3 illustrates the three abstract steps of the autoBLAS compiler. The frontend is the user-facing part of autoBLAS and comprises both the expression syntax as well as the context selection. The context specifies attributes of the variables like, for instance, their memory layout and the BLAS selections available at compile time.

Fig. 3. The design of autoBLAS is divided into three independent components: the user-facing frontend, the optimizing core, and the executing backend.

The core implements a set of symbolic optimizations to increase execution performance at runtime, while also allocating and reusing memory for temporaries, if necessary. By performing these syntax-tree optimizations independently of a specific target API, autoBLAS provides uniform evaluation semantics across different target platforms, thereby minimizing unpleasant surprises like different operator semantics or optimization behavior when switching between libraries.

The backend generates code for the optimized expressions for the respective linear algebra library selected by the caller. Backends define a set of necessary attributes for evaluating expressions into code. For instance, for the cblas [1] backend, a dense matrix is often represented by a data pointer, a storage orientation, the number of rows and columns and the size of the leading dimension. A context is compatible with a specific backend if it provides all necessary attributes for a particular data type.
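To make the role of these attributes concrete, the following pure-Python sketch spells out the reference semantics of the BLAS gemv operation, \(y \leftarrow \alpha Ax + \beta y\), for a row-major matrix stored in a flat buffer with a leading dimension. It is an illustrative reference loop only, not code generated by autoBLAS and not an optimized BLAS routine. The function name and the argument order are assumptions, and transposition and vector strides are left out for brevity.

```python
def gemv_row_major(rows, cols, alpha, data, lda, x, beta, y):
    """Reference semantics of gemv: y <- alpha * A * x + beta * y.

    A is stored row by row in the flat buffer `data`; element A[i, j] lives at
    data[i * lda + j], where lda >= cols is the leading dimension.
    """
    for i in range(rows):
        acc = 0.0
        for j in range(cols):
            acc += data[i * lda + j] * x[j]
        y[i] = alpha * acc + beta * y[i]
    return y

# A 2 x 3 matrix embedded in a buffer with leading dimension 4
# (one unused padding slot at the end of each row).
data = [1.0, 2.0, 3.0, 0.0,
        4.0, 5.0, 6.0, 0.0]
x = [1.0, 0.0, -1.0]
y = [10.0, 20.0]
gemv_row_major(2, 3, 2.0, data, 4, x, 0.5, y)
print(y)
```

The leading dimension lets the same buffer describe a submatrix of a larger array, which is why it is a separate attribute from the number of columns.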

An advantage of selecting a BLAS-like backend is that, when later profiling the code, the user is able to directly refer to potential bottlenecks. This is in contrast to template-based libraries like Eigen, where the actually called routines are not directly visible and do not correspond to a particular line within the host-language code.

A major benefit of the separation into frontend, core and backend is that extending autoBLAS with a new backend is rather simple and in practice merely requires deriving from a class and implementing the mappings from BLAS expressions to target code. At the same time, a developer who is extending autoBLAS in this way still benefits from all the symbolic optimizations implemented in the autoBLAS core.

6 Conclusions

Making generic optimization (GENO) work efficiently requires several fairly different interoperable software components. In this chapter we have described such components and their integration into the GENO software stack. By carefully designing, implementing and integrating the components in the GENO software we are able to generate optimization code that is competitive with problem-specific hand-written solvers and orders of magnitude faster than competing approaches that are comparably easy to use. Furthermore, the components, specifically, the generic optimizer, the matrix and tensor calculus, and autoBLAS, are of independent interest and are also used in projects other than GENO.