Abstract
GENO (generic optimization) is a domain-specific language for mathematical optimization. The GENO software generates a solver from a specification of an optimization problem class. The optimization problems, that is, their objective function and constraints, are specified in a formal language. The problem specification is then translated into a general normal form. Problems in normal form are then passed on to a general-purpose solver. In its iterations, the solver evaluates expressions for the objective function, the constraints, and their derivatives. Hence, computing symbolic gradients of linear algebra expressions is an important component of the GENO software stack. The expressions are evaluated on the available hardware platforms, including CPUs and GPUs from different vendors. This becomes possible by compiling the expressions into BLAS (Basic Linear Algebra Subroutines) calls that have been optimized for the different hardware platforms by their vendors. The compiler, called autoBLAS, that translates formal linear algebra expressions into optimized BLAS calls is another important component of the GENO software stack. By putting all the components together, the generated solvers are competitive with problem-specific handwritten solvers and orders of magnitude faster than competing approaches that offer comparable ease of use. While this article describes the full GENO software stack, its components are also of interest on their own and have thus been made available independently.
1 Introduction
GENO makes state-of-the-art performance in solving optimization problems easily accessible. Since optimization problems are ubiquitous in science, engineering, and economics, it is not surprising that they come in many different flavors. Traditionally, a main distinction is made between discrete and continuous optimization problems. The focus of GENO is on the continuous case. Prominent examples of classes of continuous optimization problems are linear programs (LPs), quadratic programs (QPs), second-order cone programs (SOCPs), and semidefinite programs (SDPs). For these classes, efficient algorithms and well-engineered implementations (solvers) have existed for many years. The solvers are typically called from a programming environment. The optimization problems’ data are passed to the solver through function calls. It is the responsibility of the programmer to provide the data in the right format, that is, compliant with a standard form for the specific problem class. The burden of reformulating the problems in standard form is alleviated by modeling languages that transform a problem specification into standard form. Popular modeling languages are CVX [12, 17] for MATLAB and its Python extension CVXPY [2, 13], Pyomo [19, 20] for Python, and JuMP [14], which is bound to Julia. These languages take an instance of an optimization problem and transform it into some standard form of an LP, QP, SOCP, or SDP, respectively. The transformed problem is then passed to a solver that expects the standard form. However, the transformation can be computationally inefficient, because the representation in standard form can be large in terms of the problem size. Also, the solver is called from within the programming environment only for the given problem instance. The modeling language plus solver approach has been made deployable in the CVXGEN [31], QPgen [16], and OSQP [5] projects. In these projects, code is generated for the specified problem class and not just for one problem instance.
However, the problem dimensions need to be fixed, and the generated code is optimized only for very small or sparse problems. There also exist implementations of the modeling language plus solver approach that are independent of a specific programming environment. Prominent examples are AMPL [15] and GAMS [8], which are popular in the operations research community.
GENO differs from previous work by a much tighter coupling of the language and the solver. GENO does not transform problem instances but whole problem classes, including constrained problems, into a very general standard form. Since the standard form is independent of any specific problem instance, it does not grow for larger instances. Hence, the generated solvers can be used like handwritten solvers. They even reach or surpass the efficiency of handwritten solvers for large dense problems. Typically, they are orders of magnitude faster than state-of-the-art modeling language plus solver approaches.
In this article, which is based on the original publications [24, 25, 28, 29], we describe the full GENO software stack. The tight coupling of modeling language and solver is achieved in GENO by computing symbolic gradients that are evaluated by the solver on the given data of the optimization problem. Hence, an important part of GENO’s software stack is a facility for computing derivatives of linear algebra expressions. GENO’s modeling language allows the user to specify whole classes of optimization problems in terms of the objective function and constraints, which are given as vectorized linear algebra expressions. Neither the objective function nor the constraints need to be differentiable. Non-differentiable problems are transformed into constrained, differentiable problems. A general-purpose solver for constrained, differentiable problems is then instantiated with the objective function, the constraint functions, and their respective gradients. Using vectorized linear algebra allows a direct mapping onto optimized implementations of BLAS (Basic Linear Algebra Subroutines) routines. BLAS and its close relative LAPACK [3] are the de facto standard for the language-independent high-performance evaluation of linear algebra expressions. Almost all major hardware vendors provide individual BLAS implementations for their particular hardware, including CPUs (AMD Blis [43], Intel MKL [10], Arm Performance Libraries [30]) and GPUs (NVIDIA cuBLAS [11], AMD clBLAS [4]). GENO supports different hardware platforms through the autoBLAS precompiler that translates linear algebra expressions into optimized BLAS library calls for the addressed hardware.
The GENO software stack comprises a modeling language (Sect. 2), a generic solver (Sect. 3), a matrix and tensor calculus (Sect. 4), and an automatic mapping to BLAS (Sect. 5). The latter three components of GENO’s software stack are of interest in a broader context than GENO and hence have been made available independently. GENO is available at www.genoproject.org [27].
2 Modeling Language
GENO’s modeling language uses a MATLAB-like syntax for specifying optimization problems. MATLAB is a platform for numerical computations using matrices. The advantages of using matrix expressions are twofold: First, it allows the user to phrase an optimization problem without specifying the number of variables or the number of constraints. Hence, the generated solver is not tied to a specific instance but can handle arbitrarily sized problems. Second, it enables direct mappings to BLAS routines that are much more efficient than the corresponding for-loops.
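To illustrate the second point, here is a minimal NumPy sketch (illustrative only, not GENO code) that compares a vectorized matrix-vector product, which maps to a single BLAS call, with the equivalent explicit for-loops:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))
x = rng.standard_normal(50)

# Vectorized formulation: a single gemv-style BLAS call under the hood.
y_vec = A @ x

# Equivalent explicit for-loop formulation.
y_loop = np.zeros(100)
for i in range(100):
    for j in range(50):
        y_loop[i] += A[i, j] * x[j]

assert np.allclose(y_vec, y_loop)
```

Both produce the same result, but the vectorized form is dispatched to an optimized BLAS routine.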
A specification in GENO has four blocks:

1. declaration of the problem parameters, which can be of type Matrix, Vector, or Scalar,
2. declaration of the optimization variables, which can also be of type Matrix, Vector, or Scalar,
3. specification of the objective function, and finally
4. specification of the constraints, also in a MATLAB-like syntax that supports the following operators and functions: +, -, *, /, .*, ./, \(^\wedge \), \(.^\wedge \), \('\), log, exp, sin, cos, tanh, abs, norm1, norm2, sum, tr, det, inv.

See Fig. 1 for some illustrative examples.
GENO’s modeling language also allows the specification of nonsmooth optimization problems, for instance, problems that employ the norm1 function, that is, the nonsmooth \(\ell _1\)-norm. The nonsmooth optimization problems that are allowed by GENO can be written as \(\min _x \{\max _i f_i(x)\}\) with smooth functions \(f_i(x)\) [36], which is a fairly flexible class that accommodates many of the commonly encountered nonsmooth objective functions. All problems within this class can be transformed into constrained, smooth problems of the form \(\min _{x,\,t}\; t \quad \text {s.t.}\quad f_i(x) \le t \;\text { for all } i.\)
The transformed problems are then solved by a solver for constrained, smooth optimization problems. Hence, within the GENO software stack only a solver for constrained, smooth optimization problems is needed. In the next section we describe the solver that is implemented in the GENO software stack.
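As an illustration of this transformation (a sketch, not GENO’s actual pipeline), the following Python snippet reformulates a small min-max problem into a constrained, smooth one and solves it with SciPy’s SLSQP solver; the toy problem and the starting point are chosen for this example only:

```python
import numpy as np
from scipy.optimize import minimize

# Nonsmooth problem: minimize max(x^2, (x-2)^2); the optimum is at x = 1.
# Epigraph reformulation: minimize t  s.t.  x^2 <= t  and  (x-2)^2 <= t.
def objective(z):          # z = [x, t]
    return z[1]

constraints = [
    {"type": "ineq", "fun": lambda z: z[1] - z[0] ** 2},
    {"type": "ineq", "fun": lambda z: z[1] - (z[0] - 2.0) ** 2},
]

res = minimize(objective, x0=np.array([0.0, 5.0]),
               constraints=constraints, method="SLSQP")
```

At the solution, `res.x[0]` is close to 1 and the objective value `res.fun` close to 1, matching the optimum of the original nonsmooth problem.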
3 Generic Optimizer
GENO’s generic optimizer employs a solver for unconstrained, smooth optimization problems. This solver is then extended to also handle constraints. The choice of the solver that is implemented within the GENO software stack is motivated by applications in machine learning. Optimization problems in machine learning typically exhibit a few dozen up to a few million variables, and the involved data matrices do not have any special structure and are typically not sparse, that is, at least 10% of the entries are non-zero. These properties exclude second-order optimization algorithms and justify our choice to implement a slightly modified version of the L-BFGS-B algorithm [9, 44] that can handle smooth optimization problems that have no general constraints, except possibly bound constraints on the variables. It provides a good trade-off between the number of iterations and the complexity per iteration. It also does not assume any structure on the problem data, and it is numerically quite robust. On quadratic problems it shares the same convergence guarantees [22, 34] as Nesterov’s optimal gradient descent method [35], but compared to Nesterov’s method it is parameter-free, i.e., no parameters need to be tuned or known for the specific problem.
3.1 Solver for Bound-Constrained Smooth Problems
The solver for bound-constrained, smooth optimization problems combines a standard limited-memory quasi-Newton method with a projected gradient path approach. In each iteration, the gradient path is projected onto the box constraints, and the quadratic function based on the second-order approximation (L-BFGS) of the Hessian is minimized along this path. All variables that are at their boundaries are fixed, and only the remaining free variables are optimized using the second-order approximation. Any solution that is not within the bound constraints is projected back onto the feasible set by a simple min/max operation [32]. Only in rare cases does a projected point fail to provide a descent direction. In this case, instead of using the projected point, one picks the best point that is still feasible along the ray towards the solution of the quadratic approximation. Then, a line search is performed to satisfy the strong Wolfe conditions [41, 42]. This ensures convergence also in the non-convex case. The line search also removes the need for a predefined step length parameter. We use the line search proposed in [33], which we enhance by a backtracking line search in case the solver enters a region where the function is not defined.
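The effect of bound constraints and of the min/max projection can be illustrated with SciPy’s reference implementation of L-BFGS-B; this is a sketch for illustration, not the modified solver used in GENO:

```python
import numpy as np
from scipy.optimize import minimize

# Quadratic with box constraints 0 <= x_i <= 1; the unconstrained
# optimum (2, -1) lies outside the box.
def f(x):
    return (x[0] - 2.0) ** 2 + (x[1] + 1.0) ** 2

def grad(x):
    return np.array([2.0 * (x[0] - 2.0), 2.0 * (x[1] + 1.0)])

res = minimize(f, x0=np.zeros(2), jac=grad, method="L-BFGS-B",
               bounds=[(0.0, 1.0), (0.0, 1.0)])
# The constrained solution sits on the boundary of the box: x ~ (1, 0).

# The min/max projection onto a box mentioned above is simply:
def project(x, lo, hi):
    return np.minimum(np.maximum(x, lo), hi)
```

Note that projecting the unconstrained optimum (2, -1) onto the box with `project` yields the same point (1, 0) as the bound-constrained solve, which is what makes the simple min/max operation attractive.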
3.2 Solver for Constrained Smooth Problems
There are quite a few options for solving smooth, constrained optimization problems. We decided to use the augmented Lagrangian approach [21, 38]. It allows us to (re)use our solver for smooth, unconstrained problems, it is fairly robust, and it does not require tuning any parameters. The augmented Lagrangian method can be used for solving the following general standard form of an abstract constrained optimization problem
\(\min _x\; f(x) \quad \text {s.t.}\quad h(x) = 0,\; g(x) \le 0, \qquad (1)\)
where \(x\in \mathbb {R}^n\), \(f:\mathbb {R}^n \rightarrow \mathbb {R}\), \(h:\mathbb {R}^n\rightarrow \mathbb {R}^m\), \(g:\mathbb {R}^n\rightarrow \mathbb {R}^p\) are differentiable functions, and the equality and inequality constraints are understood componentwise.
The augmented Lagrangian of Problem (1) is the following function
\(L_\rho (x, \lambda , \mu ) = f(x) + \lambda ^\top h(x) + \frac{\rho }{2}\left\| h(x)\right\| ^2 + \frac{1}{2\rho }\left( \left\| (\mu + \rho g(x))_+\right\| ^2 - \left\| \mu \right\| ^2\right) ,\)
where \(\lambda \in \mathbb {R}^m\) and \(\mu \in \mathbb {R}_{\ge 0}^p\) are the Lagrange multipliers, also known as dual variables, \(\rho >0\) is a constant, \(\left\| \cdot \right\| \) denotes the Euclidean norm, and \((v)_+\) denotes \(\max \{v, 0\}\). The augmented Lagrangian is the standard Lagrangian of Problem (1) augmented by a quadratic penalty term. The quadratic term provides increased stability during the optimization process, which can be seen, for example, in the case that Problem (1) is a linear program.
The augmented Lagrangian method, shown as Algorithm 1, runs in iterations. Upon convergence, it returns an approximate solution x to the original problem along with approximate Lagrange multipliers for the dual problem. If Problem (1) is convex, then the algorithm returns the globally optimal solution. Otherwise, it returns a local optimum [6]. The update of the penalty parameter \(\rho \) can be omitted and the algorithm still converges [6]. However, in practice it is beneficial to increase it depending on the progress in satisfying the constraints [7]: if the infinity norm of the constraint violation decreases by a factor of less than \(\tau =1/2\) in one iteration, then \(\rho \) is multiplied by two.
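A minimal sketch of the augmented Lagrangian iteration, here for a toy equality-constrained problem with SciPy’s L-BFGS-B as the inner solver, illustrates the multiplier and penalty updates described above (the problem, iteration count, and update rule details are illustrative assumptions, not Algorithm 1 verbatim):

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: minimize (x0-1)^2 + (x1-1)^2  s.t.  x0 + x1 = 1.
# The optimum is x = (0.5, 0.5).
def f(x):
    return (x[0] - 1.0) ** 2 + (x[1] - 1.0) ** 2

def h(x):                       # equality constraint, h(x) = 0
    return np.array([x[0] + x[1] - 1.0])

lam, rho, tau = np.zeros(1), 1.0, 0.5
x = np.zeros(2)
prev_viol = np.inf
for _ in range(30):
    # Inner step: minimize the augmented Lagrangian in x.
    aug = lambda z: f(z) + lam @ h(z) + 0.5 * rho * np.sum(h(z) ** 2)
    x = minimize(aug, x, method="L-BFGS-B").x
    lam = lam + rho * h(x)      # multiplier update
    viol = np.max(np.abs(h(x)))
    if viol > tau * prev_viol:  # insufficient progress: double the penalty
        rho *= 2.0
    prev_viol = viol
```

The outer loop drives the constraint violation to zero while the inner solver only ever sees an unconstrained (or bound-constrained) smooth problem, which is exactly why the method composes well with L-BFGS-B.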
4 Matrix and Tensor Calculus
The solver at the core of GENO’s generic optimizer, an implementation of the L-BFGS-B algorithm for bound-constrained smooth problems, runs in iterations. In each iteration, expressions for the objective function and its gradients are evaluated. Within GENO these expressions, especially the gradients, have to be made available to the solver. Expressions for objective functions are given in GENO’s modeling language, which uses a vectorized notation, that is, a notation that avoids explicit indices. The advantage of a vectorized notation is that expressions can be mapped more or less directly to BLAS calls and thus to highly optimized BLAS implementations. For GENO we also want this advantage for the gradients. Hence, we need to compute derivatives of matrix expressions. Although computing derivatives of matrix and tensor expressions is a fundamental and frequent task, surprisingly, no algorithm existed that would solve this problem in the general case. In the following, we describe our approach [24, 28] that, for the first time, allowed computing derivatives of general tensor expressions. It was shown in [24] that evaluating derivatives of non-scalar-valued functions computed by this approach is two orders of magnitude faster than previous state-of-the-art approaches when evaluated on the CPU and up to three orders of magnitude faster when evaluated on the GPU. An implementation of our approach is integrated into the GENO software stack. It is also available as a standalone tool at www.MatrixCalculus.org [26].
4.1 Problems with Matrix Notation
Computing derivatives of scalar functions \(f:\mathbb {R}\rightarrow \mathbb {R}\) is a straightforward task that is already taught in high school. One just applies the chain rule repeatedly and multiplies the partial derivatives together. For instance, consider the function \(f(x) = \sin (x^2)\). Its derivative is \(f^\prime (x) = \cos (x^2)\cdot 2x\). However, this no longer works in the matrix and tensor case. Compared to the scalar case, where only one type of multiplication operator exists, there are several types of multiplication in the matrix and tensor case. It has been shown that 24 different types of multiplication are necessary for representing the derivatives of matrix expressions in the linear case alone [37]. Hence, it is essential to find a good representation of matrix and tensor multiplications.
Furthermore, when computing derivatives of vector and matrix expressions, even matrix notation is not sufficient to express all derivatives. For instance, for a function \(f:\mathbb {R}^n\rightarrow \mathbb {R}^m\), the derivative is a matrix \(M\in \mathbb {R}^{m\times n}\). But already its second derivative is a third-order tensor \(T\in \mathbb {R}^{m\times n\times n}\), which cannot be represented in standard matrix notation. One usually circumvents this by using the \({{\,\textrm{vec}\,}}\)-operator, which maps a matrix to a vector by stacking its columns on top of each other, together with the Kronecker product. This way, one can flatten some dimensions and emulate higher-order tensors and their multiplications. However, still not all necessary multiplications can be represented this way, and it unnecessarily complicates the representation. And even in the two-dimensional case, i.e., when the derivative is a tensor of order two, it might have no corresponding representation as a matrix. For instance, consider the simple quadratic function \(f(x) = x^\top Ax\), where \(x\in \mathbb {R}^n\) is a vector and \(A\in \mathbb {R}^{n \times n}\) is a matrix. When computing the derivative of f with respect to x using the chain rule, one has to compute the derivative of \(x^\top \) with respect to x, i.e., the derivative of the function that maps x to its transpose. This is not the identity matrix. In fact, it is not even representable as a matrix. In the more powerful Ricci notation [39] it would be written as the tensor \(\delta _{ij}\). Hence, the right representation of tensors and of the operators on them, especially multiplications between them, is crucial. In fact, choosing the right representation has led to the first general and coherent matrix and tensor calculus theory [24]. Before, only a number of special cases could be treated systematically.
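The well-known symbolic derivative \((A + A^\top )x\) of the quadratic function \(x^\top Ax\) can be checked numerically; the following sketch compares it against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda z: z @ A @ z

# Symbolic derivative of x^T A x with respect to x: (A + A^T) x.
grad = (A + A.T) @ x

# Central finite-difference approximation of the gradient.
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
               for e in np.eye(n)])
assert np.allclose(grad, fd, atol=1e-5)
```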
While the first theory used Ricci notation to represent tensors and their multiplications, it turned out that using a generalized form of Einstein notation makes the process of computing derivatives even simpler and more coherent [28].
4.2 Einstein Notation
In tensor calculus one can distinguish three types of multiplication, namely inner, outer, and elementwise multiplication. Indices are used for distinguishing between these types. For tensors A, B, and C, any multiplication of A and B can be written as
\(C[s_3] = \sum _{(s_1\cup s_2)\setminus s_3} A[s_1]\cdot B[s_2], \qquad (2)\)
where C is the result tensor and \(s_1, s_2\), and \(s_3\) are the index sets of the left argument, the right argument, and the result tensor, respectively. The summation is over all indices that appear in at least one of the two multiplication’s arguments A and B and are not present in the result tensor C. The index set of the result tensor is always a subset of the union of the index sets of the multiplication’s arguments, that is, \(s_3\subseteq (s_1\cup s_2)\). In the following we denote the generic tensor multiplication as defined in Eq. (2) simply as
\(C = A *_{(s_1, s_2, s_3)} B. \qquad (3)\)
This notation is basically identical to the tensor multiplication einsum in NumPy, TensorFlow, and PyTorch, and to the notation used in the Tensor Comprehension Package [40].
Note that the \(*_{(s_1, s_2, s_3)}\)-notation comes close to standard Einstein notation. In Einstein notation the index set \(s_3\) of the output is omitted, and the convention is to sum over all shared indices in \(s_1\) and \(s_2\). However, this restricts the types of multiplications that can be represented. The set of multiplications that can be represented in standard Einstein notation is a proper subset of the multiplications that can be represented by our notation. For instance, standard Einstein notation is not capable of representing elementwise multiplications directly. Still, in the following we refer to the \(*_{(s_1, s_2, s_3)}\)-notation simply as Einstein notation, as is standard practice in many linear algebra packages.
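The three multiplication types, and the role of the result index set \(s_3\), can be demonstrated with NumPy's einsum, which implements exactly this notation:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
v = rng.standard_normal(3)
w = rng.standard_normal(3)

# Inner multiplication: the shared index j is summed out (matrix product).
C_inner = np.einsum('ij,jk->ik', A, B)   # s3 = {i, k}
# Outer multiplication: no index is summed (outer product).
C_outer = np.einsum('i,j->ij', v, w)     # s3 = {i, j}
# Elementwise multiplication: the shared index i is KEPT in the result --
# the case standard Einstein notation cannot express directly.
C_elem = np.einsum('i,i->i', v, w)       # s3 = {i}

assert np.allclose(C_inner, A @ B)
assert np.allclose(C_outer, np.outer(v, w))
assert np.allclose(C_elem, v * w)
```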
4.3 Tensor Calculus
In the following, let \(\left\| A\right\| =\sqrt{\sum _s A[s]^2}\) denote the norm of a tensor A. For vectors it coincides with the Euclidean norm and for matrices with the Frobenius norm. The following definition generalizes the standard derivative to the multidimensional case.
Definition 1 (Fréchet Derivative)
Let \(f:\mathbb {R}^{n_1\times n_2\times \ldots \times n_k}\rightarrow \mathbb {R}^{m_1\times m_2\times \ldots \times m_l}\) be a function that takes an order-k tensor as input and maps it to an order-l tensor as output. Then, \(D\in \mathbb {R}^{m_1\times m_2\times \ldots \times m_l \times n_1\times n_2\times \ldots \times n_k}\) is called the derivative of f at x if and only if
\(\lim _{h\rightarrow 0} \frac{\left\| f(x+h) - f(x) - D\circ h\right\| }{\left\| h\right\| } = 0,\)
where \(\circ \) is an inner tensor product.
Here, the dot product notation \(D\circ h\) is short for the inner product \(D*_{(s_1s_2, s_2, s_1)}h\), where \(s_1s_2\) is the index set of D and \(s_2\) is the index set of h. For instance, if \(D\in \mathbb {R}^{m_1\times n_1 \times n_2}\) and \(h\in \mathbb {R}^{n_1\times n_2}\), then \(s_1s_2=\{i,j,k\}\) and \(s_2=\{j,k\}\).
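The same example can be written directly as an einsum contraction over the shared indices j and k:

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.standard_normal((2, 3, 4))   # index set s1 s2 = {i, j, k}
h = rng.standard_normal((3, 4))      # index set s2 = {j, k}

# D o h = D *_{(s1 s2, s2, s1)} h: contract over the shared indices j, k,
# leaving only the output index set s1 = {i}.
Dh = np.einsum('ijk,jk->i', D, h)
assert Dh.shape == (2,)
```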
With this definition at hand, we can compute derivatives of matrix and tensor expressions in Einstein notation. As noted at the beginning of this section, derivatives are usually computed using the chain rule. There are two major orderings in which the chain rule can be applied: in a forward fashion and in a reverse fashion. These orderings are known as forward mode and reverse mode in the area of algorithmic differentiation (AD, a.k.a. automatic differentiation) [18]. They both result in the same derivative but not necessarily in the same expression for the derivative. The forward mode coincides with what is usually taught in high school and is commonly referred to as symbolic computation of derivatives [23]. Here, we only describe the reverse mode, since this is the mode that is used within the GENO software stack.
Any expression can be represented as a directed acyclic expression graph (expression DAG). Figure 2 shows the expression DAG for the objective function of the logistic regression, i.e.,
where \(\odot \) denotes the elementwise multiplication.
The nodes of the DAG that have no incoming edges represent the variables or constants of the expression and are referred to as input nodes. The nodes of the DAG that have no outgoing edges represent the functions that the DAG computes and are referred to as output nodes. Let the DAG have n input nodes (variables) and m output nodes (functions). Note that the DAG in Fig. 2 has only one output node. We label the input nodes as \(x_0, \ldots , x_{n-1}\), the output nodes as \(y_0, \ldots , y_{m-1}\), and the internal nodes as \(v_0,\ldots , v_{k-1}\). Every internal and every output node represents an operator whose arguments are supplied by the incoming edges.
When evaluating the DAG, i.e., computing the function values that the DAG represents for some given input, one proceeds from the input nodes to the output nodes. In forward mode automatic differentiation one proceeds in the same direction for computing the derivative, and in reverse mode in reverse order, i.e., from the output to the input nodes. Each node \(v_i\) will eventually store the derivative \(\frac{\partial y_j}{\partial v_i}\), which is usually denoted as \(\bar{v}_i\), where \(y_j\) is the function to be differentiated. This partial derivative is often referred to as the adjoint. These derivatives are computed as follows: First, the derivatives \(\frac{\partial y_j}{\partial y_i}\) are stored at the output nodes of the DAG. Then, the derivatives that are stored at the remaining nodes, here called z, are iteratively computed by summing over all their outgoing edges as follows:
\(\bar{z} = \sum _{(z, v)\in E} \bar{v}\cdot \frac{\partial v}{\partial z}, \qquad (4)\)
where E denotes the edge set of the DAG and the multiplication is again tensorial. The following theorems specify the type of tensor multiplication for the reverse-mode update in Eq. (4). Their proofs can be found in [29].
Theorem 1
Let Y be an output node with index set \(s_4\) and let \(C = A *_{(s_1, s_2, s_3)} B\) be a multiplication node of the expression DAG. Then the contribution of C to the adjoint \(\bar{B}\) of B is \(\bar{C}*_{(s_4s_3, s_1, s_4s_2)}A\) and its contribution to the adjoint \(\bar{A}\) of A is \(\bar{C}*_{(s_4s_3, s_2, s_4s_1)}B\).
If the output function Y in Theorem 1 is scalar-valued, then we have \(s_4=\emptyset \) and the adjoint coincides with the function implemented in all modern deep learning frameworks, including TensorFlow and PyTorch. Hence, our approach can be seen as a direct generalization of the scalar case.
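For the scalar-valued case (\(s_4 = \emptyset \)), Theorem 1 can be verified numerically with einsum; the sketch below checks the adjoints of a matrix product against the familiar matrix-calculus rules:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 4))   # index set s1 = {i, j}
B = rng.standard_normal((4, 5))   # index set s2 = {j, k}
C = np.einsum('ij,jk->ik', A, B)  # s3 = {i, k}, i.e. C = A @ B

# Scalar output Y = sum(G * C) for some weight tensor G, so s4 = {}
# and the adjoint of C is simply G.
G = rng.standard_normal((3, 5))
Cbar = G

# Theorem 1 with s4 = {}:
#   Abar = Cbar *_{(s3, s2, s1)} B  and  Bbar = Cbar *_{(s3, s1, s2)} A.
Abar = np.einsum('ik,jk->ij', Cbar, B)
Bbar = np.einsum('ik,ij->jk', Cbar, A)

# These coincide with the familiar adjoints of C = A @ B.
assert np.allclose(Abar, Cbar @ B.T)
assert np.allclose(Bbar, A.T @ Cbar)
```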
Theorem 2
Let Y be an output function with index set \(s_3\), let f be a general unary function whose domain has index set \(s_1\) and whose range has index set \(s_2\), let A be a node in the expression DAG, and let \(C= f(A)\). The contribution of the node C to the adjoint \(\bar{A}\) is
where \(f^\prime \) is the derivative of f.
In case that the general unary function is simply an elementwise unary function that is applied elementwise to a tensor, Theorem 2 simplifies as follows.
Theorem 3
Let Y be an output function with index set \(s_2\), let f be an elementwise unary function, let A be a node in the expression DAG with index set \(s_1\), and let \(C= f(A)\) where f is applied elementwise. The contribution of the node C to the adjoint \(\bar{A}\) is
where \(f^\prime \) is the derivative of f.
Table 1 shows the individual steps of the reverse mode applied to the expression graph in Fig. 2. Note that the reverse mode manages to compute the derivative of the output function with respect to all input variables in one pass. Again, the last column shows the derivatives in matrix notation after a few simplifications have been applied, like the removal of zero and identity tensors. From the first two rows we can read off the derivative of f with respect to X and the derivative with respect to w. The values of the intermediate results and common subexpressions \(v_1\) and \(v_2\) can be substituted again to obtain the final expression \(X^\top \cdot (y\odot \exp (Xw) \oslash (\exp (Xw) + 1))\). This expression can then be mapped very easily to a NumPy expression. In the next section, we will discuss how to map such expressions also to different hardware and software backends.
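A direct NumPy transcription of the final expression might look as follows; the shapes of X, w, and y are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 10))   # data matrix
w = rng.standard_normal(10)          # parameter vector
y = rng.integers(0, 2, 100) * 2.0 - 1.0   # labels in {-1, +1}

# X^T (y (elementwise *) exp(Xw) (elementwise /) (exp(Xw) + 1)),
# with the common subexpression exp(Xw) computed once.
t = np.exp(X @ w)
grad = X.T @ (y * t / (t + 1.0))
```

Each line maps onto a vectorized primitive (matrix-vector product, elementwise exp, multiply, divide), which is exactly what makes the subsequent mapping to BLAS calls straightforward.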
5 autoBLAS
GENO aims at providing state-of-the-art performance on a wide variety of backends, including multi-core CPUs and GPUs. Hence, it is necessary to generate efficient code for all these backends. This is the purpose of autoBLAS. GENO does not need to compile the specification of an optimization problem directly into executable code; instead, it can map it to an intermediate representation where linear algebra expressions are given as blocks of autoBLAS code. The autoBLAS precompiler then compiles the intermediate code into standard code for the specified backends. autoBLAS itself features an intuitive syntax for linear algebra expressions that is easy to read and comprehend, and delegates the details of their execution to highly efficient implementations of BLAS routines for the respective backends, for instance AMD Blis [43], Intel MKL [10], Arm Performance Libraries [30], NVIDIA cuBLAS [11], and AMD clBLAS [4].
5.1 A Simple autoBLAS Example
For illustrating autoBLAS, we discuss a minimal example, namely a matrix-vector product. Listing 1.1 shows a snippet of C++ code initializing a set of std::vectors that represent vectors and matrices, followed by a pragma-style declaration of an autoBLAS section. The autoBLAS section first declares two vectors x and y, and a matrix A. Each declaration comes with a set of name-value pairs, like data=x.data() or rows=rows, that describe required properties for generating code to evaluate expressions of the associated variables. The set of supported names and the restrictions on values lie in the responsibility of the selected host-language context. Here, the C-language context has been chosen by setting c=c. The currently supported contexts are the C language, the Eigen library, the NumPy library, and cuBLAS (CUDA).
The code in Listing 1.1 is, of course, not valid C++ code and cannot be compiled directly with a standard C++ compiler. In order to get host-language code for the expressions that are stated in embedded autoBLAS sections, the autoBLAS precompiler has to be called first. Listing 1.2 shows how to invoke the autoBLAS precompiler on a file example.c.in. In this simple example, the -b flag is set to select the target routines for the host-language mappings, here, the standard C binding cblas.
In our specific example, the autoBLAS precompiler replaces the autoBLAS section in the host-language file with a call to gemv, which is the BLAS routine that computes matrix-vector products [1]. The generated code, shown in Listing 1.3, is now valid C++ code that can be passed to a conforming compiler like gcc.
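The effect of such a generated gemv call can be mimicked from Python via SciPy’s low-level BLAS bindings (a sketch for illustration; autoBLAS itself emits cblas calls in C):

```python
import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(6)
A = np.asfortranarray(rng.standard_normal((4, 3)))  # column-major, as BLAS expects
x = rng.standard_normal(3)

# dgemv computes y = alpha * A @ x (optionally + beta * y).
y = blas.dgemv(1.0, A, x)
assert np.allclose(y, A @ x)
```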
If we want to generate code for the CUDA backend, then we just have to invoke the autoBLAS precompiler with autoblas -b cuda < example.c.in > example.c. The generated code, shown in Listing 1.4, is now valid C++/CUDA code that again can be passed directly to a conforming compiler.
5.2 Design
By defining an embedded language of its own, autoBLAS is as intuitive to use as task-specific frameworks like MATLAB when it comes to expressing what to compute. Additionally, by not being bound to a particular programming language, autoBLAS can perform any necessary transformation and optimization on the symbolic level at compile time, even beyond the scope of a single statement. Finally, autoBLAS delegates the task of how to evaluate the optimized expressions by generating the corresponding BLAS calls. This allows the user to utilize highly efficient BLAS implementations for the target platform without having to write these calls by hand.
Figure 3 illustrates the three abstract steps of the autoBLAS compiler. The frontend is the userfacing part of autoBLAS and comprises both the expression syntax as well as the context selection. The context specifies attributes of the variables like, for instance, their memory layout and the BLAS selections available at compile time.
The core implements a set of symbolic optimizations to increase execution performance at runtime, while also allocating and reusing memory for temporaries, if necessary. By performing these syntax-tree optimizations independently of a specific target API, autoBLAS provides a uniform evaluation semantics across different target platforms, thereby minimizing unpleasant surprises like differing operator semantics or optimization behavior when switching between libraries.
The backend generates code for the optimized expressions for the respective linear algebra library selected by the caller. Backends define a set of necessary attributes for evaluating expressions into code. For instance, for the cblas [1] backend, a dense matrix is often represented by a data pointer, a storage orientation, the number of rows and columns and the size of the leading dimension. A context is compatible with a specific backend if it provides all necessary attributes for a particular data type.
An advantage of selecting a BLAS-like backend is that, when later profiling the code, the user is able to directly identify potential bottlenecks. This is in contrast to template-based libraries like Eigen, where the actually called routines are not directly visible and do not correspond to a particular line within the host-language code.
A major benefit of the separation into frontend, core, and backend is that extending autoBLAS with a new backend is rather simple and in practice merely requires deriving from a class and implementing BLAS-expression-to-target-code mappings. At the same time, a developer who is extending autoBLAS in this way still benefits from all the symbolic optimizations implemented in the autoBLAS core.
6 Conclusions
Making generic optimization (GENO) work efficiently requires several fairly different interoperable software components. In this chapter we have described these components and their integration into the GENO software stack. By carefully designing, implementing, and integrating the components of the GENO software stack, we are able to generate optimization code that is competitive with problem-specific handwritten solvers and orders of magnitude faster than competing approaches that are comparably easy to use. Furthermore, the components, specifically the generic optimizer, the matrix and tensor calculus, and autoBLAS, are of independent interest and are also used in projects other than GENO.
References
Blackford, L.S., et al.: An updated set of basic linear algebra subprograms (BLAS). ACM Trans. Math. Softw. 28(2), 135–151 (2002)
Agrawal, A., Verschueren, R., Diamond, S., Boyd, S.: A rewriting system for convex optimization problems. J. Control Decis. 5(1), 42–60 (2018)
Anderson, E., et al.: LAPACK Users’ Guide. 3rd edn. Society for Industrial and Applied Mathematics, Philadelphia, PA (1999)
Various authors: clBLAS (2017). https://github.com/clMathLibraries/clBLAS. Accessed 21 Aug 2017
Banjac, G., Stellato, B., Moehle, N., Goulart, P., Bemporad, A., Boyd, S.P.: Embedded code generation using the OSQP solver. In: CDC, pp. 1906–1911 (2017)
Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont, MA (1999)
Birgin, E.G., Martínez, J.M.: Practical augmented Lagrangian methods for constrained optimization, Fundamentals of Algorithms, vol. 10. SIAM (2014)
Brooke, A., Kendrick, D., Meeraus, A.: GAMS: release 2.25 : a user’s guide. The Scientific press series, Scientific Press (1992)
Byrd, R.H., Lu, P., Nocedal, J., Zhu, C.: A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 16(5), 1190–1208 (1995)
Intel Corporation: Intel Math Kernel Library (2017). https://software.intel.com/en-us/mkl. Accessed 21 Aug 2017
NVIDIA Corporation: NVIDIA cuBLAS (2017). https://developer.nvidia.com/cublas. Accessed 21 Aug 2017
CVX Research Inc: CVX: Matlab software for disciplined convex programming, version 2.1 (2018). http://cvxr.com/cvx
Diamond, S., Boyd, S.: CVXPY: a Python-embedded modeling language for convex optimization. J. Mach. Learn. Res. 17(83), 1–5 (2016)
Dunning, I., Huchette, J., Lubin, M.: JuMP: a modeling language for mathematical optimization. SIAM Rev. 59(2), 295–320 (2017)
Fourer, R., Gay, D.M., Kernighan, B.W.: AMPL: a modeling language for mathematical programming. Thomson/Brooks/Cole (2003)
Giselsson, P., Boyd, S.: Linear convergence and metric selection for Douglas-Rachford splitting and ADMM. IEEE Trans. Autom. Control 62(2), 532–544 (2017)
Grant, M., Boyd, S.: Graph implementations for nonsmooth convex programs. In: Blondel, V.D., Boyd, S.P., Kimura, H. (eds.) Recent Advances in Learning and Control. Lecture Notes in Control and Information Sciences, vol. 371, pp. 95–110. Springer, Cham (2008). https://doi.org/10.1007/978-1-84800-155-8_7
Griewank, A., Walther, A.: Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd edn. SIAM (2008)
Hart, W.E., et al.: Pyomo – Optimization Modeling in Python, vol. 67, 2nd edn. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58821-6
Hart, W.E., Watson, J.P., Woodruff, D.L.: Pyomo: modeling and solving mathematical programs in Python. Math. Program. Comput. 3(3), 219–260 (2011)
Hestenes, M.R.: Multiplier and gradient methods. J. Optim. Theory Appl. 4(5), 303–320 (1969)
Huang, H.Y.: Unified approach to quadratically convergent algorithms for function minimization. J. Optim. Theory Appl. 5(6), 405–423 (1970)
Laue, S.: On the equivalence of forward mode automatic differentiation and symbolic differentiation. arXiv e-prints abs/1904.02990 (2019)
Laue, S., Mitterreiter, M., Giesen, J.: Computing higher order derivatives of matrix and tensor expressions. In: NeurIPS, pp. 2755–2764 (2018)
Laue, S., Mitterreiter, M., Giesen, J.: GENO – generic optimization for classical machine learning. In: NeurIPS, pp. 2187–2198 (2019)
Laue, S., Mitterreiter, M., Giesen, J.: Matrixcalculus.org – computing derivatives of matrix and tensor expressions. In: ECML-PKDD, pp. 769–772 (2019)
Laue, S., Mitterreiter, M., Giesen, J.: GENO – optimization for classical machine learning made fast and easy. In: AAAI, pp. 13620–13621 (2020)
Laue, S., Mitterreiter, M., Giesen, J.: A simple and efficient tensor calculus. In: AAAI, pp. 4527–4534 (2020)
Laue, S., Mitterreiter, M., Giesen, J.: A simple and efficient tensor calculus for machine learning. Fund. Inform. 177(2), 157–179 (2020)
Arm Limited: Arm Performance Libraries – optimized BLAS, LAPACK and FFT (2017). https://developer.arm.com/products/software-development-tools/hpc/arm-performance-libraries. Accessed 21 Aug 2017
Mattingley, J., Boyd, S.: CVXGEN: a code generator for embedded convex optimization. Optim. Eng. 13(1), 1–27 (2012)
Morales, J.L., Nocedal, J.: Remark on “Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization”. ACM Trans. Math. Softw. 38(1), 7:1–7:4 (2011)
Moré, J.J., Thuente, D.J.: Line search algorithms with guaranteed sufficient decrease. ACM Trans. Math. Softw. 20(3), 286–307 (1994)
Nazareth, L.: A relationship between the BFGS and conjugate gradient algorithms and its implications for new algorithms. SIAM J. Numer. Anal. 16(5), 794–800 (1979)
Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\). Doklady AN SSSR (translated as Soviet Math. Dokl.) 269 (1983)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Olsen, P.A., Rennie, S.J., Goel, V.: Efficient automatic differentiation of matrix functions. In: Forth, S., Hovland, P., Phipps, E., Utke, J., Walther, A. (eds.) Recent Advances in Algorithmic Differentiation. Lecture Notes in Computational Science and Engineering, vol. 87, pp. 71–81. Springer, Cham (2012). https://doi.org/10.1007/978-3-642-30023-3_7
Powell, M.J.D.: Algorithms for nonlinear constraints that use Lagrangian functions. Math. Program. 14(1), 224–248 (1969)
Ricci, G., Levi-Civita, T.: Méthodes de calcul différentiel absolu et leurs applications. Math. Ann. 54(1–2), 125–201 (1900)
Vasilache, N., et al.: Tensor comprehensions: frameworkagnostic highperformance machine learning abstractions. arXiv preprint arXiv:1802.04730 (2018)
Wolfe, P.: Convergence conditions for ascent methods. SIAM Rev. 11(2), 226–235 (1969)
Wolfe, P.: Convergence conditions for ascent methods. ii: some corrections. SIAM Rev. 13(2), 185–188 (1971)
Van Zee, F.G., van de Geijn, R.A.: BLIS: a framework for rapidly instantiating BLAS functionality. ACM Trans. Math. Softw. 41(3), 14:1–14:33 (2015)
Zhu, C., Byrd, R.H., Lu, P., Nocedal, J.: Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw. 23(4), 550–560 (1997)
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2022 The Author(s)
Cite this chapter
Giesen, J., Kuehne, L., Laue, S. (2022). The GENO Software Stack. In: Bast, H., Korzen, C., Meyer, U., Penschuck, M. (eds) Algorithms for Big Data. Lecture Notes in Computer Science, vol 13201. Springer, Cham. https://doi.org/10.1007/978-3-031-21534-6_12
Print ISBN: 978-3-031-21533-9
Online ISBN: 978-3-031-21534-6