1 Introduction

Neural networks are commonly used in AI applications ranging from computer vision, autonomous driving, robotics, games, intelligent production, systems biology, human-computer-interaction to even arts or music. Nearly all existing methods are commonly trained using gradient descent-based optimization and auto-differentiation, so that such optimized systems can be summarized with the term differentiable programming, see Baydin et al. [8]. It is well known that the gradient descent-based optimization is prone to convergence problems, overfitting or getting stuck in local minima, so that many different (differential) approaches exist to overcome these issues, e.g., using special optimizers (adam, sgdm, etc.), drop-out layers, etc.

For these reasons, alternative approaches, which allow a global optimization, gained increased attention in the past, as outlined in Section 2. This work follows this research strand by modeling and optimization of a shallow neural network, which consists of non-differentiable neurons. The perceptron model, as it has been originally formulated by Rosenblatt in 1958, is used. As this formulation is not differentiable, this work explains how this model can be mapped to a mixed integer linear program (MILP) for given training data in a supervised learning setting. The network weights can then be (globally) optimized using the simplex algorithm and branch-and-bound or branch-and-cut. The drawback of the proposed method is that the formulation is reasonable memory intensive. It requires many slack variables since all training data and computations through the network are expressed as equality and inequality constraints. Thus, in the current stage the proposed method is mainly suited for small sized datasets.

The optimized network weights can then be assembled to a neural network and used for forward inference on unseen data. The proposed formulation also allows the neat integration of additional constraints on the network to enforce sparsity, efficiency and compactness. In the experiments, examples on enforcing binary or sparseness constraints are presented, which can be quite problematic using traditional differentiable programming. Additionally, a kill-switch on neurons to minimize the amount of required neurons during training is presented and finally, the amount of neural activations can also be minimized. Critical parameters in differentiable programming such as random seeds, see Picard [58], dropouts, learning rates, etc. do not exist in the proposed optimization framework. On different datasets competitive performance of the globally optimized sparse MILP neural networks is shown, even though the networks are rather simple and shallow.

To summarize, this work presents the following contributions:

  1. 1.

    In line with recent literature, the formulation and optimization of a non-differentiable Rosenblatt perceptron and a corresponding shallow neural network as a MILP for given training data, is presented.

  2. 2.

    This work presents modifications to prior art in MILP models of neural networks by incorporating sparsity constraints on the weights, activations and used neurons. This is useful for explainable AI, the challenge of feature selection and resource optimization.

  3. 3.

    Empirical experiments on a number of datasets, in particular including a study on feature selection, are presented.

  4. 4.

    Software examples for MILP-based optimization of neural networks are provided.Footnote 1

2 Foundations and Literature

2.1 The Perceptron

In 1943, McCulloch and Pitts formulated their idea for logical calculus using concepts from nervous activities, see McCulloch and Pitts [49]. A McCulloch-Pitts cell with n exciting input lines on which the signals \((x_1 \ldots x_n)\) are applied, and m inhibiting input lines on which the signals \((y_1 \ldots y_m)\) are applied, calculation works as follows: If \(m \ge 1\) and if one of the signals \((y_1 \ldots y_m)\) equals 1, the neuron outputs a 0. Otherwise (which is the standard case) the input signals \((x_1 \ldots x_n)\) are added to up to \(x=\sum x_i\). For \(n = 0\), \(x = 0\) is set. The value x is compared to the threshold \(\theta \). If the values x is greater than or equal to \(\theta \), the neuron returns 1, otherwise it returns 0. In 1958, Frank Rosenblatt published his perceptron model which extends the summation to a scalar product, followed by a step function, see Rosenblatt [60]. The perceptron can be summarized as

$$\begin{aligned} \sigma _j = \left\{ \begin{array}{l} 1: \sum _i \omega _{ij}x_i+b > 0\\ 0: \text {else} \end{array}. \right. \end{aligned}$$

The bias value b corresponds to the decision threshold and \(\omega _{ij}\) are learnable parameters. A combination of such perceptrons in a directed acyclic graph leads to a classic (e.g., fully connected) neural network.

2.2 Mixed Integer Linear Programming

Linear programming (LP) is a method for the minimization (or maximization) of a linear objective function, subject to linear equality or inequality constraints, see Dantzig [17]. Any LP can be expressed in a canonical form as

$$\begin{aligned} \min \, c^Tx \,\;&\text {s.t.}&Ax \le b, \hspace{0.2cm} x \ge 0. \end{aligned}$$

A standard approach for solving such a LP is the simplex algorithm. Note, that standard solvers also accept equality constraints of the form \(A_\textrm{eq}x=b_\textrm{eq}\) which are directly transformed in two inequality constraints of the form \(A_\textrm{eq}x\le b_\textrm{eq}\) and \(-A_\textrm{eq}x\le -b_\textrm{eq}\). Thus, the constraint matrix A refers to the inequality constraints and \(A_\textrm{eq}\) to the equality constraints for better readability. It is also possible to allow for the optimization of integer constraints, see Murty [52] or Dennis et al. [18], which is then called a mixed integer linear program (MILP). The solvers then use concepts such as branch-and-cut or branch-and-bound.

A (MI)LP is a very powerful tool, which allows the efficient optimization of graph problems, e.g., graph cut, graph matching, max-flow or network optimization, see Almohamad et al., Amaldi et al. or Komodakis et al. [1, 2, 36], logic inference, see Makhortov et al. or Krishnan [37, 46] and the optimization of NP-hard problems, such as the traveling salesman problem by using iterative subtour elimination. Here an efficient branching of the MILP prevents exploring sub-optimal solution spaces. Paul [56] gives a comprehensive introduction to formulate logic calculus (and beyond) using MILPs. It is also possible to optimize decision trees as proposed by Zhu et al. [70]; or support vector machines, see Nguyen et al. [55], using respective LP formulations.

2.3 Trained Neural Networks and MILP

Already trained neural networks and mixed integer linear programming have been brought successfully together in the past. Note, that this work is focussing on the direct optimization of the network weights and its parameters from training data with a close connection to the works presented in Section 2.5. But since the combination of trained neural networks with MILP achieved very important results on network validation, verification and adversarial robustness, the following subsection summarizes some recent works in this field.

Neural networks and (mixed integer) linear programming have been brought together in the context of network fooling and model checking as proposed by Heo et al. and Modas et al. [31, 51]. Already trained networks can be analyzed using LPs, see Anderson et al. [3] and there exist works where neural networks are proposed to solve linear programs themselves, see Liu et al. [41]. Thiago et al. [64] introduce an approach to remove unneccessary units and layers of a neural network while not changing its output using a MILP. It therefore allows lossless compression of an existing network. Fischetti et al. [20] model a pertained DNN as a 0-1 Mixed Integer Linear Program (0-1 MILP) where the continuous variables correspond to the output values of each unit and a binary variable is associated with each rectified linear unit (ReLU). The paper discusses the peculiarity of such 0-1 MILP models and presents a bound-tightening technique to ease its optimization, e.g., for feature visualization and in the construction of adversarial examples. Grimstad et al. [23] consider the embedding of piecewise-linear deep neural networks (ReLU networks) as surrogate models in mixed integer linear programming (MILP) problems. Similar to other works, the formulation is obtained by programming each ReLU operator with a binary variable and applying the big-M method. As the efficiency of the formulation hinges on the tightness of the bounds defined by the big-M values, the presence of output bounds can be exploited in bound tightening. To this end, the authors present several bound tightening procedures that consider both input and output bounds. In Schweidtmann et al. [61], the training of neural networks and machine learning (ML) models are regarded as optimization problem where model parameters are varied to minimize the model error on the training data. Then, a subsequent optimization problem can be solved to identify the experimental conditions (i.e., inputs of the ML model) that maximize the experimental performance (i.e., the output of the ML model). This task is a subsequent optimization problem where the optimization contains an already trained ML model. In [66], the authors Tjandraatmadja et al. improve the effectiveness of propagation- and linear-optimization-based neural network verification algorithms by proposing a tightened convex relaxation for ReLU neurons. Based on this relaxation, the authors present two polynomial-time algorithms for neural network verification.

2.4 Software

In the recent years, several highly efficient software packages for optimization and machine learning have been proposed. The software package JANOS, proposed by Bergmann et al. [10], is a framework for optimization of variables under constraints. It supports linear regression, logistic regression and neural networks with rectified linear activation functions. The optimizer can only act on already trained neural networks. The optimization and machine learning toolkit (OMLT) introduced by Ceccon et al. in [14] is an open-source software package which also works for already trained neural network and gradient-boosted tree surrogate models. This package allows to make use of these models for larger optimization problems with applications, e.g., on maximizing a neural acquisition function or verifying neural networks. With reluMip, see Lueg et al. [44], it is possible to generate MILP models of trained artificial neural networks (ANNs) using the rectified linear unit (ReLU) activation function. At the moment, only TensorFlow sequential models are supported. OptiCL, see Maragno et al. [45], is an end-to-end framework for mixed integer optimization (MIO) with data-driven learned constraints. It establishes a methodological foundation for mixed integer optimization with learned constraints. The software proposes an end-to-end pipeline for data-driven decision making in which constraints and objectives are directly learned from data using machine learning (e.g., multi-layer perceptrons). Afterwards, the trained models are embedded in an optimization formulation. Finally it allows the capture of various underlying relationships between decisions, contextual variables, and outcomes. Scikit-learn [57] is a well known Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised ML problems. It supports tools for classification, regression, clustering, dimensionality reduction, model selection and preprocessing. It is based on standard methods of optimization, e.g., back-propagation-based neural network training based on autodiff.

Note, that most of these packages focus on the user interface and the optimization pipelines. These packages mainly support pre-trained networks and MILPs and do not focus on resource optimized models, as proposed in this work.

2.5 Neural Networks and MILP

The formulation of a linear threshold unit (LTU), which is similar to the perceptron described above, as an LP has been presented by Mangasarian already in 1993 [47]. Based on this formulation, several LTUs are iteratively connected using a so-called multisurface method tree. A significant difference to this work is, that the complete network as one MILP is optimized. The proposed model is not optimized in an iterative fashion. The special case of binary neural networks using MILP has been presented by Icarte et al. in [33]. Interestingly, binarized neural networks (BNNs) which are neural networks with weights and activations in \((-1,1)\) not only can gain comparable test performance to standard neural networks, but allow for highly efficient implementations on resource limited systems as shown by Hubara et al. in 2016 [32]. The work by Bienstock et al. [11] establishes a general framework to reformulate (regularized) empirical risk minimization problems in machine learning into approximate linear programs with explicit bounds on their complexity. The work of Kronqvist et al. [38] presents relaxations in between the big-M and convex hull formulations of disjunctions and draws advantages from both. The authors propose so-called P-split formulations to split convex additively separable constraints into P partitions and form the convex hull of the partitioned disjuncts. Experiments are performed on K-means clustering and neural networks with ReLU activation function. As the intermediate P-split formulations can form strong outer approximations of the convex hull with fewer variables and constraints than the extended convex hull formulations it can give significant computational advantages. Lu et al. [43] analyze the expressive power of neural networks which is important for understanding deep learning. The authors study how width affects the expressiveness of neural networks. Based on the knowledge that depth-bounded (e.g., depth-2) networks with suitable activation functions are universal approximators, the authors show a universal approximation theorem for width-bounded ReLU networks. Already width-(\(n + 4\)) ReLU networks, where n is the input dimension, are universal approximators. In [65] Thorbjarnarson et al. use a mixed integer linear program to optimize neural network weights. There, also the optimization on the amount of neurons has been presented (and named as model compression). In contrast, this work additionally optimizes the sparsity on model weights, integer weights and the amount of neural activations.

There are two arguments why a MILP solution may be disadvantageous, as stated by Gambella et al. [22] (a) scaling to large datasets is problematic, as the size of the generated equations scale with the size of the training set and (b) solutions with provable optimal training error can overfit, thus training data is perfectly explained, but test performance is much worse. This paper focusses on the second challenge by the proposed sparsity constraints and resource optimization. The first challenge is confirmed in the experiments and summarized, e.g., in Table 5.

2.6 Feature Selection

Feature selection, see e.g., Guyon et al. [25] involves the task of measuring contributions of individual input features to the performance of a supervised learning task. It is an important tool (among others) for explainable/interpretable AI see Wojciech et al. [67]. An example application for feature selection is to understand from genomics data which genes or gene combinations are likely to cause a specific disease.

Feature selection has a reasonably long history with several competitions, such as the NIPS feature selection challenge, see Guyon et al. [27]. At this time, Bayesian networks and decision trees produced state-of-the art results. Due to the black-box behavior of neural networks, teaching a deep neural network to optimize on relevant features for decision making is quite challenging, see Romero et al. [59]. The work by Wojtas et al. [68] proposes to address feature importance ranking by using a dual-net architecture consisting of an operator and selector network. Ye et al. [69] propose collaborative feature selection based on an intermediate representation and subspace projection in the context of distributed learning. In [34] Ji et al. propose a particle swarm-based optimization for feature selection. The work from Ayindel et al. [5] proposes networks with non-negative weights and demonstrates increased sparsity and competitive performance on many tasks.

In contrast to existing works, the proposed MILP formulation can easily respect constraints on sparsity. Additional constraints can further be used to enforce positive or even binary weights. Therefore, the proposed MILP formulation for optimizing weights of a neural network will allow an efficient feature selection during model optimization as shown in the experiments.

2.7 Resource Optimized Neural Networks

Optimizing neural network architectures means to optimize in a huge search space. Common approaches are based on reinforcement learning, structural search or performance prediction, as presented by Baker et.al, Cai et al. or Negrinho et al. [7, 12, 54]. Neural Architecture Search (NAS) is another common framework which automates the task of designing a neural network, especially the topology and the amount of used neurons across different layers, see Han et al. or Liu et al. [29, 40]. Still, all architecture search algorithms are computationally demanding despite their remarkable improvements over the last years. In this work a so-called kill-switch on neurons is proposed (see Fig. 4). Additionally, minimization on the neural activations is proposed. Both modifications fit well to the challenge of resource optimized neural networks [6, 9, 39, 53], as the resulting MILP formulation allows for optimizing a model to enforce sparsity on activations and additionally to minimize the amount of required neurons. As mentioned before, the constraints for minimizing the amount of neurons have also been proposed by Thorbjarnarson et al. [65] where a MILP is used for joint network training and compression, but only in combination with sparseness constraints on weights or activations (see also [19]) is the full potential exploited.

3 Modeling a Rosenblatt Perceptron as MILP

The Rosenblatt perceptron requires the computation of a scalar product, followed by a step function. This can also be expressed as an if-else condition and greater as, >, test. The formulations can be expressed as MILP using a so-called Big-M formulation. Note, that the letter M refers to a (sufficiently) large number associated with the artificial variables, see Paul [56].

The code fragment

if ( \(\delta \) ) then c=d else c=e

can be expressed in an MILP as follows:

$$\begin{aligned} d-M(1 - \delta )&\le c \le&d + M (1 - \delta ) \end{aligned}$$
$$\begin{aligned} e - M \delta&\le c \le&e + M \delta \end{aligned}$$
$$\begin{aligned} \delta\in & {} \{0,1\}. \end{aligned}$$

The verification of this expression is quite simple:

$$\begin{aligned} \delta =1 \rightarrow \hspace{0.3cm} d-M(1-\delta )&\le c \le&d + M (1 - \delta ) \\ \leftrightarrow d&\le c \le&d \\ \rightarrow c= & {} d \\ \text {and} \hspace{0.3cm} e - M \delta&\le c \le&e + M \delta \hspace{0.2cm} (\text {trivial}) \\ \delta =0 \rightarrow \hspace{0.3cm} d-M(1-\delta )&\le c \le&d + M (1 - \delta ) \\ \leftrightarrow d-M&\le c \le&d + M \hspace{0.2cm}(\text {trivial})\\ \text {and} \hspace{0.3cm} e - M \delta&\le c \le&e + M \delta \\ \leftrightarrow e&\le c \le&e \\ \rightarrow c= & {} e. \end{aligned}$$

The condition

a >b \(\leftrightarrow \delta \) =1

can be expressed as

$$\begin{aligned} a\ge & {} b+ \epsilon - M (1 - \delta ) \end{aligned}$$
$$\begin{aligned} a\le & {} b + M \delta \end{aligned}$$
$$\begin{aligned} \delta\in & {} \{0,1\}. \end{aligned}$$

The verification of this expression can be done in a similar fashion to equations (3)-(5). Based on equation (1), both expressions can be put together for modeling the activation of a perceptron with weights \(\omega \), bias b and input x as

$$\begin{aligned} \texttt{if} \left( \left( \omega ^T x + b \right) > 0 \right)&\texttt{then}&\mathtt{1\, else\, 0}. \end{aligned}$$

Please note, that the outcome is a binary vector which is later denoted as \(\text {st}_i\). It can be interpreted as an input impulse of a transistor which is switching on/off a memorized value as input to the next layer, as visualized in Fig. 1.

Fig. 1
figure 1

Visualization of the neural network using separate building blocks of scalar products, summation operations, step functions and switches. This example network consists of two input values \((X_1, X_2)\), three hidden neurons (marked as a yellow box) with weight matrix \((\omega _{i,j})\) and bias \(b_i\) to generate the scalar product value \(sp_i\) followed by a step function \(\text {st}_i\). For the next layer, three switches pass over memorized values to a final layer with (in this case) two summation and step function operations for a 2-dimensional binary decision output

Thus, for an output layer, the matrix vector product of the weights with the input reduces to an addition operation of memorized values so that the next perceptron can be expressed as MILP as well,

$$\begin{aligned} \texttt{if}\left( \left( \sum _{\text {st}_i=1} q_i+b_F \right) > 0 \right)&\texttt{then}&\mathtt{1\, else\, 0}. \end{aligned}$$

Please note, that such a perceptron can be interpreted as a neuron which performs neural memory allocation based on memorized values (\(q_i\)), see Kastellakis et al. [35].

As classification loss function for the MILP formulated neural network, the xor-error of the network outcome to each dimension of the one-hot encoded target vector is used. The xor-function can be modeled as MILP using

$$\begin{aligned} z\le & {} x+y \end{aligned}$$
$$\begin{aligned} z\ge & {} x-y \end{aligned}$$
$$\begin{aligned} z\ge & {} y-x \end{aligned}$$
$$\begin{aligned} z\le & {} 2-x-y \end{aligned}$$
$$\begin{aligned} x,y,z\in & {} \{0,1\}. \end{aligned}$$

Taking the input and output pair of the vector x and the one-hot encoded target y, a set of equations simulating the forward path for every training example is generated.

3.1 A Neural Network as MILP

Figure 1 visualizes the different parts required to formulate a training sample for a classification task as constraints for an MILP: The input is a m-dimensional input vector \(X=(X_1, X_2, \ldots X_m )^T\) followed by n hidden neurons and a K-dimensional output. In the example illustration in Fig. 1, the input dimension m is set to two, the amount of neurons in the hidden layer n is set to three and the output dimension K is also set to two. The connection of the input vector to the hidden layer is a matrix–vector product involving the weights \(\omega _{ij}\). The index i indicates the number of the neuron and j the vector dimension. Then the bias values \(b_i\) for each neuron are added to the product and stored in the three slack variables \(sp_i\). Then the step function is evaluated and the 0/1 values stored in the slack variables \(\text {st}_i\). Afterwards, the weights \(q_{i,k}\) and the values of the step function \(\text {st}_i\) are evaluated and stored in a slack variable \(p_{i,k}\). This can be seen as the input for the next layer, which is already the K-dimensional output layer. In each output perceptron, the sum of \(p_{i,k}\) is computed, the bias \(b_{F,k}\) added and stored in \(sp_{F,k}\). Then a step function is evaluated and stored in the variable \(\text {st}_{F,k}\) and finally the xor-value (\(E_r\)) to the ground truth (GT) is computed for each output value as loss.

Whereas the learning parameters \(\omega _{ij}\), the bias values \(b_i\), \(b_{F,k}\) and the weights \(q_{i,k}\) are simultaneously accessed for all training examples, the slack variables are unique for each training example used during optimization. The above steps generate a set of equality and inequality constraints, as well as integer conditions (for the step functions) which can be more concretely formalized by using the above MILP-formulations:

$$\begin{aligned} \min f^Tx&s.t.&\end{aligned}$$
$$\begin{aligned} {\text {Scalarproduct}}{} & {} \nonumber \\ \forall i=1 \ldots n \hspace{0.5cm} (17-23){} & {} \nonumber \\ \sum _{j=1}^{m} X_j \omega _{ij}+b_i -sp_i= & {} 0 \end{aligned}$$
$$\begin{aligned} {\text {boolean var for step function}}{} & {} \nonumber \\ \epsilon -M(1-\text {st}_i) - sp_i\le & {} 0 \end{aligned}$$
$$\begin{aligned} sp_i - M \text {st}_i\le & {} 0 \end{aligned}$$
$$\begin{aligned} (\forall k=1 \ldots K){} & {} \nonumber \\ {\text {if-else condition}}{} & {} \nonumber \\ q_{i,k}-M(1-\text {st}_i)-p_{i,k}\le & {} 0 \end{aligned}$$
$$\begin{aligned} p_{i,k}-q_{i,k}-M(1-\text {st}_i)\le & {} 0 \end{aligned}$$
$$\begin{aligned} -M \text {st}_i-p_{i,k}\le & {} 0 \end{aligned}$$
$$\begin{aligned} p_{i,k} -M \text {st}_i\le & {} 0 \end{aligned}$$
$$\begin{aligned} {\text {summation 2nd layer}}{} & {} \nonumber \\ \sum _{i=1}^{n} p_{i,k} + b_{F,k} -sp_{F,k}= & {} 0 \end{aligned}$$
$$\begin{aligned} {\text {Step Function (classification task)}}{} & {} \nonumber \\ \epsilon -M(1-\text {st}_{F,k}) - sp_{F,k}\le & {} 0 \end{aligned}$$
$$\begin{aligned} sp_{F,k} - M \text {st}_{F,k}\le & {} 0 \end{aligned}$$
$$\begin{aligned} {\text {xor (classification task)}}{} & {} \nonumber \\ E_{r,k} - \text {st}_{F,k} - \text {GT}_k\le & {} 0 \end{aligned}$$
$$\begin{aligned} -E_{r,k} + \text {st}_{F,k}- \text {GT}_k\le & {} 0 \end{aligned}$$
$$\begin{aligned} -E_{r,k} + \text {GT}_k -\text {st}_{F,k}\le & {} 0 \end{aligned}$$
$$\begin{aligned} E_{r,k} - 2+\text {st}_{F,k}+ \text {GT}_k\le & {} 0 \end{aligned}$$
$$\begin{aligned} {\text {classification integer conditions}}{} & {} \nonumber \\ \text {GT}_k, E_{r,k}, \text {st}_{F,k}, \text {st}_i \in \{0,1\}.{} & {} \end{aligned}$$

All variables can be ordered as a large vector and accessed via an index-operation \(\text {ind}(.)\). The equations lead to sparse vector expressions and describe an integer linear program which can be optimized with standard tools, e.g., based on Gurobi [24].

Fig. 2
figure 2

Visualization of the target vector f which encodes the xor-error and the equality and inequality constraints for the respective network weights and slack variables for the training data

Fig. 3
figure 3

Size of the equality and inequality constraint matrices. The x-axis gives the number of sample points (training data). The five curves show the increase for increasing numbers of neurons (from 1 to 5). The blue curve shows the size of the matrices, whereas the red curve displays the memory consumption for the variables. Even though the size is increasing nonlinearly, due to the use of sparse memory representations for the matrices, the increase in memory is nearly linear. (Best viewed in color and zoom in)

Figure 2 summarizes the resulting structure of the target vector f, the network parameters and the slack variables for each training sample on a classification task (using the xor-error). Minimizing the objective function f means to minimize the xor-error for all training examples and therefore the estimation of appropriate neural network weights. Thus, f is a vector containing only 0-values, except for the index holding the error \(E_{r,k}\), e.g., \(f(\text {ind}(E_{r,k}))=1\). After optimization, the weights can be assembled to a neural network for forward inference on test data.

Figure 3 shows the size of the equality and inequality constraint matrices. The x-axis gives the number of sample points (training data). The five curves show the increase for increasing numbers of neurons (from 1 to 5). In this experiment, the dimension of the input and output is kept constant (to the value 2). These parameters also add up to the behavior of the curves for more complex settings. The blue curve displays the size of the matrices, whereas the red curve shows the memory consumption for the variables. Even though the size is increasing polynomially, due to the use of sparse memory representations for matrices, the increase in memory is nearly linear. Note, that the amount of training data and the required slack variables can be a severe memory limitation for large sized problems. In the experiments, see Table 5, already smaller sized datasets lead to optimization problems of noticeable dimension, e.g., the ovarian dataset of dimension \(100 \times 4000\) results for a neural network with up to 8 neurons in an inequality matrix A of size \(73.148 \times 68.172\). Thus, it is mandatory to use sparse memory representations for both, the problem encoding and the optimization. Please also note, that it is also possible to generate deeper fully connected networks with this approach. As the equations are getting even larger, this will be investigated as part of future work.

Fig. 4
figure 4

Visualization of the different blocks to enforce sparsity: Sparsity can be enforced for the network weights \(\omega _{i,j}\), the amount of activations can be minimized \(\text {st}_{i}\) or neurons can be completely switched off by adding a switch \(\text {KS}_{i}\)

3.2 Sparsity Constraints

As already discussed by Gambella et al. [22], a globally optimized model explaining the training data can perform bad on test data, due to overfitting. Thus, it is important to reduce the complexity of the model by enforcing sparsity on different involved components: Fig. 4 visualizes different parts of the neurons which can be modified together with the objective function to enforce sparsity: (1) Minimizing the weights \(\omega _{i,j}\) and enforcing small, but positive integer values allows to optimize integer or binary networks. It requires to add \(\omega _{i,j}\) to the objective function f and to enforce positive integer numbers on these weights. Additionally it is possible to limit the amount of weighting factors by adding an upper limit on the sum of weights \(N_{\max }\). This allows, e.g., to enforce sparse binary weight matrices. The sparse matrices finally allow a semantic interpretation about relevant features for decision making. To balance the constraints and the overall error of the loss function, the scaling factor is empirically set to the value 0.1 for the experiments. This ensures the main focus is on the minimization of the classification error. Based on the earlier MILP, the following additional constraints can be added to enforce this property of sparse positive weight matrices:

$$\begin{aligned} f(\text {ind}(\omega _{i,j}))= & {} 0.1 \end{aligned}$$
$$\begin{aligned} \sum {\omega _{i,j}}\le & {} N_{\max } \end{aligned}$$
$$\begin{aligned} \omega _{i,j}\in & {} {\mathbb {N}}^+. \end{aligned}$$

As mentioned above, all variables used are assumed to be ordered and their positions are accessible via an index-operation \(\text {ind}(.)\).

(2) The second option to minimize the model is the kill switch (\(\text {KS}_i\)) which allows the minimization of the amount of neurons used. It is a simple if-else condition on the evaluated step function:

$$\begin{aligned} \text {st}_i-M(1-\text {KS}_i)-p_i\le & {} 0 \end{aligned}$$
$$\begin{aligned} p_i-\text {st}_i-M(1-\text {KS}_i)\le & {} 0 \end{aligned}$$
$$\begin{aligned} -M \text {st}_i-p_i\le & {} 0 \end{aligned}$$
$$\begin{aligned} p_i -M \text {KS}_i\le & {} 0 \end{aligned}$$
$$\begin{aligned} \text {KS}_i\in & {} \{0,1\}. \end{aligned}$$

The optimization function f needs to be modified with \(f(\text {ind}(\text {KS}_i))=0.1\).

This variant for optimization has also been described by Thorbjarnarson et al. in [65] and called model compression.

(3) The final option for sparsity is minimizing the amount of activations (at the step functions \(\text {st}_i\)). It allows to minimize the neural effort for decision making. This is valid under the assumption, that the generation of a nervous activation is accompanied with a certain amount of energy. For this, the f-vector of the objective function needs to be modified by

$$\begin{aligned} f(\text {ind}(\text {st}_{i}))= & {} 0.1. \end{aligned}$$

Thus, for example the sparsity of neural activations can be enforced by adding a small positive value to the objection function f at the positions of the evaluated step-functions on the if-else conditions \(\text {st}_i\).

Even though the modifications appear simple and redundant, they have a major impact on the overall performance of the optimized models. Note, that the constraints are not exclusive, thus they can also be jointly optimized if desired. Even though they are simple to express as an MILP, they are very hard to optimize using differentiable programming. The effects of these different combinations of sparsity constraints are evaluated in detail in Section 4.3. The weighting factors (0.1) have been empirically chosen with the intention to have the main emphasis on the explanation of the training data. The optimization of the sparseness is important, but a subordinate task. The balancing factors across the sparsity variants can also be jointly optimized. This could be achieved, e.g., by using Bayesian optimization [62]. But it requires one to optimize and evaluate several MILPs, which can be very time consuming.

4 Experiments

Data loading and MILP formulation is implemented in MATLAB [48] and the integer linear program is optimized using Gurobi, see [24]. For this work, MATLAB version 2021b and the Gurobi version 9.1.2 build v.9.1.2rc0 (linux64) have been used. The optimization parameters for the MILP (presolving, rounding, diving, etc.) have not been changed throughout all experiments (default parameters are used). Thus, the same optimization problem leads to the same result, unless heuristics cause a different branching behavior. The optimizer stops once a global optimal solution is obtained or the maximum amount of computation time has been reached. This usually leads to a feasible solution which is accepted for evaluation. All results have been achieved in a Linux environment on Xeon Gold 5120 CPUs with 10 cores allocated for parallel computing. More recent software versions and better hardware will lead to a significant increase in performance. The experiments are divided in four parts. At first some results using standard datasets (e.g., from the UCI repository) and on synthetic data are presented. Then experiments on the feature selection challenge from NIPS 2003, which is available (at the time of writing) at CodaLab.org, are shown. Afterwards, detailed experiments on the proposed sparseness constraints are presented. Overall, experiments on more than 15 different datasets, including recent microarray datasets, are performed to cover a large variability of decision tasks.

4.1 Proof of Concept MILP

Figure 5 shows the training and testdata for a simple corner and spiral dataset. The respective networks take a 2-dimensional input and classify a binary output. The left example is based on two perceptrons in the hidden layer. Each perceptron learns a decision plane needed to correctly classify the upper left corner. The spiral dataset also shows that decision planes are learned, but here more planes are necessary to get a suited model after MILP optimization. Overall, this experiment shows, that the neural network formulated as an MILP can be optimized and that the weights of the shallow neural network learn effective decision planes to solve the optimization task.

Fig. 5
figure 5

Left: Training data and test for a simple corner dataset. Right: Training data and test for a spiral dataset

4.2 Feature Selection Task

A useful property of the MILP formulation consists of the option to formulate additional constraints on the optimized weights. One constraint can be the restriction of the weights to binary values and a predefined maximal amount of features which are allowed to be used, see Section 3.2. A typical setting is the analysis of important features from highdimensional data for decision making, as it often occurs in medical applications. The goal is to answer the question, which biological factors can be important for medical diagnostics or treatment.

Fig. 6
figure 6

Left: Features of the first hidden layer neuron after optimizing a shallow neural network. Right: Features of the MILP optimized neural network with sparsity constraints. Both networks are trained on 50% of the ovarian dataset and achieve an accuracy of 95% on the test set (best viewed in color and zoom in). The network on the right side is interpretable as selected factors are used for decision making

For the next experiment the ovarian dataset [16] with 4000 features has been used. It consists of 200 examples, where 100 randomly selected examples are used for training and 100 for testing. Figure 6 shows the optimized weights of one standard perceptron of a shallow neural network which has been optimized using differential back propagation. As can be seen, many features are combined and it is not possible to tell which feature is relevant for decision making. Indeed, the risk for overfitting is very high. The image on the right shows the binary selected features of a globally optimized sparse neuron. This is much easier to interpret. Overall, the dataset is benign, both models achieve an accuracy of approx. \(95\% \) on the test set, but the overall quality heavily depends on the randomly selected samples. Please note, that the sparsity is crucial here, as shown in Table 3.

In 2003, NIPS hosted the feature selection challenge [26]. It consisted of five binary classification problems which are further explained and discussed in [28]. The challenge was the classification of high dimensional data with a limited amount of training data being available. Please note, that the challenge is still available at CodaLab.Footnote 2 Table 1 summarizes the dataset and its intrinsic properties.

Table 1 Summary of the neurips feature selection benchmark dataset

As the benchmark is accessible at CodaLab, the MILP optimized neural networks have been used to participate in there. The proposed approach is currently listed on place 8, as shown in Table 2. For comparison, variants based on SVMs, decision trees and a random forest have also been evaluated and uploaded on the codalab server. Our proposed framework performs superior to these classical methods. Note, that the approach is in the top10 of this well established and frequently used dataset. Additionally, it is an outcome without any bells-and-whistles.

Table 2 Results of the proposed MILP net on the CodaLab benchmark and a comparison to three reference versions based on a random forest, svm and decision tree (accessed June 2022)

Please note, that this (automated) benchmark does only evaluate the accuracy on the test data. Thus, the initial goal of finding sparse solutions is not reflected in these scores. This would require one to balance and evaluate both aspects of sparsity and accuracy for different ML models which is a challenging task by itself. Therefore, this experiment only shows the possibility of optimizing non-differentiable neural network on various types of data. Table 2 shows the entry of the results obtained during the participation (with the username LP2NN). The MILP optimized neural networks are all very shallow, with up to 10 neurons. Tests with deeper nets and more neurons mainly increased the memory consumption and computation times without causing major changes in the scores. Better computational resources or more advanced optimization strategies may be able to change this in the future.

4.3 Sparse Models

In Section 3.2, three variants for enforcing sparsity on the model are proposed. Simple constraints in the MILP allow one to (a) reduce the entries of the weight matrices, (b) minimize the amount of neural activations and (c) reduce the amount of required neurons during training. A motivation is, that the complexity of a given optimization problem, and with this the amount of required neurons, is hard to estimate. One effect is, that performance is unsatisfactory without enough neurons, which could model the given task; or overfitting is observed, which means the model can fully explain the training data, but generalization fails.

Table 3 Evaluation of combining sparsity constraints during model optimization on the wine, breastEW, ovarian and zoo dataset: the letters b, c, d stand for the different constraints to minimize (b) the weights, (c) the amount of activations and (d) the amount of neurons

For the experiments, the classical wine, zoo, breastEW and ovarian dataset have been used. The first two datasets are multicriterial classification tasks, with 3 categories for the wine dataset and 7 categories for the zoo dataset. The ovarian and breastEW dataset provides a binary test. For this experiment 50% of the data is used for training and 50% for testing and then each experiment is repeated ten times to analyze the obtained performance. For this experiment, the amount of neurons is set to the constant number 8. In Section 3.2 three variants to enforce sparsity are described, (1) by minimizing the weights, (2) by minimizing the amount of activations and (3) by minimizing the amount of neurons. All factors can be individually and jointly used in the MILP. Thus, 8 combinations are possible. The impact can be measured by several factors, (A) the accuracy on a test set (B) the amount of optimized weights (C) the amount of neuron activations and (D) the amount of neurons used. Table 3 summarizes the effect of switching off and on these individual factors. The numbers are the mean scores from 10 repetitions. For the accuracy, also the standard deviation of the results is shown. Note, that the absolute numbers are not that relevant, as they depend on the size of the input vectors and the size of the network. Only the relative changes are important. It is clearly visible, that the constraints are working appropriately and they allow a large reduction in the complexity of the neural network, while maintaining (or even improving) the test accuracy. The color code in Table 3 indicates the respective sparsity constraint which has been activated during optimization. As the flag b indicates to enforce sparsity on the amount of neurons, the values in column #\(\omega \) should significantly decrease when this bin is active. The same holds for flag c which penalizes the amount of activations #act and flag d which penalizes the amount of used neurons #neur. Note, that the factors can also be individually weighted, but for this experiment they were kept constant with a factor of 0.1 as mentioned in Section 3.2. It is also worth to notice that without the proposed sparsity constraints (where the flags are set to 0, 0, 0), the test performance is not competitive. The variant 0, 0, 1 corresponds to the model proposed by Thorbjarnarson et al. [65]. The results are inline with the comments of Gambella, et al [22], where it is shown that only minimizing the objective function (and thus to explain the training data) can lead to suboptimal generalization performance. Only with addition of the constraints, does suitable generalization occur. Overall, it also shows that many neural network models which are currently trained are over parameterized.

The numbers in Table 3 indicate that the sparsity on the network weights is the most important ingredient to improve generalization. But still, the optimization of the amount of activations and the minimization of the neurons are also important factors for an efficient and generalizing system. The row int denotes the performance of a discrete neural network with integer input and integer weights. To obtain integer input values each input dimension is scaled to [0, 1], multiplied by 10 and rounded. The optimizer is furthermore only optimizing integer weights for the weight matrices and bias vectors. Thus, together with the optimized step function, a network model is optimized which can work highly efficient on resource limited systems. Please note, that optimizing integer weights, and at the same time minimizing three other aspects (number of neurons, number of weights, and number of activations) is a highly challenging task for differentiable programming.

Table 4 summarizes the mean as well as the minimum and maximum computation times in seconds. The maximum is set to 12000 s. The computation time mainly depends on the complexity of the task with the size of input and output dimensions and the amount of examples, as well as the amount of required integer variables. The variance in the computation time can vary to quite some extent, but also after reaching the time limit, feasible solutions are obtained.

Table 5 summarizes the complexity of the obtained MILPs for different datasets. The column Train summarizes the dimensions of the training data in terms of #\(examples \times \) #dimensions, the dimensions of the obtained objective function f, the inequality constraints A and equality constraints Aeq. The last column Encode shows the time to generate the equations for the MILP solver in seconds. Note, that the equations are only generated once before the optimization. As all data is optimized simultaneously there are no iterations involved. Note, that a comparison of Tables 4 and 5 reveals that the dimensionality of the training data does not directly correlate to the required amount of time for optimization. Whereas the ovarian dataset is of reasonable size and converges after 2654 s for the binary flags (1, 1, 1), the optimization of the much smaller wine dataset takes up the maximum amount of time (12.000 sec). Even though after 12.000 s the performance gap with 0.1% is very small and the feasible solution of good quality. But still, the algorithm has not fully converged up till then. With respect to the hyper-parameters of the MILP optimizer, Gurobi parameters such as Heuristics or MIPFocus can speed up finding feasible solutions, but proving optimality still takes its time.

Table 4 Mean and min/max computation times (in seconds) for different sparsity configurations on the wine, breastEW, ovarian and zoo dataset
Table 5 Size of the resulting MILP for the used datasets

As a final experiment for challenging state-of-the-art classification tasks we perform an analysis on three gene expression DNA microarray datasets and follow the general protocol as described by Liu et al. in [42]. The properties of the dataset used are summarized in the left of Table 6. As the LP easily finds perfect solutions to model the training data, sparsity is necessary. The results and a comparison to state of the art [42], where intrinsically sparse networks are optimized using evolutionary deep learning, as well as another commonly used approach published by Mocanu et al. [50], is shown in the right of Table 6. To demonstrate the impact of the multicriterial sparsity of network weights and activations, the method proposed by Thorbjarnarson et al. in [65], where only the amount of neurons is minimized, is also shown. The decreasing performance shows that it is crucial to not only optimize the amount of neurons, but also the activations and the amount of weights.

Table 6 Used microarray datasets and a comparison of the obtained results to alternative approaches

It is well known that these datasets seriously suffer from an extremely small number of samples which explains the high deviations among several solutions and makes it mandatory to enforce sparse solutions.

5 Summary and Discussion

Based on current developments in training NNs by using an exact optimization solver, this work presents an approach to optimize parameters of a Rosenblatt perceptron, and based on this the parameters of a neural network, as a MILP. The methodology followed in this paper is fundamentally different to differentiable programming and ensures a solution which is globally optimal (with respect to the loss function). The optimized weights can then be used to assemble a neural network for inference on unseen data. The parsing and definition of the MILP have been implemented in the MATLAB [48] environment and Gurobi [24] is used for optimization. In this paper, three constraints on sparsity are proposed to (a) minimize the network weights, to (b) minimize the amount of activations and to (c) minimize the amount of neurons. Additionally it is possible to optimize discrete models (only working on integer values). All constraints can be neatly formulated as additional constraints in the MILP and can even be jointly optimized. This allows sparsity and resource optimization of neural networks. Resource optimization involves the minimization of both, the amount of neurons, as well as the amount of activations required for decision making. An empirical comparison with the baseline formulation shows, that sparsity measures are important for generalization on unseen data.

The presented approach has one major drawback as mentioned earlier by Gambella et al. [22]: The computation time for optimization heavily increases with the amount of data and the amount of dimensions. Still, the used methodology can guarantee globally optimal results while being nearly parameter free. Especially, it is independent from random seeds, dropouts or other hyperparameters which can be crucial for high quality but are also time and energy demanding in differentiable programming. As part of basic and fundamental research, the used datasets are not in the same league of datasets used with large models and big data applications nowadays. Still, it is important to continue the research on alternative optimization strategies, especially on methods which can reach global optimality and methods that allow the imposition of conditions which are hard to optimize in differentiable programming.

Based on this work, several extensions are possible, ranging from modeling different layer types and operations such as convolutions, pooling or probabilistic models, see e.g., Capobianco et al. [13] or Awiszus et al. [4] to the optimization of the network topology, see Negrinho et al. [54]. It is also possible to generate deeper fully connected networks with the proposed approach. Here, an additional challenge is that the amount of equations and the amount of slack variables are heavily increasing. Recently, the use of graph neural networks for LP solving has been presented by Chen et al. [15]. Finally, only standard solvers are currently used. Intrinsic properties of the equality and inequality constraints should allow solving the system much faster, see Kronqvist et al. [38]. This will also allow to analyze and model deeper fully connected models as part of future work.