Optimization based on the minimum maximal k-partial-matching problem of finite states machines with input multiplexing

Finite State Machines with Input Multiplexing (FSMIMs) were proposed in previous work as a technique for efficiently mapping Finite State Machines (FSMs) into ROM memory. In this paper, we present new contributions to the optimization process involved in the implementation of FSMIMs in Field Programmable Gate Array (FPGA) devices. This process consists of two stages: (1) the simplification of the bank of input selectors of the FSMIM, and (2) the reduction of the depth of the ROM. This has a significant impact both on the number of used Look-Up Tables (LUTs) and on the number of Embedded Memory Blocks (EMBs) required by the ROM. For the first stage, we present two approaches to optimize FSMIM implementations based on the Minimum Maximal k-Partial Matching (MMKPM) problem: one of them applies the greedy algorithm for the MMKPM problem, and the other is based on a new multiobjective variant of the MMKPM problem and its corresponding Integer Linear Programming formulation. We also propose a modification of the second stage, in which the characteristics of EMBs are taken into account to improve implementation results. The new optimization process significantly reduces the number of used FPGA resources with respect to the previous one. In addition, the proposed approaches achieve an adequate trade-off between the usage of EMBs and LUTs with respect to conventional FSM implementations based on ROM and to those based on LUT.


Introduction
In digital design, the implementation of Finite State Machines (FSMs) has received attention from researchers for decades [1][2][3][4][5][6][7][8]. Optimizing FSM implementations in terms of area, speed or power consumption is essential to meet the design constraints demanded by applications. The inclusion of Embedded Memory Blocks (EMBs) in Field Programmable Gate Array (FPGA) devices has sparked a renewed interest in FSM implementations in the last years. Mapping the transition and output functions of a FSM into a ROM memory (the corresponding implementations will be referred to as conventional ROM-based FSM implementations) has become an interesting and useful alternative to conventional synthesis methods based on Look-Up Tables (LUTs). The advantages reported in the literature can be summarized as follows. First, exploiting unused EMBs to implement FSMs frees LUTs, which can be used for other purposes [9]. Second, speed improvements can be obtained if the function mapped on the ROM requires a considerable number of logic levels when mapped on LUTs [10,11]. Third, a significant reduction in power consumption can be obtained by mapping FSMs to EMBs and disabling them during idle states [12].
In recent years, different techniques to reduce the number of required EMBs by using a small number of LUTs have been proposed [4,9,11,13-15]. The EMB-based FSM implementations generated by these techniques make a more efficient use of the available FPGA resources, allowing an adequate trade-off between EMB and LUT usage. Most of these techniques use functional decomposition methods to decompose the transition function of the FSM into two subfunctions: one subfunction is implemented using a combinational element (called address modifier) whereas the other is implemented using a smaller memory component [4,9,15]. Other techniques apply structural decomposition methods to divide the logic block that represents the FSM into two or more subblocks [11,16]. Then, a subblock with fewer inputs than the original logic block is implemented using EMBs; the remaining subblocks are implemented using combinational elements.
In the context of EMB-based FSM implementations, FSM with Input Multiplexing (FSMIM) is a technique proposed in [13] to reduce the depth of the memory in which the FSM transitions are mapped. This technique included an optimization process and an architecture called FSMIM with transition-based input selection (FSMIM-T), which uses a multiplexer bank as address modifier. FSMIMs take advantage of the fact that state transitions usually involve many don't care inputs. The experiments reported in [10,14] show area and speed improvements with respect to conventional approaches in Xilinx FPGAs. That work was extended in [14] by including a new architecture, called FSMIM with state-based input selection (FSMIM-S), which allows further reductions in the number of EMBs at the expense of reducing the speed and increasing the number of LUTs. Both architectures are based on a combinational circuit, called Input Selector Bank (ISB), which selects a different subset of FSM inputs for each state, so that the depth of the memory can be reduced. The optimization process described in [14] consisted of two stages. In the first stage, the ISB is simplified, which reduces the number of LUTs it requires and, eventually, the delay it imposes. In the second stage, the depth of the memory is reduced beyond what the ISB would attain by itself. The FSMIM model has recently been used in research work by other authors [17,18].
In [19], we defined the Minimum Maximal k-Partial-Matching (MMKPM) problem, which models the problem involved in the simplification of the ISB. We proved that the MMKPM problem is NP-complete. We also proposed a greedy algorithm and an Integer Linear Programming (ILP) formulation for this problem. A performance study of these approaches using random bipartite graphs was presented; however, they were not applied to the generation of FSMIM implementations, and therefore no implementation detail was considered.
By contrast, in the development of this work, we have applied the MMKPM problem to the optimization of FSMIM implementations in FPGAs for the first time. This has allowed us to study the effect of the MMKPM on the whole implementation process (e.g., the influence of the optimization in the number of LUTs used by the ISB implementation or in the quality of the results of the subsequent stage). On the basis of the analysis of these preliminary results, we have proposed a new multiobjective variant of the MMKPM problem and its corresponding ILP formulation in order to obtain better implementation results.
The main contributions of this paper can be summarized as follows:
• A new multiobjective variant of the MMKPM problem and its corresponding ILP formulation are proposed.
• The greedy algorithm proposed in [19]
The rest of the paper is organized as follows. Section 2 presents a background of the conventional ROM-based implementations, the FSMIM architectures and the optimization process involved in the generation of FSMIM implementations. In Sect. 3, the new optimization process is presented. In Sect. 4, experimental results are discussed. Finally, concluding remarks and future work are presented in Sect. 5.

Conventional ROM-based implementation
Mealy and Moore machines can be mapped into memory using the architecture shown in Fig. 1a (a specific architecture can be used for Moore machines, but here we consider the most general architecture) [11]. As the EMBs available in current FPGAs are synchronous, we assume Mealy machines with synchronous outputs. The input signals and present state encoding bits form the address of the ROM, which stores the FSM outputs and the next state of each FSM transition. The next state is fed back to the address signal as the present state [20]. The number of bits of the ROM is
$$2^{m} \cdot |S| \cdot (n + p) \qquad (1)$$
where S is the set of states, m is the number of inputs, n is the number of outputs, and $p = \lceil \log_2 |S| \rceil$ is the number of state encoding bits. The exponential increase of the ROM size [see (1)] is a critical problem not only for area consumption but also for speed, which is degraded for the following two reasons: (1) the significant routing overhead when many

Finite state machine with input multiplexing
The main purpose of the FSMIM approach is to reduce the number of used EMBs with respect to conventional ROM-based FSM implementations by using a small number of LUTs. To that end, the technique tries to reduce the depth of the ROM memory by decreasing both m and |S| in (1). In addition to decreasing the memory requirement, the reduction of the depth can also decrease the delay imposed by the memory [20]. However, speed is not an objective of the proposed technique but only a potential benefit. The two FSMIM architectures are shown in Fig. 1b and c. The ROM address is composed of m effective inputs and the p encoding bits of the present group, where $p = \lceil \log_2 N \rceil$ for N groups. The aim of the FSMIM-T architecture (see Fig. 1b) is to reduce the memory requirement while keeping the ISB as simple as possible. This allows efficient implementations in terms of LUTs without degrading the speed. In this case, the ISB is implemented as a bank of multiplexers, which benefits greatly from the embedded multiplexers available in current FPGAs [21,22]. Another advantage of the bank of multiplexers is that it can be optimized at a high level because the complexity of a multiplexer is determined only by its number of inputs (i.e., it is not necessary to use the synthesis tool to evaluate its complexity, as in the case of general logic functions). The multiplexer bank is composed of m multiplexers called input selectors. For each state, each input selector selects one different effective input of the state from a different subset of FSM inputs. Therefore, each input selector is controlled by a different set of control bits (called selection bits), which are stored in the ROM. This allows the multiplexers to be simplified at the expense of increasing the ROM width with respect to the conventional ROM-based implementation. For each FSM transition, the ROM stores the FSM outputs, the encoding bits of the next group, and the selection bits.
The number of bits of the ROM of a FSMIM-T is
$$2^{m} \cdot N \cdot (n + p + r) \qquad (2)$$
where r is the number of selection bits, m is the number of effective inputs, N is the number of groups, and p is the number of group encoding bits. The aim of the FSMIM-S architecture (see Fig. 1c) is to implement FSMIMs using the minimum number of EMBs [14]. Each word of the ROM stores the next state and FSM outputs, as in the conventional ROM-based architecture. Therefore, the number of bits of the ROM of a FSMIM-S is
$$2^{m} \cdot N \cdot (n + \lceil \log_2 |S| \rceil) \qquad (3)$$
In this way, the depth of the ROM can be reduced without increasing its width (as occurs in the FSMIM-T architecture). However, the number of used LUTs with respect to the FSMIM-T architecture increases for two reasons. First, input selectors are not multiplexers controlled by selection bits but logic functions that select the FSM inputs from the present state encoding bits; so, the ISB implementation is more complex than the corresponding one in the FSMIM-T architecture, which uses the minimum possible number of selection bits. Second, a new combinational component, called group encoder (GE), is required to map the present state (coded by $\lceil \log_2 |S| \rceil$ bits) to the present group (coded by $\lceil \log_2 N \rceil$ bits). FSMIM-S implementations are usually slower than FSMIM-T implementations due to the use of the GE and a more complex ISB [14]. In conclusion, the FSMIM-T architecture is suitable for designs in which it is critical to reduce the number of LUTs, whereas FSMIM-S is more appropriate when the number of EMBs is limited. When the speed must be taken into account, FSMIM-T implementations offer advantages over FSMIM-S ones.

Generation of FSMIMs from FSMs
The optimization process presented in [14] is used to transform a FSM into a FSMIM whose implementation is efficient in terms of EMB and LUT usage, which can also have a positive impact on speed. This process is independent of the FSMIM architecture used. The optimization problems involved in the process can be described in terms of the Input Selection Matrix (ISM) [14], in which rows contain the effective inputs of each state, and columns contain the FSM inputs connected to each input selector. Let $S = \{s_1, s_2, \ldots, s_q\}$ and $X = \{x_1, x_2, \ldots, x_m\}$ be the set of states and the set of FSM inputs, respectively. Let us define an ISM as a matrix $A = (a_{ij}) \in M_{q \times m}$ where $a_{ij} \in X \cup \{0, 1, -\}$ represents the input (a FSM input, a constant value or a don't care) selected for the state $s_i$ by the j-th input selector. For clarity, the ISM includes an additional column for the state corresponding to each row. Fig. 2a shows two examples of ISMs obtained from the same FSM.
The optimization process consists of the two following stages: (1) Input Selectors Simplification (ISS), which simplifies the ISB, and (2) State Grouping (SG), which reduces the ROM depth by merging states. First, the ISS stage is applied to the ISM created from the given FSM. Then, the SG stage is applied to the resultant ISM, generating the ISM that determines the final FSMIM implementation.

Input selectors simplification
The complexity of the ISB is measured as the sum of the number of inputs of each input selector, that is, the sum of the number of different inputs in each ISM column (we will refer to this measure as selection cost). The main goal of the ISS stage is to reduce the selection cost by minimizing the number of inputs of each input selector. In general, input selectors with a smaller number of inputs require a smaller number of LUTs. Since the ISB is usually in the critical path (this is always true for FSMIM-T implementations), the speed of FSMIM implementations can increase if the ISS stage reduces the delay of the ISB. The complexity of a multiplexer depends only on its number of inputs. Therefore, in the FSMIM-T architecture, there is a direct relation between the selection cost and the complexity of the ISB, and the minimization of the selection cost has a strong influence on the reduction of the number of LUTs required by the ISB. However, in the FSMIM-S architecture, the complexity of the ISB depends not only on the number of inputs but also on the complexity of the logic function that the ISB defines. Therefore, the minimization of the selection cost has less influence on the reduction of the number of used LUTs.
Given an ISM, the ISS stage permutes the elements of each row to reduce the number of different inputs in each column (i.e., the number of inputs of each input selector of the ISB). We refer to each feasible solution of the ISS stage as a permutation of the ISM. For example, $A_{ISS}$ is a permutation of A (see Fig. 2a) that reduces the number of inputs of the input selectors from 5 to 3, from 3 to 2, and from 2 to 1.
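As an illustration of how the selection cost reacts to a row permutation, the following minimal Python sketch counts the distinct entries of each column. The matrices are hypothetical (they are not the ISMs of Fig. 2a), and the names `DONT_CARE`, `column_inputs` and `selection_cost` are introduced here only for the example.

```python
# Hypothetical ISM: one row per state, one column per input selector.
# "-" marks a don't care entry; every other entry (an FSM input or a
# constant) counts as one input of the corresponding input selector.
DONT_CARE = "-"

def column_inputs(ism, col):
    """Return the set of distinct non-don't-care entries of a column."""
    return {row[col] for row in ism if row[col] != DONT_CARE}

def selection_cost(ism):
    """Sum, over all columns, of the number of distinct inputs."""
    return sum(len(column_inputs(ism, c)) for c in range(len(ism[0])))

ism = [
    ["x1", "x2", "-"],
    ["x2", "x1", "-"],
    ["x3", "-", "-"],
]
# Same rows, but the second row has been permuted.
permuted = [
    ["x1", "x2", "-"],
    ["x1", "x2", "-"],
    ["x3", "-", "-"],
]

print(selection_cost(ism), selection_cost(permuted))  # prints "5 3"
```

In this toy case, swapping two entries of a single row reduces the selectors from 3 and 2 inputs to 2 and 1 inputs.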
In [14], we proposed an algorithm to simplify the ISB. This algorithm iteratively solves instances of a variant of the classical Knapsack problem (which will be referred to as the Knapsack with Conflicts [KC] problem). The KC problem extends the classical Knapsack problem by introducing disjunctive constraints for pairs of items that are not allowed to be packed together into the knapsack [23,24]. The algorithm to simplify the ISB, which will be referred to as the KC-based ISS algorithm, processes the columns of the ISM iteratively. With the goal of reducing the number of different elements in each column, a different instance of the KC problem is solved to determine which element of each row (either a FSM input or a DCSI) is located in the column. The remaining elements form the KC instance corresponding to the next column. Therefore, the algorithm processes as many instances of KC as there are ISM columns. The classical dynamic-programming-based method for the Knapsack problem [25] has been modified to solve each instance of KC. Regardless of the optimality of the method used to solve each instance of KC, the algorithm does not guarantee an optimal solution in terms of the selection cost because the elements of a column are selected without considering how the selection affects the remaining columns.

State grouping
The main goal of the SG stage is to reduce the ROM depth by forming groups of states that can be encoded with the same code (we will refer to them simply as groups). A set of states can form a group if there exists an assignment of constant values to the DCSIs that allows their transitions to be allocated in different ROM words. For example, the states $s_2$ and $s_3$ of $A_{ISS}$ (see Fig. 2a) can be identified by the symbol $g_{23}$ in matrix $A_{SG_1}$ (see Fig. 2c) because the 2nd input selector makes it possible to address the transitions of either $s_2$ (if 0 is selected) or $s_3$ (if 1 is selected). Let us say that these states are merged into the group $g_{23}$. So, in FSM transitions, the next states $s_2$ and $s_3$ are replaced by $g_{23}$, with $(x_1, 0, -)$ as input selection for $s_2$, and $(x_2, 1, -)$ for $s_3$. Fig. 2b shows the multiplexer corresponding to the 2nd input selector for the FSMIM-T architecture.
Given an ISM, the algorithm presented in [14] (we will refer to it as the old SG algorithm) processes all ISM columns iteratively in a certain order. Initially, each single state forms a different group. For each column, the DCSIs are set to 0 or 1 to merge pairs of the groups obtained after processing the previous columns. Fig. 2c shows an example of the SG procedure applied to $A_{ISS}$ (see Fig. 2a). First, $s_2$ and $s_3$ are merged into $g_{23}$ by setting the DCSIs of the second column (see $A_{SG_1}$). Then, $g_{23}$ and $s_1$ are merged into $g_{123}$ by setting, in the third column, the DCSI of $s_1$ to 0 and all DCSIs of $g_{23}$ to 1 (see $A_{SG_2}$). Note that this is possible because $g_{23}$ and $s_1$ have a DCSI in the same column for all their rows. Finally, $s_4$ and $s_5$ are merged into $g_{45}$ (see $A_{SG_3}$). SG reduces the number of groups from 6 to 3 ($G = \{g_0, g_{123}, g_{45}\}$); so, the number of encoding bits is reduced from 3 to 2. As the example shows, a low dispersion of the DCSIs over the ISM columns (i.e., many DCSIs in few columns) favours the effectiveness of the algorithm (we will refer to this concept as DCSI dispersion). For example, when SG is applied to A (see Fig. 2a), which has higher DCSI dispersion than $A_{ISS}$, only the pairs of states $(s_2, s_3)$ and $(s_4, s_5)$ can be merged.
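The merging step can be pictured with a small Python sketch. It assumes, following the remark above, that two groups can be merged in a column when every row of both groups has a DCSI in that column; this is a simplified reading, not the algorithm of [14]. The rows for $s_2$ and $s_3$ follow the input selections given in the text, whereas the row for $s_1$ and the helper names `can_merge` and `merge` are hypothetical.

```python
DONT_CARE = "-"

def can_merge(ism, group_a, group_b, col):
    """Assumed condition: all rows of both groups hold a DCSI in `col`."""
    return all(ism[r][col] == DONT_CARE for r in group_a + group_b)

def merge(ism, group_a, group_b, col):
    """Return (new ISM, merged group) with constants 0/1 written in `col`."""
    if not can_merge(ism, group_a, group_b, col):
        raise ValueError("groups cannot be merged in this column")
    new_ism = [list(row) for row in ism]
    for r in group_a:
        new_ism[r][col] = "0"  # rows of the first group select constant 0
    for r in group_b:
        new_ism[r][col] = "1"  # rows of the second group select constant 1
    return new_ism, group_a + group_b

# Rows 0, 1, 2 stand for s1, s2, s3 (the s1 row is made up for the example).
ism = [
    ["x3", "x4", "-"],
    ["x1", "-", "-"],
    ["x2", "-", "-"],
]
step1, g23 = merge(ism, [1], [2], col=1)     # s2 and s3 -> g23
step2, g123 = merge(step1, [0], g23, col=2)  # s1 and g23 -> g123
print(step2)
```

Each merge consumes the DCSIs of one column, which is why concentrating DCSIs in few columns leaves more room for later merges.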

Proposed optimization process to generate FSMIMs from FSMs
In this paper, we present two new approaches for the ISS stage, which are based on the MMKPM problem [19]. In one of these approaches, the greedy algorithm presented in [19] has been applied for the first time to generate FSMIMs; in the other one, a new multiobjective variant of the MMKPM problem and the corresponding ILP formulation have been proposed. Unlike the greedy algorithm, the ILP formulation is able to obtain optimal solutions. In addition, we present a modification of the SG stage to achieve further improvements.

Proposed input selectors simplification
In order to improve the results obtained by the ISS stage, we proposed in [19] the MMKPM problem, along with an ILP formulation and a greedy algorithm. This problem, which is NP-complete, models the optimization problem involved in the ISS stage. The objective of both the KC-based ISS algorithm and the MMKPM problem is to minimize the complexity of the ISB by minimizing the selection cost. Unlike the solutions of the KC-based ISS algorithm, the optimal solutions of the MMKPM problem do correspond to the minimum selection cost. However, neither is necessarily optimal in terms of the number of selection bits. In FSMIM-T implementations, this affects both the ISB (which is controlled by the selection bits) and the ROM (which stores the selection bits). In Fig. 2a, the selection cost and the number of selection bits for A are $5 + 3 + 2 = 10$ and $\lceil \log_2 5 \rceil + \lceil \log_2 3 \rceil + \lceil \log_2 2 \rceil = 6$, respectively. If the input $x_2$ of the 4th row is moved from column 3 to column 2, the selection cost remains at 10, but the number of selection bits decreases from 6 to 5. This paper presents a multiobjective variant of the MMKPM problem, which, in addition to minimizing the selection cost, allows the ROM size to be further reduced by minimizing both the number of selection bits (which reduces the ROM width of the FSMIM-T architecture) and the DCSI dispersion (which improves the results of the SG stage for both FSMIM architectures [see Sect. 2.3.2]). This variant is called MMKPM with minimum selection cost, number of selection bits and DCSI dispersion (MMKPM-SSD).
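The arithmetic of this example can be checked with a short Python sketch that works directly on the number of inputs of each input selector. The function names are introduced here only for illustration, and the ceiling of the base-2 logarithm is computed with integer arithmetic.

```python
def selection_cost(inputs_per_selector):
    """Selection cost: total number of inputs over all input selectors."""
    return sum(inputs_per_selector)

def selection_bits(inputs_per_selector):
    """Sum of ceil(log2(c)) over selectors; a selector with c <= 1 needs 0 bits."""
    return sum((c - 1).bit_length() if c > 1 else 0 for c in inputs_per_selector)

before = [5, 3, 2]  # inputs per selector for A, as stated in the text
after = [5, 4, 1]   # after moving x2 of the 4th row from column 3 to column 2

print(selection_cost(before), selection_bits(before))  # prints "10 6"
print(selection_cost(after), selection_bits(after))    # prints "10 5"
```

Both arrangements have the same selection cost, so only an objective that also looks at the number of selection bits can tell them apart.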
For simplicity, we restrict the description of the MMKPM-SSD problem to the context of the ISM ([19] presents a general definition of the MMKPM problem).
Let $G = (R \cup X, E)$ be the bipartite graph associated with A, in which X is the set of FSM inputs and R represents the rows of A. We refer to the vertices of R and X as rows and inputs, respectively. The edges of $E \subseteq R \times X$ link the rows of A with their inputs. The degree of a vertex $s \in R \cup X$ is denoted by $\deg_G(s)$, and the set of edges in E that are incident to s, by $E(s)$. G cannot represent permutations of A because it does not contain information about columns. Let us define a partial-matching in G as a set of edges without common rows. Each column of A can be represented by a different partial-matching, which selects at most one input from each row (the absence of selected inputs in a row represents a DCSI). Let us define a maximal k-partial-matching in G as a collection $P = \{P_1, \ldots, P_k\}$ of k partial-matchings that define a partition of E. A maximal k-partial-matching determines a permutation of A in which the inputs of the i-th column are given by $P_i$. The selection cost can be calculated from P as $\sum_{i=1}^{k} |X(P_i)|$, where $X(P_i)$ denotes the set of inputs that are incident to any edge of $P_i$. The objective of the MMKPM problem is to find a maximal k-partial-matching $P = \{P_1, \ldots, P_k\}$ that minimizes this sum (i.e., a permutation of A with a minimum selection cost). The MMKPM-SSD problem adds two new objectives of lower priority: the minimization of the number of selection bits and the minimization of the DCSI dispersion. The number of selection bits can be calculated from P as $\sum_{i=1}^{k} \lceil \log_2 |X(P_i)| \rceil$. Regarding the DCSI dispersion, let us define the weighted cardinality of P as $\sum_{i=1}^{k} i\,|P_i|$. Given P, if a DCSI of $P_i$ is swapped with an input of $P_j$ of the same row (we say that the DCSI is moved from $P_i$ to $P_j$), then $|P_i|$ increases by one and $|P_j|$ decreases by one, and so the weighted cardinality decreases by $j - i$ when $i < j$. Hence, minimizing the weighted cardinality of P minimizes the DCSI dispersion in the resultant permutation of A because DCSIs are moved to partial-matchings with higher indices. For example, in Fig. 2a, the weighted cardinalities of the maximal k-partial-matchings associated with A and $A_{ISS}$ are $1 \cdot 5 + 2 \cdot 4 + 3 \cdot 2 = 19$ and $1 \cdot 6 + 2 \cdot 4 + 3 \cdot 1 = 17$, respectively.
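The weighted cardinality and the effect of moving a DCSI can be reproduced with a few lines of Python. The sketch takes the sizes $|P_1|, \ldots, |P_k|$ as a list; the function name is an illustrative choice.

```python
def weighted_cardinality(matching_sizes):
    """Sum of i * |P_i| with partial-matchings indexed from 1."""
    return sum(i * size for i, size in enumerate(matching_sizes, start=1))

sizes_a = [5, 4, 2]    # |P_1|, |P_2|, |P_3| for A, as in the text
sizes_iss = [6, 4, 1]  # the same for A_ISS

print(weighted_cardinality(sizes_a))    # prints "19"
print(weighted_cardinality(sizes_iss))  # prints "17"

# Moving one DCSI from P_1 to P_3 (i = 1, j = 3) turns [5, 4, 2] into
# [6, 4, 1] and lowers the weighted cardinality by j - i = 2.
```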
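Given an ISM $A = (a_{ij}) \in M_{q \times k}$ and the corresponding bipartite graph $G = (R \cup X, E)$, the MMKPM-SSD problem can be formulated as the problem of finding a maximal k-partial-matching $P = \{P_1, \ldots, P_k\}$ for which the tuple
$$\left( \sum_{i=1}^{k} |X(P_i)|,\ \sum_{i=1}^{k} \lceil \log_2 |X(P_i)| \rceil,\ \sum_{i=1}^{k} i\,|P_i| \right) \qquad (4)$$
is minimum in lexicographical order. The following is the proposed 0/1 ILP formulation for the MMKPM-SSD problem, which is an extension of that presented in [19]. Let the variables $y_{x,i}, z_{e,i} \in \{0, 1\}$ be defined so that $y_{x,i} = 1$ if and only if $x \in X(P_i)$ and $z_{e,i} = 1$ if and only if $e \in P_i$, and let $d_{j,i} \in \{0, 1\}$ be defined as the bit of weight $2^j$ in the binary representation of the value $2^{\lceil \log_2 |X(P_i)| \rceil} - 1$, which is a string of 1s whose length is equal to $\lceil \log_2 |X(P_i)| \rceil$. These variables allow (4) to be expressed as a tuple of linear terms: $\sum_{i} \sum_{x \in X} y_{x,i}$ for the selection cost, $\sum_{i} \sum_{j} d_{j,i}$ for the number of selection bits, and $\sum_{i} i \sum_{e \in E} z_{e,i}$ for the weighted cardinality [see (7) and (8)]. The ILP formulation minimizes the objective (9) subject to constraints (12)-(19). In (9), Q and D [see (10) and (11)] are used to sort the solutions lexicographically. Q is an upper bound of the maximum weighted cardinality because $|R| \ge |P_i| = \sum_{e \in E} z_{e,i}$. D satisfies (8) because $|X(P_i)| \le \min\{|R|, |X|\}$ (by definition of a partial-matching), and therefore $kD$ is an upper bound of the number of selection bits.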
Constraint (12) ensures that each $P_i$ has at most one edge incident to each row, i.e., that $P_i$ is a partial-matching. Constraint (13) ensures that one edge cannot belong to different partial-matchings, i.e., that the k partial-matchings are disjoint. In addition, each edge belongs to a partial-matching due to (14); therefore, the feasible solutions are maximal k-partial-matchings. Constraints (15) and (16) ensure the coherence of $z_{e,i}$ and $y_{x,i}$. Constraint (15) guarantees that an edge $e \equiv (r, x)$ belongs to $P_i$ only if the input $x \in X(P_i)$ (i.e., $z_{e,i}$ is equal to 1 only if $y_{x,i} = 1$), and (16) imposes that an input x belongs to $X(P_i)$ only if there exists at least one edge $e \equiv (r, x) \in P_i$. Regarding the number of selection bits, (17) ensures that $d_{j,i} = 0$ for all j such that $2^j > |X(P_i)| - 1$; (18), that $d_{k,i} = 1$ for the greatest k such that $2^k \le |X(P_i)| - 1$; and (19), that $d_{j,i} = 1$ for all $j < k$. As a result, given i, the number of $d_{j,i}$ variables whose value is 1 is equal to the number of bits required for the binary representation of $|X(P_i)| - 1$ (i.e., $\lceil \log_2 |X(P_i)| \rceil$).
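For orientation, the matching and coherence conditions described in this paragraph could be written as the linear constraints below. This is only one possible reading of the prose, not the numbered constraints of the formulation: the disjointness and coverage conditions are merged here into a single equality (an assumption), and the constraints on $d_{j,i}$ are omitted because their exact linear form cannot be inferred from the description alone.

```latex
\begin{align*}
\sum_{e \in E(r)} z_{e,i} &\le 1
  && \forall r \in R,\ 1 \le i \le k
  && \text{(each $P_i$ is a partial-matching)} \\
\sum_{i=1}^{k} z_{e,i} &= 1
  && \forall e \in E
  && \text{(each edge lies in exactly one $P_i$)} \\
z_{e,i} &\le y_{x,i}
  && \forall e \equiv (r,x) \in E,\ 1 \le i \le k
  && \text{($e \in P_i$ only if $x \in X(P_i)$)} \\
y_{x,i} &\le \sum_{e \in E(x)} z_{e,i}
  && \forall x \in X,\ 1 \le i \le k
  && \text{($x \in X(P_i)$ only if some edge of $x$ is in $P_i$)}
\end{align*}
```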
The ILP formulation of the MMKPM-SSD problem differs from that of the MMKPM problem in the objective [see (9)] and in the new constraints (17), (18), and (19).
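To make the lexicographic objective concrete, the following Python sketch solves a tiny, hypothetical instance by exhaustive search over the row permutations of an ISM. It is not the ILP formulation and does not scale beyond toy sizes; it only shows how the three criteria are ranked. All names and the example matrix are assumptions made for the illustration.

```python
from itertools import permutations, product

DONT_CARE = "-"

def objective(ism):
    """Return (selection cost, selection bits, weighted cardinality)."""
    cost = bits = weighted = 0
    for col in range(len(ism[0])):
        entries = [row[col] for row in ism if row[col] != DONT_CARE]
        distinct = len(set(entries))
        cost += distinct
        # ceil(log2(distinct)), with 0 bits for empty or single-input columns
        bits += (distinct - 1).bit_length() if distinct else 0
        weighted += (col + 1) * len(entries)  # columns are indexed from 1
    return cost, bits, weighted

def best_permutation(ism):
    """Exhaustively search all row permutations; tuples compare lexicographically."""
    row_options = [sorted(set(permutations(row))) for row in ism]
    best = min(product(*row_options), key=objective)
    return [list(row) for row in best]

ism = [
    ["x1", "x2", "-"],
    ["-", "x1", "x2"],
    ["x3", "-", "x1"],
]
best = best_permutation(ism)
print(objective(ism), objective(best))  # prints "(6, 3, 12) (3, 0, 10)"
print(best)
```

In this instance the minimum selection cost (3) can be reached either with two inputs sharing a column or with one input per column; the second criterion prefers the latter because it needs no selection bits, and the third criterion then places the most populated columns first.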

Proposed state grouping
With the goal of reducing the depth of the ROM of FSMIM implementations as much as possible, the old SG algorithm processes the ISM until no more groups can be merged (i.e., until the depth of the ROM cannot be further reduced). However, the depth of the implemented ROM is usually greater than that obtained by the algorithm because of the discrete size of EMBs. For example, the FSMIM of Fig. 2a has $|S| = 6$ states, $m = 3$ effective inputs and $p = 3$ state encoding bits. The FSMIM-T corresponding to the ISM $A_{ISS}$ requires $r = \lceil \log_2 3 \rceil + \lceil \log_2 2 \rceil + \lceil \log_2 1 \rceil = 3$ selection bits. Supposing that the number of outputs (n) is 8, the FSMIM-T requires a ROM of $2^m \cdot |S| = 48$ words of $n + p + r = 14$ bits, that is, a ROM of 672 bits [this value can be obtained with (2)]. If we suppose EMBs of 512 bits with 32 words of 16 bits, then the FSMIM-T implementation requires two EMBs (this is only a toy example to illustrate the concept mentioned above, since EMBs are larger in real devices). After the old SG algorithm reduces the number of groups from 6 to 3 (see $A_{SG_3}$ in Fig. 2c), the depth of the required ROM is reduced from 48 to $2^m \cdot 3 = 24$ words, whereas the size of each word increases from 14 to 16 bits because $r = \lceil \log_2 3 \rceil + \lceil \log_2 4 \rceil + \lceil \log_2 3 \rceil = 6$ and $p = \lceil \log_2 3 \rceil = 2$. So, the FSMIM-T can be implemented in a single EMB. However, the FSMIM-T corresponding to the ISM $A_{SG_2}$ (Fig. 2c), which has 4 groups instead of 3, can also be implemented in one EMB because it requires a ROM of $2^m \cdot 4 = 32$ words of 16 bits (since $r = \lceil \log_2 3 \rceil + \lceil \log_2 4 \rceil + \lceil \log_2 3 \rceil = 6$ and $p = \lceil \log_2 4 \rceil = 2$). Therefore, in this example, the old SG algorithm merges more states than necessary because 4 groups are enough to minimize the number of required EMBs.
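The numbers of this toy example can be reproduced with the following Python sketch. It assumes a single EMB configuration of 32 words of 16 bits, as in the example, and a simple tiling estimate (number of EMBs in depth times number of EMBs in width); the function names are illustrative, and real devices offer several depth/width configurations.

```python
from math import ceil

def bits_for(count):
    """ceil(log2(count)) for count >= 1, computed with integer arithmetic."""
    return (count - 1).bit_length()

def fsmim_t_rom(m, groups, n, selector_inputs):
    """Return (depth in words, width in bits) of the FSMIM-T ROM."""
    r = sum(bits_for(c) for c in selector_inputs)  # selection bits
    p = bits_for(groups)                           # group encoding bits
    return (2 ** m) * groups, n + p + r

def embs_needed(depth, width, emb_depth=32, emb_width=16):
    """Toy estimate: EMBs tiled in depth and width (assumed EMB model)."""
    return ceil(depth / emb_depth) * ceil(width / emb_width)

cases = {
    "A_ISS": fsmim_t_rom(m=3, groups=6, n=8, selector_inputs=[3, 2, 1]),
    "A_SG2": fsmim_t_rom(m=3, groups=4, n=8, selector_inputs=[3, 4, 3]),
    "A_SG3": fsmim_t_rom(m=3, groups=3, n=8, selector_inputs=[3, 4, 3]),
}
for name, (depth, width) in cases.items():
    print(name, depth, width, depth * width, embs_needed(depth, width))
```

Under these assumptions, the 4-group and 3-group ISMs both fit in one EMB, whereas the ungrouped ISM needs two.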
In the cases in which reaching the minimum number of groups is not necessary to minimize the number of required EMBs, the ISM obtained by the algorithm is more complex than necessary due to the constant values assigned to DCSIs, which has a negative impact on speed and area. In both FSMIM architectures, the number of required LUTs can increase because, in FSMIM-T, the number of selection bits grows, and, in FSMIM-S, the selection function corresponding to the ISB has fewer don't care outputs. This increase in the number of LUTs can cause a decrease of the maximum operating frequency. Regarding EMBs, the width of the ROM of the FSMIM-T architecture grows with the number of selection bits, which can increase the number of required EMBs. However, this effect does not occur in FSMIM-S implementations, in which the ROM width remains constant because it only depends on the state encoding bits and the number of outputs. To sum up, merging more groups than necessary can increase the number of LUTs required by both FSMIM architectures as well as the number of EMBs required by FSMIM-T implementations (but not by FSMIM-S implementations). Therefore, from the point of view of the implementation results, the optimal number of groups is not the minimum value found by the old SG algorithm but the maximum number of groups that minimizes the number of EMBs required by the implementation.
This paper proposes an algorithm to find the optimal number of groups (called the EMB-granularity-based SG algorithm), which is a modification of the old SG algorithm. After each pair of groups is merged, the obtained ISM is used to estimate the number of EMBs required by the corresponding FSMIM implementation, and it is considered as the candidate solution only if the number of EMBs decreases. After all columns have been processed, the last ISM chosen as candidate solution determines the best solution (i.e., the one with the optimal number of groups). To estimate the number of EMBs, the algorithm takes into account the available configurations for the depth and width of the EMBs of the target device [21,22].
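The candidate-selection rule can be sketched in Python as follows. The sketch reuses the toy FSMIM-T setting of Sect. 3.2 (m = 3, n = 8, EMBs of 32 words of 16 bits) and assumes that "decreases" means fewer EMBs than the current candidate. The value r = 4 for the intermediate ISM with 5 groups is an assumption made for the example, as are all names.

```python
from math import ceil

def estimate_embs(groups, r, m=3, n=8, emb_configs=((32, 16),)):
    """Toy FSMIM-T estimate: best EMB count over the given (depth, width) configs."""
    depth = (2 ** m) * groups
    width = n + (groups - 1).bit_length() + r
    return min(ceil(depth / d) * ceil(width / w) for d, w in emb_configs)

def choose_candidate(steps):
    """Keep the last ISM whose merge lowered the EMB estimate."""
    name, groups, r = steps[0]
    best, best_embs = name, estimate_embs(groups, r)
    for name, groups, r in steps[1:]:
        embs = estimate_embs(groups, r)
        if embs < best_embs:  # candidate only if the number of EMBs decreases
            best, best_embs = name, embs
    return best, best_embs

# (ISM name, number of groups, selection bits) after each merge.
steps = [
    ("A_ISS", 6, 3),
    ("A_SG1", 5, 4),  # r = 4 is assumed, not taken from the text
    ("A_SG2", 4, 6),
    ("A_SG3", 3, 6),
]
print(choose_candidate(steps))  # prints "('A_SG2', 1)"
```

In this toy run the last merge (from 4 to 3 groups) does not save any EMB, so the ISM with 4 groups is kept.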
Compared with the old SG algorithm, the EMB-granularity-based SG algorithm reduces the ISB complexity of both architectures and the ROM size of the FSMIM-T architecture; however, it cannot further reduce the ROM size of the FSMIM-S architecture.

Experimental results
Two different optimization strategies have been implemented, which differ in the technique applied in the ISS stage. One of them (called ILP strategy) applies the proposed ILP formulation to solve the MMKPM-SSD problem and the other (called greedy strategy) applies the greedy algorithm presented in [19] to solve the MMKPM problem. In the SG stage, both strategies apply the EMB-granularity-based SG algorithm. The main purpose of this section is to compare both strategies with the optimization process published in [14] (hereinafter called OLD-FSMIM), which applies the KC-based ISS algorithm in the ISS stage and the old SG algorithm in the SG stage. In addition, the FSMIM implementations obtained with the presented strategies are compared with conventional FSM implementations based on ROM (CONV-ROM) and with those based on LUT (CONV-LUT).
All designs have been synthesized and implemented in an Intel Max 10 device (10M50DAF484C6GES) using Quartus Prime software version 18.0. Therefore, the presented results include routing overhead. The maximum clock frequency and the resource utilization (EMBs and LUTs) of the implemented FSMs have been obtained using speed and area optimization, respectively. As the main goal of EMB-based FSM implementations is to save LUTs by using EMBs, their efficiency can be measured by calculating the number of LUTs saved with respect to CONV-LUT per used EMB (this measure will be called saved LUTs per EMB [SLPE]).
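Read literally, the SLPE measure is the difference in LUTs divided by the number of EMBs. A minimal Python sketch, with made-up figures that are not experimental results, is:

```python
def slpe(luts_conv_lut, luts_emb_based, embs_used):
    """Saved LUTs per EMB with respect to the CONV-LUT implementation."""
    if embs_used <= 0:
        raise ValueError("an EMB-based implementation uses at least a fraction of an EMB")
    return (luts_conv_lut - luts_emb_based) / embs_used

# Hypothetical figures: 120 LUTs for CONV-LUT, 30 LUTs and 1.5 EMBs otherwise.
print(slpe(120, 30, 1.5))  # prints "60.0"
```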
The target device includes 182 EMBs of 9 Kbits, which can be split into two independent EMBs of 4.5 Kbits (residual 4.5 Kbits EMBs are computed as 0.5 EMBs). The experimental study uses the IWLS93 standard benchmark set [26] (composed of 43 FSMs) and a benchmark set generated by the BenGen tool [27] (composed of 150 FSMs) (these FSMs will be referred to as medium-sized test cases because the majority have that size). We have discarded the cases in which CONV-ROM only requires half an EMB (i.e., the minimum amount of memory that can be instantiated) and the cases in which the FSMIM technique cannot be applied (i.e., in which the number of effective inputs is equal to the number of inputs and, after the ISS stage, there is no ISM column with two or more DCSIs). The final number of used FSMs was 57. In addition, in order to evaluate the proposed technique with larger FSMs, we have generated 72 synthetic test cases (which will be referred to as large-sized test cases). The first and third quartiles for the ROM size in bits of the CONV-ROM implementations for medium-sized and large-sized test cases are, in this order, $1.7 \times 10^4$, $3.1 \times 10^5$, $3.0 \times 10^6$ and $2.7 \times 10^7$. Thus, the sweep range is wide and the samples are quite homogeneously distributed.
As MMKPM-SSD is an NP-complete problem, the computation time spent by the corresponding ILP formulation to find an optimal solution is expected to be high when the size of the problem instance grows. In 54% of the total cases (including both medium-sized and large-sized test cases), Gurobi reaches the time limit, and so the solution found is not guaranteed to be optimal. However, as Sects. 4.1 and 4.2 show, even in these cases, the optimization process obtains better results using the ILP formulation than using the greedy algorithm for the MMKPM problem or the KC-based ISS algorithm. For the cases in which the time limit is reached, the average gap (which indicates how far the obtained solution is from the optimal solution) is 22%. Therefore, the optimization results could be improved by increasing the time limit. Regarding the other approaches to solve the ISS stage, the execution times of the greedy algorithm and the KC-based ISS algorithm are 0.03 s and 134 s, respectively. The average speed-up obtained by the greedy algorithm is 2,282. In addition to the fact that the greedy algorithm is the fastest, the optimization process obtains better results using it than using the KC-based ISS algorithm (see Sects. 4.1 and 4.2).

Comparison between the proposed strategies and OLD-FSMIM using medium-sized test cases
The results obtained using the medium-sized test cases are summarized in Table 1. One test case has been excluded from the statistical data because the FSMIM-T implementation generated by OLD-FSMIM requires more than 182 EMBs; however, the implementation was possible using the proposed strategies.
Regarding the EMB usage, as explained in Sect. 3.2, the EMB-granularity-based SG algorithm cannot further reduce the ROM size of the FSMIM-S architecture compared with the old SG algorithm. In addition, in the FSMIM optimization process, the influence of the ISS stage on the ROM size of the FSMIM-S architecture is not very significant because the ISS stage cannot reduce the ROM width of this architecture (as occurs in the FSMIM-T one): this stage can only improve the results of the subsequent SG stage by concentrating the DCSIs in the minimum number of ISM columns. Therefore, the effect of the proposed strategies on the EMB reduction of the FSMIM-S architecture is less likely to be observed in medium-sized FSMs. This is confirmed by the results (the average reduction is 0% for both strategies). However, for FSMIM-T implementations, the ILP strategy reduces the number of EMBs in 44% of the cases in which OLD-FSMIM uses more than 0.5 EMBs (i.e., in which a further reduction is possible); for these cases, which represent 30% of the whole sample, the average reduction is 27%. The greedy strategy obtains similar results. Regarding the LUT usage, the average reduction obtained for all architectures and strategies ranges from 22 to 24%, with hit rates from 70 to 85%. The miss rates in FSMIM-T implementations are null or negligible; however, they reach 19% in FSMIM-S ones. The reason is that, unlike the multiplexer bank of the FSMIM-T architecture, the complexity of the ISB of the FSMIM-S one depends not only on the number of inputs but also on the complexity of the logic functions. Therefore, the minimum number of LUTs is not always reached with the minimum selection cost.
Regarding speed, the difference between the hit rate (70%) and the miss rate (21%) for FSMIM-T implementations indicates that, in general, in addition to reducing the FPGA resources, the proposed strategies increase the speed (although the average increment is small); however, this improvement is not so clear in the case of the FSMIM-S architecture for the reasons mentioned above.
On average, the results of the ILP strategy do not show significant improvements with respect to those of the greedy strategy. In general, the reason could be that the MMKPM problems related to medium-sized test cases are not large enough. As it is difficult to directly compare both strategies using the results of Table 1, we have calculated the improvements in EMB and LUT usage of the ILP strategy with respect to the greedy one. In FSMIM-T implementations, the hit rate is 12% in EMB reduction and 21% in LUT reduction, with an average reduction for the success cases of 17% and 20%, respectively. In both comparisons, the miss rate is null or negligible. In FSMIM-S implementations, despite the lower influence of the ISS stage on the complexity of the ISB, the hit rate in LUT reduction is 22% whereas the miss rate is only 14%.
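The statistics reported in this comparison can be reproduced with a short sketch. It assumes that the hit rate is the fraction of test cases in which the proposed strategy uses fewer resources than the baseline, that the miss rate is the fraction in which it uses more, and that reductions are relative to the baseline; the function name and the sample values are hypothetical.

```python
def compare_usage(baseline, proposed):
    """Summarize resource usage (lower is better) of `proposed` against `baseline`.

    Returns the hit rate, the miss rate, the mean relative reduction over all
    cases, and the mean relative reduction over the success (hit) cases only.
    """
    reductions = [(b - p) / b for b, p in zip(baseline, proposed)]
    hits = [r for r in reductions if r > 0]
    misses = [r for r in reductions if r < 0]
    n = len(reductions)
    return {
        "hit_rate": len(hits) / n,
        "miss_rate": len(misses) / n,
        "mean_reduction": sum(reductions) / n,
        "mean_reduction_hits": sum(hits) / len(hits) if hits else 0.0,
    }


if __name__ == "__main__":
    # Made-up LUT counts for four test cases (illustration only).
    stats = compare_usage([100, 200, 50, 80], [80, 200, 55, 40])
    print(stats)
```

Reporting the mean over the success cases separately, as done in the text, shows how large the improvement is when one occurs, even if the mean over all cases is small.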

Comparison between the proposed strategies and OLD-FSMIM using large-sized test cases
The results obtained using the large-sized test cases are summarized in Table 2. These results show that the effectiveness of the proposed strategies grows with the size of the FSM. Both the area and speed results for FSMIM-T implementations are significantly improved when the strategies are applied to larger FSMs (compare HR and Mean of Tables 1 and 2). Regarding FSMIM-S implementations, the number of EMBs is reduced with respect to OLD-FSMIM (the hit rate achieved by the ILP strategy is 43% with an average reduction of 23% for these cases). Therefore, the proposed strategies obtain the best results considering the main goal of the FSMIM-S architecture (i.e., the implementation of FSMIMs with a minimum number of EMBs, the reduction of the number of LUTs being secondary, as explained in Sect. 2.2). However, in general, the improvement in the EMB reduction obtained by the SG stage negatively affects the ISB complexity and degrades the results of the ISS stage.
In FSMIM-S implementations, this degradation, along with the lower influence of the ISS stage on the complexity of the ISB, reduces the improvement in LUT usage, which explains the decrease in the average LUT reduction for large-sized FSMs. Despite this, in both architectures, the net effect is that the efficiency of the proposed strategies in the use of resources increases significantly with the size of the FSM, as the SLPE results indicate (compare the SLPE of Tables 1 and 2). In conclusion, these results show that, compared to OLD-FSMIM, the effectiveness of the proposed strategies grows with the size of the FSM for both FSMIM architectures.
Regarding the comparison between the greedy and ILP strategies, although the greedy one achieves significant improvements over OLD-FSMIM, the ILP strategy further improves these results (significantly so for FSMIM-T). To highlight the differences between the proposed strategies, the improvements of the ILP strategy with respect to the greedy one are summarized in Table 3. The hit rate in EMB reduction is 76% in FSMIM-T implementations and 32% in FSMIM-S ones (about 10 times larger than the miss rates); in both cases, the average reduction for the success cases is close to 15%. These improvements are due to the lower DCSI dispersion obtained by the ILP formulation in the ISS stage, which allows the SG stage to further reduce the number of groups.
The LUT reductions obtained by the ILP strategy with respect to the greedy one are also significant for FSMIM-T (with an average reduction of 14% and a hit rate of 72%); however, they are negligible for FSMIM-S (although the hit rate is 60%). Despite this, in both cases, the SLPE results show that the ILP strategy is clearly more efficient than the greedy one in the use of resources. The hit rate in SLPE increment is 75% in FSMIM-T implementations; on average, the ILP strategy saves 15% more LUTs per used EMB than the greedy strategy, and this value increases to 21% for the success cases. Although the average values are lower for FSMIM-S implementations, in 75% of the cases, the ILP strategy saves on average 11% more LUTs per used EMB.
Finally, although the improvements in speed are small, the hit rates are at least 68% while the miss rates are at most 32%. These results show that the ILP strategy can improve the resource utilization without degrading the speed with respect to the greedy one, and can even increase it. In conclusion, the greedy strategy, which is faster and does not require any solver, offers a good balance between the quality of results and the computation time; so, it is a suitable candidate if the requirements are not very demanding. However, if the design requirements are not met, the ILP strategy should be used.
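The SLPE comparison can be made concrete with a small sketch. It assumes that SLPE denotes the number of LUTs saved with respect to a LUT-only reference implementation divided by the number of EMBs used; this reading is inferred from the phrase "LUTs per used EMB" and is not a definition taken from this section. The function names and the sample figures are hypothetical.

```python
def slpe(luts_reference: float, luts_used: float, embs_used: float) -> float:
    """Saved LUTs per EMB: (reference LUTs - used LUTs) / used EMBs (assumed definition)."""
    if embs_used <= 0:
        raise ValueError("embs_used must be positive")
    return (luts_reference - luts_used) / embs_used


def relative_increment(new: float, old: float) -> float:
    """Relative increment of `new` with respect to `old`."""
    return (new - old) / old


if __name__ == "__main__":
    # Made-up figures: the second implementation uses fewer LUTs and fewer EMBs.
    slpe_first = slpe(1000, 200, 2.0)   # 800 saved LUTs over 2 EMBs
    slpe_second = slpe(1000, 150, 1.5)  # 850 saved LUTs over 1.5 EMBs
    print(relative_increment(slpe_second, slpe_first))
```

Under this reading, SLPE rewards an implementation both for saving LUTs and for doing so with fewer EMBs, which is why it can improve even when the LUT reduction alone is negligible.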

Comparison between the proposed strategies and the conventional FSM implementations
The FSMIM implementations generated by the proposed strategies are compared with the conventional FSM implementations (CONV-ROM and CONV-LUT). This study uses the same benchmark set (the medium-sized test cases) and the same measures as in [14]. However, the logic structure of the target device (an Intel FPGA) is different from those used in previous work on FSMIMs (Xilinx FPGAs). We have not used the large-sized test cases because they have been generated exclusively to evaluate the EMB-based FSM implementations, whose results mainly depend on structural properties of the FSM (number of states, transitions, inputs, effective inputs, and outputs). Therefore, we understand that these benchmarks provide less confidence to evaluate CONV-LUT, whose results significantly depend on the complexity of the transition and output functions. The FSMs of CONV-LUT and the ROMs of CONV-ROM have been described according to VHDL templates provided by Intel. All FSMIM approaches achieve an average EMB reduction with respect to CONV-ROM of at least 70%, with a hit rate of 100% (see Table 4). Regarding the number of used LUTs with respect to CONV-LUT (see Table 5), for all FSMIM approaches, the average reduction is at least 85% (with minimum reductions of at least 48%). Although FSMIM approaches use more LUTs than CONV-ROM, they attain greater SLPE values in all test cases, with an average increment of at least 941% (see Table 4). Such high values show that FSMIM approaches are more efficient than CONV-ROM in the use of EMBs, with FSMIM-S approaches being the most efficient, since they significantly reduce the number of LUTs by using a limited number of EMBs. Regarding speed, although the aim of the EMB-based techniques is to save LUTs, FSMIM-T approaches are faster than CONV-LUT in 35% of cases, with significant improvements in 25% of cases (see Q3 in Table 5).
On average, CONV-ROM obtains slightly better results than FSMIM-T approaches but at the expense of a significant increase in the number of EMBs. Therefore, FSMIM-T approaches achieve the best balance between speed and resource usage. With a hit rate of at most 33%, FSMIM-S approaches obtain the worst results (this is the price to pay for being the technique that uses the least number of EMBs and saves the greatest number of LUTs per EMB [see Table 5]). The critical path of CONV-ROM implementations includes one EMB and, possibly, the LUTs required to join EMBs [20]. However, the critical path of FSMIM implementations also includes the LUTs used to implement the ISB or the GE. So, due to the delay imposed by the ISB/GE, FSMIM implementations cannot compete in terms of speed with the CONV-ROM implementations that use a small number of EMBs. However, the number of LUTs required to join EMBs and the routing overhead increase with the number of EMBs, producing a significant increase of the critical path delay. When the number of EMBs of CONV-ROM implementations is large enough, the reduction of the critical path delay due to the reduction in the number of EMBs achieved by FSMIM approaches can compensate for the penalty imposed by the ISB/GE. This effect can be observed in Fig. 3, which shows the relationship between the speed increment of FSMIM implementations with respect to CONV-ROM implementations and the ROM size of CONV-ROM. It includes the results obtained in FSMIM-S and FSMIM-T implementations using the ILP strategy for all test cases whose CONV-ROM implementation fits in the target device. The speed increment grows with the ROM size, especially on the right-hand side of the figure (for these cases, FSMIM approaches are the best EMB-based alternative for both speed and EMB usage). As could be expected, this trend is clearer in the case of the FSMIM-T architecture.
We must highlight that 11 of the 57 medium-sized test cases and 64 of the 72 large-sized test cases have been excluded because the implementation generated by CONV-ROM does not fit in the target device, as it requires more than 182 EMBs; however, these FSMs were successfully implemented by applying the FSMIM approaches.

Conclusion and future work
In this paper, new techniques for optimizing FSMIM implementations have been proposed. As in previous work, the optimization process is divided into the ISS and SG stages. From these optimization techniques, two strategies that differ in the ISS stage have been established. In one of them, the greedy algorithm for the MMKPM problem (proposed in [19]) has been applied for the first time to the generation of FSMIM implementations. The other strategy is based on the MMKPM-SSD problem and the corresponding ILP formulation, which have been proposed in this paper to improve the optimization results. Finally, the EMB-granularity-based SG algorithm, which is applied in both strategies, has been proposed to improve the results of the SG stage. The proposed strategies have been compared to OLD-FSMIM [14] using both medium-sized and large-sized FSM benchmark sets. Regarding the FSMIM-T architecture, these strategies significantly improve the area results. The ILP strategy reduces the number of EMBs in 44% of the medium-sized test cases in which OLD-FSMIM uses more than 0.5 EMBs, and in 97% of the large-sized test cases; in both benchmark sets, the average reduction for the success cases is at least 27% (there is only one case in which the number of EMBs increases). In addition, the average LUT reductions are 24% and 44%, with hit rates of 71% and 90%, respectively (and with insignificant miss rates of at most 7%). The ILP strategy saves 25% and 53% more LUTs per used EMB, with hit rates of 71% and 98% in medium-sized and large-sized test cases, respectively. The average speed increment is not significant for medium-sized test cases, although the hit rate is 70%; however, the improvement is more significant for large-sized test cases, with an average speed increment of 12% and a hit rate of 96%.
Regarding the FSMIM-S architecture, the proposed improvements in the optimization process have an indirect influence on the ROM size of this architecture. Due to this and to the smaller ROM size requirements of FSMIM-S implementations, the proposed strategies cannot improve the EMB usage for the medium-sized test cases. However, the ILP strategy reduces the number of EMBs in 43% of the large-sized test cases, with an average EMB reduction of 23% for these cases (the miss rate is only 1%). The relationship between the selection cost and the complexity of the ISB of the FSMIM-S architecture is indirect, which reduces the effectiveness of the ISS stage on LUT usage and speed. Nevertheless, the ILP strategy saves 9% and 23% more LUTs per used EMB in medium-sized and large-sized test cases, respectively; in both sets, the hit rates are greater than or equal to 81%. Finally, although the average speed increments are not significant (at most 5%), the hit rates are greater than the miss rates.
Compared with the conventional techniques, the FSMIM approaches are an interesting design alternative. All FSMIM approaches achieve an average reduction of the number of EMBs with respect to CONV-ROM of at least 70% with a hit rate of 100%. Comparing the LUT usage with respect to CONV-LUT, all FSMIM approaches obtain reductions greater than 47%, with average reductions of at least 85%. The significant number of LUTs saved per EMB shows that the FSMIM approaches are more efficient than CONV-ROM in the use of EMBs, with the FSMIM-S approaches being the most efficient. Regarding speed, although the aim of the EMB-based techniques is to save resources, FSMIM-T approaches are faster than CONV-LUT in 35% of cases. On average, CONV-ROM obtains slightly better results than FSMIM-T approaches but at the expense of a significant increase in the number of EMBs. Therefore, FSMIM-T approaches achieve the best balance between speed and resource usage. Moreover, we have shown that there exists a clear trend between the speed increment obtained by FSMIM approaches and the size of the FSM; thus, for large-sized FSMs, FSMIM approaches are the best EMB-based alternative for both speed and EMB usage. The good implementation results obtained by the proposed strategies prove the feasibility of the FSMIM technique in devices with an architecture different from that of Xilinx FPGAs (which have been used in previous work [10,13,14]).
The average improvements obtained by the ILP strategy with respect to the greedy one for medium-sized test cases are not significant; however, as the miss rates are null or negligible for the FSMIM-T architecture, the ILP strategy offers the possibility of improving the results with little risk of worsening them. For this architecture, in 12% of cases, the ILP strategy uses, on average, 17% fewer EMBs than the greedy one; the hit rate for LUT reduction is 21%, with an average reduction for the success cases of 20%. For large-sized test cases, the improvements of the ILP strategy over the greedy one are much more significant, reaching an adequate balance between EMB and LUT usage. For instance, the hit rate in SLPE increment is 75% in FSMIM-T implementations; on average, the ILP strategy saves 15% more LUTs per used EMB than the greedy strategy, and this value increases to 21% for the success cases. Although the average values are lower for FSMIM-S implementations, in 75% of the cases, the ILP strategy saves on average 11% more LUTs per used EMB.
To conclude, we propose some practical recommendations for researchers and designers in order to take advantage of the proposed architectures and strategies. The techniques studied can be sorted in ascending order of EMB usage and descending order of LUT usage as follows: CONV-LUT, FSMIM-S, FSMIM-T, and CONV-ROM. In addition, the ILP strategy requires fewer resources (both EMBs and LUTs) than the greedy one when both are applied to the same FSMIM architecture. The possibility of exploiting all kinds of resources available in FPGAs makes it possible to fit the design into a smaller (and cheaper) device. FSMIM approaches make it possible to find an adequate trade-off between LUT and EMB usage. In fact, they obtain huge reductions in the LUT utilization by using a reasonable number of EMBs. The FSMIM-S architecture is the best design option when the number of unused EMBs is limited and the speed is not critical; otherwise, the FSMIM-T architecture is a better option. Regarding the strategies, the greedy one, which is faster and does not require any solver, offers a good balance between the quality of results and the computation time; so, it is a suitable candidate if the design requirements are not very demanding. However, if these requirements are not met, the ILP strategy should be used.
As future work, we plan to extend this work in order to improve the presented results. For each kind of FPGA device, the number of LUTs of a multiplexer can be estimated from the number of inputs. We want to modify the objective function of the MMKPM-SSD problem to quantify the number of LUTs required by the multiplexers of the FSMIM-T implementation. This will allow the ISS stage to find optimal solutions in terms of LUT usage, which will improve the area results. Regarding speed, the input selector with the most inputs determines the critical path delay of the ISB; therefore, reducing the maximum number of inputs could reduce the delay. In this direction, we plan to propose another variant of the MMKPM-SSD problem for speed optimization that minimizes the maximum number of inputs of the input selectors instead of the selection cost. With the aim of improving the implementation results for the FSMIM-S architecture, we plan to analyze the influence of the encoding of states and groups on the number of LUTs required by the ISB and the GE. Finally, we will develop a new version of the free FSMIM-Gen tool [28] that includes the proposed optimization process. FSMIM-Gen starts from the specification of the FSM in KISS format [26] and generates a synthesizable VHDL description, which can be synthesized and implemented within the design flow of Electronic Design Automation (EDA) tools.
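The selection guidelines in this paragraph can be condensed into a small decision helper. This is only an illustrative encoding of the recommendations; the function name and its boolean parameters are hypothetical, and a real design decision would weigh measured resource and timing figures.

```python
def recommend(embs_limited: bool, speed_critical: bool,
              greedy_meets_requirements: bool) -> tuple:
    """Return (architecture, strategy) following the recommendations in the text.

    FSMIM-S is preferred when unused EMBs are scarce and speed is not critical;
    otherwise FSMIM-T. The greedy strategy is tried first, and the ILP strategy
    is used only when the greedy results do not meet the design requirements.
    """
    architecture = "FSMIM-S" if embs_limited and not speed_critical else "FSMIM-T"
    strategy = "greedy" if greedy_meets_requirements else "ILP"
    return architecture, strategy


if __name__ == "__main__":
    print(recommend(embs_limited=True, speed_critical=False,
                    greedy_meets_requirements=True))
```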
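As an illustration of how the LUT cost of a multiplexer could be estimated from its number of inputs, the following sketch uses a coarse structural model that is not taken from the paper: a k-input LUT is assumed to implement the largest 2^s:1 multiplexer whose data and select lines fit in it (2^s + s ≤ k), and wider multiplexers are built as trees of such cells. Actual synthesis tools may use fewer LUTs. The sketch also shows the two candidate objectives mentioned above: the total LUT estimate and the size of the widest input selector. All names are hypothetical.

```python
import math


def mux_lut_estimate(n_inputs: int, lut_inputs: int) -> int:
    """Estimate the LUTs needed for an n_inputs:1 multiplexer (coarse tree model)."""
    if n_inputs < 1 or lut_inputs < 3:
        raise ValueError("need n_inputs >= 1 and lut_inputs >= 3")
    # Largest number of select lines s such that a 2**s:1 mux fits in one LUT.
    s = 1
    while 2 ** (s + 1) + (s + 1) <= lut_inputs:
        s += 1
    fan_in = 2 ** s
    # Each mux cell in the tree removes (fan_in - 1) signals.
    return math.ceil((n_inputs - 1) / (fan_in - 1))


def total_mux_luts(selector_inputs, lut_inputs: int) -> int:
    """Area-oriented objective: estimated LUTs over all input selectors."""
    return sum(mux_lut_estimate(n, lut_inputs) for n in selector_inputs)


def max_selector_inputs(selector_inputs) -> int:
    """Speed-oriented objective: size of the widest input selector."""
    return max(selector_inputs)


if __name__ == "__main__":
    selectors = [4, 16]  # made-up numbers of inputs per selector
    print(total_mux_luts(selectors, 6), max_selector_inputs(selectors))
```

Under this model, two selections with the same total number of selector inputs can differ in estimated LUT count, which is the motivation for replacing the selection cost with a LUT-aware objective.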
Funding Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.