This paper presents a new merit function for the custom instruction selection phase of the design flow of application-specific instruction-set processors (ASIPs) in the presence of an area budget constraint. In contrast to nearly all previously proposed approaches, where the ratio of the ASIP speed to layout area is used as the merit function for selecting candidate custom instructions (CIs), we show that a merit function based on normalized cycle saving and normalized area can result in better CI selections, in terms of the achievable speedup under a given area budget, for both greedy and branch-and-bound techniques. The efficacy of the proposed approach is assessed by comparing the results obtained with the proposed and conventional merit functions on different benchmarks. The comparison indicates an average (maximum) speed enhancement of 3.65 % (27.4 %) for the proposed merit function compared to the conventional merit functions.
Clark NT, Zhong H, Mahlke S (2005) Automated custom instruction generation for domain-specific processor acceleration. IEEE Trans Comput 54:1258–1270. doi:10.1109/TC.2005.156
Pozzi L, Atasu K, Ienne P (2006) Exact and approximate algorithms for the extension of embedded processor instruction sets. IEEE Trans Comput-Aided Des Integr Circuits Syst 25:1209–1229. doi:10.1109/TCAD.2005.855950
Keutzer K, Malik S, Newton AR (2002) From ASIC to ASIP: the next design discontinuity. In: Proceedings of international conference on computer design: VLSI in computers and processors, pp 84–90. doi:10.1109/ICCD.2002.1106752
Lu YS, Shen L, Huang LB, Wang ZY, Xiao N (2009) Optimal subgraph covering for customisable VLIW processors. IET Comput Digit Tech 3:14–23. doi:10.1049/iet-cdt:20070104
Lam S-K, Srikanthan T, Clarke CT (2009) Selecting profitable custom instructions for area–time-efficient realization on reconfigurable architectures. IEEE Trans Ind Electron 56:3998–4005. doi:10.1109/TIE.2009.2017091
Bonzini P, Pozzi L (2008) Recurrence-aware instruction set selection for extensible embedded processors. IEEE Trans Very Large Scale Integr (VLSI) Syst 16:1259–1267. doi:10.1109/TVLSI.2008.2001863
Clark N, Hormati A, Mahlke S, Yehia S (2006) Scalable subgraph mapping for acyclic computation accelerators. In: Proceedings of international conference on compilers, architecture and synthesis for embedded systems, pp 147–157. doi:10.1145/1176760.1176779
Atasu K, Ozturan C, Dundar G, Mencer O, Luk W (2008) CHIPS: custom hardware instruction processor synthesis. IEEE Trans Comput-Aided Des Integr Circuits Syst 27(3):528–541. doi:10.1109/TCAD.2008.915536
Biswas P, Banerjee S, Dutt ND, Pozzi L, Ienne P (2006) ISEGEN: generation of high-quality instruction set extensions by iterative improvement. IEEE Trans Very Large Scale Integr (VLSI) Syst 14:754–762. doi:10.1109/DATE.2005.191
Clark N, Zhong H, Mahlke SA (2003) Processor acceleration through automated instruction set customization. In: Proceedings of the 36th annual IEEE/ACM international symposium on microarchitecture, pp 129–141. doi:10.1109/MICRO.2003.1253189
Kastrup B, Bink A, Hoogerbrugge J (1999) ConCISe: a compiler-driven CPLD-based instruction set accelerator. In: Proceedings of the seventh annual IEEE symposium on field-programmable custom computing machines, pp 92–101. doi:10.1109/FPGA.1999.803671
Goodwin D, Petkov D (2003) Automatic generation of application specific processors. In: Proceedings of international conference on compilers, architecture and synthesis for embedded systems, pp 137–147. doi:10.1145/951710.951730
Yazdanbakhsh A, Salehi ME, Fakhraie SM (2010) Architecture-aware graph-covering algorithm for custom instruction selection. In: Proceedings of the 5th international conference on future information technology, pp 1–6. doi:10.1109/FUTURETECH.2010.5482719
Muhammad R, Apvrille L, Pacalet R (2008) Evaluation of ASIPs design with LISATek. Lecture notes in computer science, vol 5114. Springer, Berlin, pp 177–186. doi:10.1007/978-3-540-70550-5_20
The LISATek™ solution: automated embedded processor design and software development tool generation. http://www.coware.com/PDF/products/LISATek.pdf
Biswas P, Dutt N, Ienne P, Pozzi L (2006) Automatic identification of application-specific functional units with architecturally visible storage. In: Proceedings of the conference on design, automation and test in Europe (DATE), pp 212–217. doi:10.1109/DATE.2006.244088
Cheung N, Parameswaran S, Henkel J (2003) INSIDE: instruction Selection/Identification and design exploration for extensible processors. In: Proceedings of the international conference on computer aided design, pp 291–297. doi:10.1109/ICCAD.2003.1257681
Clark N, Blome J, Chu M, Mahlke S, Biles S, Flautner K (2005) An architecture framework for transparent instruction set customization in embedded processors. In: Proceedings of the 32nd annual international symposium on computer architecture, pp 272–283. doi:10.1109/ISCA.2005.9
Gonzalez RE (2000) XTENSA: a configurable and extensible processor. IEEE Micro 20(2):60–70. doi:10.1109/40.848473
Scharwaechter H, Kammler D, Leupers R, Ascheid G, Meyr H (2011) A retargetable framework for compiler/architecture co-development. Des Autom Embed Syst 15:1–32. doi:10.1007/s10617-011-9080-8
Pan Y, Mitra T (2004) Characterizing embedded applications for instruction-set extensible processors. In: Proceedings of the design automation conference (DAC), pp 723–728
Galuzzi C, Bertels K (2011) The instruction-set extension problem: a survey. ACM Trans Reconfigurable Technol Syst 4(18):1–28. doi:10.1145/1968502.1968509
Liao S, Devadas S (1997) Solving covering problems using LPR-based lower bounds. In: Proceedings of the 34th annual conference on design automation (DAC’97), pp 117–120. doi:10.1145/266021.266046
Peymandoust A, Pozzi L, Ienne P, De Micheli G (2003) Automatic instruction set extension and utilization for embedded processors. In: Proceedings of the 14th international conference on application-specific systems, architectures and processors (ASAP’03), pp 108–118. doi:10.1109/ASAP.2003.1212834
Lam S-K, Srikanthan T (2009) Rapid design of area-efficient custom instructions for reconfigurable embedded processing. J Syst Archit 55(1):1–14
Brisk P, Kaplan A, Sarrafzadeh M (2004) Area-efficient instruction set synthesis for reconfigurable system-on-chip designs. In: Proceedings of the 41st annual design automation conference (DAC), pp 395–400
Zuluaga M, Topham N (2009) Design-space exploration of resource-sharing solutions for custom instruction set extensions. IEEE Trans Comput-Aided Des Integr Circuits Syst 28(12):1788–1801. doi:10.1109/TCAD.2009.2026355
The GNU operating system. www.gnu.org
Nangate 45 nm open cell library (2008). http://www.nangate.com
Ramaswamy R, Wolf T (2003) PacketBench: a tool for workload characterization of network processing. In: Proceedings of IEEE international workshop on workload characterization, pp 42–50. doi:10.1109/WWC.2003.1249056
Guthaus MR, Ringenberg JS, Ernst D, Austin TM, Mudge T, Brown RB (2001) MiBench: a free, commercially representative embedded benchmark suite. In: Proceedings of 4th IEEE international workshop on workload characterization, pp 3–14. doi:10.1109/WWC.2001.15
Lee C, Potkonjak M, Mangione-Smith WH (1997) MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. In: Proceedings of 30th annual IEEE/ACM international symposium on microarchitecture, pp 330–335. doi:10.1109/MICRO.1997.645830
In this section, we provide the motivation for the proposed merit function. For the sake of simplicity and without loss of generality, we make several simplifying assumptions for this example; they are marked by a star (∗) in the text below. Assume that, by using the exact identification algorithm of , all CIs of an application that meet the defined constraints (e.g., on the number of inputs and outputs) are identified. In this example, eighteen CIs are identified, and similar CIs (based on functional and structural isomorphism) are classified into seven internally similar CI groups.
The conflict graph, and the area (A) and cycle saving (CS) factor of each CI group, are shown in Fig. 18. Note that the CS values in the graph give the cycle saving of the CIs for a single iteration of the application code in which they appear. To keep the example simple, we assume that the CIs have no intra-conflict∗ (i.e., the CIs within each group do not conflict with each other), but they may have inter-conflict∗ (i.e., conflicts among CIs that belong to different CI groups). An edge between two nodes in Fig. 18 signifies that all CIs of the corresponding CI groups conflict with each other. Consequently, once a CI group is selected, all CI groups (nodes) connected to it by a conflict edge must be removed from further consideration.
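The removal rule above can be illustrated with a small sketch. It assumes the conflict graph is stored as a Python dictionary mapping each CI group to the set of groups it conflicts with, and, matching the example's assumption, conflicts are recorded only between whole groups. The function names and the four-group graph are hypothetical, not the data of Fig. 18.

```python
from typing import Dict, Iterable, Set, Tuple

# Each CI group maps to the set of CI groups it conflicts with.
ConflictGraph = Dict[str, Set[str]]


def build_conflict_graph(nodes: Iterable[str],
                         edges: Iterable[Tuple[str, str]]) -> ConflictGraph:
    """Build an undirected conflict graph; an edge means the two groups conflict."""
    graph: ConflictGraph = {n: set() for n in nodes}
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def select_group(candidates: Set[str], graph: ConflictGraph, chosen: str) -> Set[str]:
    """Return the candidates left after selecting `chosen`.

    The chosen group and every group connected to it by a conflict edge
    are removed from further consideration.
    """
    return candidates - {chosen} - graph[chosen]


# Hypothetical four-group example: P conflicts with Q and R, R conflicts with S.
graph = build_conflict_graph("PQRS", [("P", "Q"), ("P", "R"), ("R", "S")])
remaining = select_group(set(graph), graph, "P")
print(sorted(remaining))  # ['S']
```

Using a set difference keeps each removal step proportional to the number of candidates and conflict neighbours.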
Now, let us define the parameters CS and CS_Norm for each CI in a CI group (see Fig. 18). The parameter CS denotes the cycle saving factor of each CI in the group (in this example, all CIs of a CI group have the same cycle saving∗). Note that this assumption is made only for the motivational example, for the sake of simplicity. The purpose of this simple example is to demonstrate that, under the same conditions, the proposed merit function may improve the speedup of extensible processors compared to the existing merit functions. Different combinations of cycle savings could be assumed for the CIs in a group, and these combinations would make the proposed merit function more or less effective relative to the existing ones. Results on the efficacy of the proposed function for different combinations of cycle savings within a group are presented in Sect. 5. The parameter CS_Norm is the normalized cycle saving. For the ith CI in a CI group, this value is calculated from
We also define the parameters A, A_Norm, and Num_CIs for each CI group. Because all the CIs within a CI group use the same custom functional unit (CFU), the parameters A and A_Norm denote the area and the normalized area of the CFU used by that CI group, respectively. The parameter Num_CIs is the number of CIs in the CI group. The parameter A_Norm for the ith CI in each CI group is obtained from
The normalizations, which are performed with respect to the corresponding maximum values, yield values between 0 and 1 in both cases.
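A minimal sketch of the normalization, assuming each CI's cycle saving is divided by CS_max (the largest CS among all identified CIs) and each CFU area by the largest CFU area among the CI groups; this max-based form follows the description above, while the function names and the sample values are hypothetical.

```python
from typing import Dict, List


def normalize(values: List[float]) -> List[float]:
    """Divide each value by the maximum, so every result lies in [0, 1]."""
    peak = max(values)
    return [v / peak for v in values]


def group_cs_norm_sums(cs_per_ci: Dict[str, List[float]]) -> Dict[str, float]:
    """Sum of CS_Norm over the CIs of each group, using the global CS_max."""
    cs_max = max(cs for group in cs_per_ci.values() for cs in group)
    return {name: sum(cs / cs_max for cs in group)
            for name, group in cs_per_ci.items()}


# Hypothetical data: per-CI cycle savings (equal within a group, as assumed above)
# and the CFU area of each group.
cs_per_ci = {"P": [4, 4, 4, 4], "Q": [5, 5], "R": [2, 2, 2]}
cfu_area = {"P": 11.0, "Q": 6.0, "R": 2.0}

cs_norm_sums = group_cs_norm_sums(cs_per_ci)  # per-group sums of CS_Norm
a_norm = dict(zip(cfu_area, normalize(list(cfu_area.values()))))  # A_Norm per group
print(cs_norm_sums)
print(a_norm)
```

Here CS_max = 5, so each CI of group P has CS_Norm = 0.8 and the group's sum is 4 × 0.8 = 3.2, while P's CFU, being the largest, has A_Norm = 1.0.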
Either Eq. (4) or Eq. (5) is evaluated for all nodes of the conflict graph, and the node with the highest merit value is selected at each iteration of the selection algorithm. The nodes adjacent to the selected node in the conflict graph are then removed. This process continues until no node remains in the conflict graph or the area constraint is violated.
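The greedy loop just described can be sketched as follows, assuming each CI group is summarized by its total cycle saving and its CFU area. The merit function is passed in as a parameter, so Eq. (4) or Eq. (5) could be plugged in; as stand-ins, the two conventional merits CSPA and CyS are written in a simple unnormalized form (group cycle saving divided by CFU area, and group cycle saving alone), which is an assumption. Following the CyS discussion later in this section, groups that no longer fit the remaining budget are dropped after each selection. The class, function names, and four-group data are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class CIGroup:
    name: str
    cs: float    # total cycle saving of all CIs in the group
    area: float  # area of the CFU shared by the group


Merit = Callable[[CIGroup], float]


def cspa(g: CIGroup) -> float:
    """Conventional merit (stand-in): cycle saving per unit area."""
    return g.cs / g.area


def cys(g: CIGroup) -> float:
    """Conventional merit (stand-in): cycle saving only, area ignored."""
    return g.cs


def greedy_select(groups: Dict[str, CIGroup],
                  conflicts: Dict[str, Set[str]],
                  budget: float,
                  merit: Merit) -> List[str]:
    """Repeatedly pick the highest-merit group that still fits the budget."""
    candidates = {n for n, g in groups.items() if g.area <= budget}
    selected: List[str] = []
    while candidates:
        # Sorting first makes ties resolve deterministically (alphabetical order).
        best = max(sorted(candidates), key=lambda n: merit(groups[n]))
        selected.append(best)
        budget -= groups[best].area
        # Drop the chosen group, its conflict neighbours, and groups that no longer fit.
        candidates = {n for n in candidates - {best} - conflicts[best]
                      if groups[n].area <= budget}
    return selected


groups = {g.name: g for g in [CIGroup("P", 3, 1), CIGroup("Q", 12, 9),
                              CIGroup("R", 8, 5), CIGroup("S", 8, 5)]}
conflicts = {"P": {"R", "S"}, "Q": set(), "R": {"P"}, "S": {"P"}}
print(greedy_select(groups, conflicts, 10, cspa))  # ['P', 'Q']
print(greedy_select(groups, conflicts, 10, cys))   # ['Q', 'P']
```

With a budget of 10, both stand-in merits end with a cycle saving of 15, although the non-conflicting pair R and S would fit exactly and save 16; this is the kind of gap the example below examines.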
In this example, we assume that the area budget is 13 units. To select the CIs from the candidate set, we use the greedy approach. Since the design space of this example is very small, the branch-and-bound technique could also have been used to obtain the optimal CIs. The selected groups are depicted in Fig. 19. In this example, the merit values are calculated using the A_Norm and CS_Norm values of the CIs. First, we consider the case of the CSPA merit function. In the first iteration, CI group A has the highest merit value (4.40), so this node is selected. Because they conflict with node A, CI groups B, D, E, and G are removed. After this step, the remaining area is 10 units (13 − area(A) = 10). In the next (last) iteration, CI group F is selected from the remaining CI groups (C and F); since F conflicts with C, F is the last selected group. After selecting these groups, the final cycle saving may be calculated as
where CS_max is the maximum CS among all the identified CIs.
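Assuming CS_Norm is obtained by dividing each CI's cycle saving by CS_max, as the max-based normalization above suggests, the total cycle saving of a selection S of CI groups can be recovered from the normalized values as shown below (the notation is ours). As a consistency check under this assumption, the values quoted for group B in the CyS case (CS = 16, ∑CS_Norm = 3.2) would imply CS_max = 16 / 3.2 = 5.

```latex
CS_{Norm,i} = \frac{CS_i}{CS_{max}}
\quad\Longrightarrow\quad
CS_{total} = \sum_{g \in S} \sum_{i \in g} CS_i
           = CS_{max} \sum_{g \in S} \sum_{i \in g} CS_{Norm,i}
```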
If the CyS merit function is used, only group B is selected. The CS of this group is 16 (∑CS_Norm = 3.2), which is greater than that of any other CI group. Selecting group B requires groups A, C, and G to be removed due to conflicts. Moreover, since the remaining area budget is small (13 − area(B) = 2), no other CI group can be selected.
In this example, the maximum cycle saving achieved with the conventional merit functions is 16, using 11 area units (two area units remain unused). However, the optimal solution consists of groups E and G, which yields a cycle saving of 21 and uses the entire area budget. This shows that using CSPA or CyS as the merit function does not necessarily lead to the optimal solution. The reason is that CIs with few primitive nodes (such as adders and shifters) and small areas receive higher selection priority because of their larger CSPAs. On the other hand, CIs with many nodes usually have a higher cycle saving, but their many nodes (and large area) lead to a large number of conflicts with other CIs. Using CSPA as the merit value can therefore result in selecting CIs with few nodes and a lower CS instead of CIs with many nodes, a higher CS, but lower CSPAs. The two consequences of using CSPA as a merit function are the selection of low-CS CIs with few nodes and the removal of CIs with many nodes (and higher CSs) because of conflicts with the previously selected CIs.
Now, let us consider the CyS merit function, which selects the CIs with the highest CS without accounting for area usage in the merit function. In this case, after each CI selection, the available area budget is first updated, and then the CIs whose areas exceed the updated budget are removed from the candidate set. As mentioned before, CIs with higher CSs usually have larger areas and more conflicts with other CIs. Both effects limit the choices available for selecting the next CI and, hence, reduce the chance of increasing the speedup much further.
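Because the design space here is small, the optimal selection can be computed exactly. A minimal branch-and-bound sketch follows; it assumes a symmetric conflict graph and groups summarized by total cycle saving and CFU area, and it bounds each branch by the current saving plus the savings of all remaining candidates that still fit and do not conflict with the chosen groups. This is a generic search, not necessarily the branch-and-bound formulation used in the paper, and the function name and four-group data are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def branch_and_bound(cs: Dict[str, float],
                     area: Dict[str, float],
                     conflicts: Dict[str, Set[str]],
                     budget: float) -> Tuple[float, FrozenSet[str]]:
    """Exact max-cycle-saving set of mutually compatible groups within the budget."""
    # Explore high-saving groups first so good incumbents are found early.
    names: List[str] = sorted(cs, key=lambda n: cs[n], reverse=True)
    best: Tuple[float, FrozenSet[str]] = (0.0, frozenset())

    def search(idx: int, chosen: FrozenSet[str], saving: float, left: float) -> None:
        nonlocal best
        if saving > best[0]:
            best = (saving, chosen)
        # Undecided groups that still fit and do not conflict with the chosen ones.
        rest = [n for n in names[idx:]
                if area[n] <= left and not (conflicts[n] & chosen)]
        # Bound: even taking every remaining candidate cannot beat the incumbent.
        if saving + sum(cs[n] for n in rest) <= best[0]:
            return
        for k in range(idx, len(names)):
            n = names[k]
            if area[n] <= left and not (conflicts[n] & chosen):
                # Branch: include names[k]; names[idx:k] are skipped in this branch.
                search(k + 1, chosen | {n}, saving + cs[n], left - area[n])

    search(0, frozenset(), 0.0, budget)
    return best


cs = {"P": 3, "Q": 12, "R": 8, "S": 8}
area = {"P": 1, "Q": 9, "R": 5, "S": 5}
conflicts = {"P": {"R", "S"}, "Q": set(), "R": {"P"}, "S": {"P"}}
saving, chosen = branch_and_bound(cs, area, conflicts, 10)
print(saving, sorted(chosen))  # 16.0 ['R', 'S']
```

In this hypothetical instance, greedily taking the group with the largest cycle saving (Q) leaves room only for P and yields 15, whereas the exact search finds the pair R and S with a saving of 16, echoing how a high-CS first choice can block a better combination.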
In the case of the proposed merit function, group E, which has the highest merit value, is selected in the first iteration (see Fig. 19(c)). After this selection, groups A, D, and F are removed due to conflicts, and the area budget decreases from 13 to 4 (13 − area(E) = 4). Hence, in the next iteration, the choice is between groups C and G. Because the merit value of group G is greater than that of group C, G is selected in the second iteration. This reduces the remaining area budget to zero, which terminates the selection phase. Hence, for this example, the proposed merit function finds the optimal selection (groups E and G, with a cycle saving of 21) and outperforms the conventional merit functions.
About this article
Cite this article
Kamal, M., Yazdanbakhsh, A., Noori, H. et al. A new merit function for custom instruction selection under an area budget constraint. Des Autom Embed Syst 17, 1–25 (2013). https://doi.org/10.1007/s10617-013-9117-2
- Application-specific instruction-set processors
- Custom instruction
- Custom instruction selection
- Merit function