Fully Customizable Cores: TCE
Processor customization in TCE is usually conducted as a hardware-software co-design process. The processor design is iterated by varying the processor template parameters defined using the TCE files while adapting the software to better exploit the features, such as by calling custom operation intrinsics or enhancing parallelization opportunities. An example processor design is depicted in Fig. 3. The co-design process is supported by a set of tools, as illustrated in Fig. 4. Initially, the designer has a set of requirements and goals placed to the end result. It is usual to have a real time requirement as the primary requirement, given as a maximum run time for the programs of interest and the secondary goal is to optimize the processor for low power consumption or minimal chip area. In some cases, there can be a strict area and/or power budget which cannot be exceeded, and the goal is to design as fast processor as possible that still fits below the limit. The iterative customization process starts with an initial predesigned architecture. Using a graphical user interface tool called Processor Designer (ProDe), the designer can add, modify, and remove processor components. The connectivity between the components can be customized manually using the GUI or with an automated optimizer. Each iteration of the processor is evaluated by compiling the software of interest to the architecture using TCE’s re-targetable high-level language compiler and simulating the resulting parallel assembly code using an architecture simulator. The simulator statistics give the runtime of the program and the utilization of the different data-path components, indicating bottlenecks in the design. The architecture exploration cycle enables low effort evaluation of different custom operations. For a completely new processor operation, the designer describes the operation simulation behavior in C/C++, estimates its latency in instruction cycles when implemented in hardware, and adds the operation to one of the function units in the architecture. This way it is possible to see the effects of the custom hardware to the cycle count, before deciding whether to include it in the design or not. When a design point fulfilling the requirements has been found, or more accurate statistics of a design point is needed, the designer uses a tool called Processor Generator (ProGe), which produces an RTL implementation of the processor.
Automatic Generation of TTA Accelerators
The main components of TCE that provide substantial automation to the design flow are:
-
Processor Generator (ProGe), which produces an RTL implementation of the designed processor, and
-
Design Space Explorer (explore), which can be used to perform automated transformations on TTAs.
RTL Generation—
Thanks to the modular TTA template, the RTL generation is straightforward and reliable; the generator utility ProGe collects component implementations from a set of hardware databases (HDBs) and generates the interconnection network which connects the units. Instruction fetch and decoding logic is generated based on the given processor architecture description file. The designer has to implement any new function units in VHDL or Verilog and add them to an HDB. A tool is available to automatically validate the function unit implementation against its architecture simulation model. The generated RTL is fed to a standard third party synthesis and simulation flow. This step generates more detailed statistics of the processor at hand, which can again drive the architectural exploration process. The detailed implementation level statistics map trivially back to the design actions at the architecture level. For example, required chip area can be reduced by removing architecture components. Similarly, adding more pipeline stages to complex function units, or reducing the connectivity of the interconnection network helps increasing the maximum clock frequency.
Design Space Exploration—
TCE includes a plugin-based framework for automatic design space exploration. The framework is built around a design space database (DSDB) which stores processor configurations: plugins can then be invoked on a configuration to generate one or more modified configurations. Plugins are guided by the simulated performance of the design in benchmark programs, and by an area and power model characterized on a 130 nm ASIC technology. Two examples of plugins are GrowMachine which adds computational resources to the TTA until no more parallelism can be extracted from the benchmark programs, and ShrinkMachine which removes as many resources as possible while honouring a real-time constraint. The framework includes sufficient functionality to generate processors from scratch, but in practice a better quality is currently obtained by hand-crafting the design, while using the explorer to automate repetitive tasks. Explore has plugins designed to produce efficient TTA interconnection networks (ICs) [24]. The IC is particularly tedious to design by hand, and has a very large design space which poses difficulties for automated search. The main idea is to start from an IC topology identical to a conventional VLIW, and then shrink it through repeated bus merge and register file port merge operations. The resources to be merged are chosen by profiling the benchmark workload and generating a resource utilization covariance matrix: resources with low covariance are seldom used simultaneously and, therefore, can be merged without impacting the performance much. The resulting automated process is shown in Fig. 5.
Configuration of TTA Accelerators
A library of components has been built for integrating TTAs with AXI interfaces, including caches and Load/Store Units which communicate over AXI. Tasks such as starting and stopping execution are done through a debug and control interface, usually accessed through an AXI slave peripheral. The interface can currently be used to clock gate an unused accelerator by placing it in a break state; it would also be straightforward to add further low-power modes. The interface also supports debug features such as multiple breakpoints, stepped execution, and examination of various performance counters and internal state data of the processor. In order to facilitate testing of ASIC prototypes with a limited number of IO pins, a JTAG interface wrapper gives access to memories and control interfaces over a four-wire interface. See Fig. 6 for the default organization of a rvex pipeline.
Adaptable cores: ρ-VEX
The ρ-VEX is a dynamically reconfigurable VLIW processor. It is implemented in VHDL and can be synthesized to FPGA targets for prototyping. The reconfigurable concepts used in the core can also be used in ASIC technology as they do not require the reconfigurable properties of the FPGA. The VEX instruction set used in the processor is created by HP. Its target application domain is media applications (image/video). The core is design-time reconfigurable through VHDL generics in a number of properties: Issue width, functional unit layout, cache sizes, etc. Run-time reconfigurability can also be disabled to save chip area. The run-time reconfigurable core can change its issue width dynamically by splitting the core into smaller subcores that have their own execution context. This way, the processor aims to run a wide range of workload types in the most efficient way possible. Single threaded applications with a high level of parallelism in the code can run on a large high-performance core, and multi-threaded applications can run in parallel on smaller cores.
The core supports a number of execution contexts that can be assigned to pipeline pairs (lanepairs). Both properties are design-time reconfigurable. When a single context is assigned to all lanes, the core runs in the widest configuration. By assigning different contexts to the lanepairs, they will each run a separate program (see Fig. 7). The ρ-VEX uses memory-mapped registers to control its configuration. Supervisor software can load a set of programs into the execution contexts, and use the configuration control registers to select which programs are being executed and in which configuration. The contexts contain all information needed to resume execution when being suspended (for example, when all the lanes are assigned to another context). This includes all registers, program counter, trap/interrupt state etc.
The ρ-VEX platform consists of the synthesizable VHDL designs of the core, cache and a number of peripherals. On the software side, there is an interface tool that can connect to the core and provides the user with extensive debugging capabilities and full control of the processor. The implementation of support for AlmaIF is discussed in Section 4.2.2. There are a number of different compilers that can target the ρ-VEX, there is a port of binutils and GDB, there is basic Linux support (uCLinux with a NOMMU kernel), runtime libraries (the uCLibc C standard library and newlib) and an architectural simulator.
Design-Time Customization
The ρ-VEX processor is both run-time and design-time configurable. When creating a general-purpose platform that must be able to provide high performance for single threads and high throughput for multiple threads, the full dynamic 8-issue core can be used. However, the area utilization of this configuration is high. If the workload is highly parallelizable, a static 2 or 4-issue core configuration can be used. The area utilization is considerably lower, and single-thread performance can be sacrificed because the workload will rely on multithreading to achieve high performance. The reduction in area allows more core to be placed on the chip, resulting in higher total throughput. In addition to design-time configuration of the processors, multiple ρ-VEX cores can be connected to each other using local scratchpad memories, preventing the need to access global memory.
AlmaIF Integration with pocl on ρ-VEX
As an example of how the AlmaIF interface may be implemented in hardware for an existing accelerator, we report the implementation for the ρ-VEX processor in more detail.
The ρ-VEX processor is a VLIW processor, with configurable amount of virtual processors, known as contexts. A control unit known as the reconfiguration controller manages the division of the available computational resources amongst these contexts. All contexts share a common instruction memory, and a common data memory. To allow the processor to be controlled externally, for debugging or otherwise, a number of control registers are available through a debug port.
This was mapped to AlmaIF as follows. First, an additional access port to the instruction and data memories was added to allow the host processor to access these through the respective AlmaIF memory regions. As the ρ-VEX has only one data address space, the AlmaIF parameter memory region was simply mapped to the ρ-VEX data memory as well, making the size of the ρ-VEX data memory the sum of the AlmaIF data and parameter memory sizes.
The ρ-VEX debug port serves a similar purpose as the AlmaIF control region. However, the ρ-VEX and AlmaIF registers do not map one-to-one. Therefore, a translation unit was constructed that emulates the AlmaIF registers. In order to still allow the ρ-VEX debug port to be used natively, it was also mapped to AlmaIF directly, using a memory region not specified by AlmaIF.
Each ρ-VEX context has its own set of debug registers. Therefore, each ρ-VEX context is represented as an AlmaIF core.