CMOS Processors and Memories, pp. 97–138
CMOL/CMOS Implementations of Bayesian Inference Engine: Digital and Mixed-Signal Architectures and Performance/Price – A Hardware Design Space Exploration
Abstract
In this chapter, we focus on aspects of the hardware implementation of the Bayesian inference framework within the George and Hawkins’ computational model of the visual cortex. This framework is based on Judea Pearl’s Belief Propagation. We then present a “hardware design space exploration” methodology for implementing and analyzing the (digital and mixed-signal) hardware for the Bayesian (polytree) inference framework. This particular methodology involves: analyzing the computational/operational cost and the related micro-architecture, exploring candidate hardware components, proposing various custom architectures using both traditional CMOS and hybrid nanotechnology CMOL, and investigating the baseline performance/price of these hardware architectures. The results suggest that hybrid nanotechnology is a promising candidate to implement Bayesian inference. Such implementations utilize the very high density storage/computation benefits of these new nano-scale technologies much more efficiently; for example, the throughput per 858 mm2 (TPM) obtained for CMOL based architectures is 32–40 times better than the TPM for a CMOS based multiprocessor/multi-FPGA system, and almost 2000 times better than the TPM for a single PC implementation. The assessment of such hypothetical hardware architectures provides a baseline for large-scale implementations of Bayesian inference, and in general, will help guide research trends in intelligent computing (including neuro/cognitive Bayesian systems), and the use of radical new device and circuit technology in these systems.
Keywords
Bayesian inference, Pearl's belief propagation, cortex, CMOS, CMOL, nanotechnology, nanogrid, digital, mixed-signal, hardware, nanoarchitectures, methodology, performance, price

4.1 Introduction
The semiconductor/VLSI industry has followed Moore's law for the past several decades and has made tremendous progress, but as it approaches the lower nano-scale regime, it faces several challenges, including power density, interconnect reverse scaling, device defects and variability, memory bandwidth limitations, performance overkill^{1}, density overkill^{2}, and increasing design complexity [1, 2, 3, 4, 5, 6].
The two most important aspects of technology scaling are speed and density. The transistor density available with modern state of the art μ-processors has almost reached one billion per chip [7]. Transistor switching delays are only a few picoseconds, and these processors are now moving to multiple core architectures, which continue to improve performance [7]. Emerging nanotechnologies will further increase the densities by two to three orders of magnitude [5]. However, it has become increasingly difficult to efficiently use all these transistors [6]. In addition, volume markets, for the most part, are no longer performance driven [6]. If we exclude general-purpose processors/architectures, an important challenge then is: where else can we efficiently utilize silicon technology and its capabilities: speed, density, and multicore concepts?
One potential “architectural solution” to some of the above challenges includes hardware architectures for applications/models in the neuro-inspired and intelligent computing domain [8]. Most of these applications are inherently compute and memory intensive, and also massively parallel; consequently, it is possible that such applications would be able to fully utilize nano-scale silicon technology much more efficiently.
In recent years, nanoelectronics has made tremendous progress, but Borkar (from Intel^{®}) [4] indicates that, as yet, there is no emerging nanoelectronic candidate with the potential for replacing, within the next 10–15 years, Complementary Metal-Oxide-Semiconductor (CMOS) as it is being used today in state of the art μ-processors. However, recent research has shown that certain kinds of nano-scale electronics, when used in hybrid configurations with existing CMOS, could be useful in creating new computing structures, as well as high density memory, and hybrid and neuromorphic architectures [3, 5, 6, 9, 10, 11, 12, 13]. One of the most promising of these hybrid technologies is CMOL, developed by Likharev [11], which is used in this work.
Another challenge is that of intelligent computing [6]. The term "Intelligent Computing" encompasses all the difficult problems that require the computer to find complex structures and relationships through space and time in massive quantities of low precision, ambiguous, noisy data [2]. The term Intelligent Signal Processing (ISP) has been used to describe algorithms and techniques that involve the creation, efficient representation, and effective utilization of large complex models of semantic and syntactic relationships [2, 6]. ISP augments and enhances existing Digital Signal Processing (DSP) by incorporating such contextual and higher level knowledge of the application domain into the data transformation process [2, 6]. Several areas in intelligent computing, such as AI, machine learning, and fuzzy logic, have made significant progress; however, "general-purpose artificial intelligence has remained elusive" [14], and we still do not have robust solutions that even remotely approach the capabilities of biological/cortical systems [6]. Consequently, a number of researchers are returning to the neuro and cognitive sciences to search for new kinds of Biologically Inspired Computational Models (BICMs) and "computing paradigms" [2, 15]. Recently, the neuroscience community has begun developing scalable Bayesian computational models (based on the Bayesian inference framework [16, 17, 18]) inspired by the "systems" level structural and functional properties of the cortex, including the feedforward and feedback interactions observed in the "subsystem" layers of the visual cortex, and integrating them with some higher-level cognitive phenomena [17, 19, 20]. These Bayesian BICMs have the potential of being applied to large-scale applications such as speech recognition, computer vision, image content recognition, robotic control, and making sense of massive quantities of data [6, 21].
Some of these new algorithms are ideal candidates for large-scale hardware investigation (and future implementation), especially if they can leverage the high density processing/storage advantages of hybrid nanoelectronics [3, 5].
The effective design and use of these nanoelectronic structures requires an application-driven total systems solution [5, 6]. Consequently, in this chapter we propose a new family of application-specific hardware architectures, which implement Bayesian (polytree) inference algorithms (inspired from computational neuroscience), using traditional CMOS and hybrid CMOS + nanogrid (CMOL^{3}) technologies.
We began this work by choosing the George and Hawkins' Model (GHM) [16] of the visual cortex as a useful family of Bayesian BICMs. Recently, George and Hawkins gave a prognosis about the hardware platforms for their model, and indicated that learning algorithms will initially be implemented in parallel software, partly because they are still evolving, while the inference framework could be implemented in embedded hardware [22]. One of the biggest problems with Bayesian inference in general is its compute intensiveness (it has been shown to be NP-Hard [6]).
Consequently, in this chapter, we exclusively focus on the Bayesian inference functionality of the GHM, and then develop a set of baseline hardware architectures and their relative performance/price measures using both traditional CMOS and a hypothetical CMOL technology. In the larger perspective, we perform a “hardware design space exploration” (in a limited subset of the space), which is a first step in studying radical new computing models and implementation technologies. The architecture evaluation methodology used here (and elsewhere [3, 23]) has the potential of being applied to a broader range of models and implementation technologies.
This chapter is organized as follows: Section 4.2 provides a brief introduction to the more generic concept of the hardware virtualization spectrum, and presents the methodology of "architecting" the hardware [24]. Section 4.3 describes the Bayesian Memory (BM) module and its Bayesian inference framework. Section 4.4 defines various hardware architectures that support the BM. Sections 4.5 and 4.6 discuss CMOS/CMOL digital and mixed-signal hardware architectures for the BM. Finally, we discuss the hardware performance/price results in Section 4.7.
4.2 Hardware for Computational Models
4.2.1 Hardware Virtualization Spectrum
When creating a baseline architecture for implementing any computational model or algorithm (from domains such as neural networks, intelligent computing, etc.), the one single decision that has the greatest impact on the performance/price of the implementation is the degree of "virtualization" of the computation. This term, incidentally, should not be confused with virtual machines. We use virtualization to mean "the degree of time-multiplexing of the 'components' of computation via hardware resources" [3]. Since some of the computational models consist of fine grained networks, very fine-grained parallel implementation is generally possible, but varying the degree of virtualization allows us to make a wider range of trade-offs for a more efficient implementation.
The bottom-right corner of the virtualization spectrum represents a design in which the algorithm is fully hardwired into the silicon (for example, analog architectures in region 8); such a design achieves maximum parallelism, but has little or no virtualization, and is the least flexible [3]. Such designs are generally very fast and compact [25]. However, it is generally difficult to multiplex analog components [25, 26], so as we move to the left, towards more virtualization, digital design tends to dominate. Even in the analog domain, analog communication is not feasible, so digital multiplexed communication is used [25, 26, 27, 28]. For some computational models, depending on their dynamics, such designs could be an inefficient implementation [3].
Previous work (other than GHM) has shown that, with possible future hardware technologies, semi-virtualized hardware designs (regions 5–7) will scale well, because they allow large-scale integration and component level multiplexing, and are suitable for multiplexed communication schemes [3, 10, 23].
For several application domains, designs from region 5–7, have not generally been explored, and hence, warrant research and investigation.
4.2.2 Existing Hardware Implementations of George and Hawkins’ Model
The work by George and Hawkins [16, 29] (and the related work at Numenta Inc.) mostly falls into region 1, and only slightly into region 3. The recent work by Taha [21, 30, 31] explores combined implementations in regions 3 and 4 for the George and Hawkins Model (GHM) [16]. At this time, we are not aware of any work on custom hardware implementation of the GHM [16]. Hence, we can conclude that most of the work on the hardware implementation of the GHM [16] is almost exclusively concentrated in regions 1, 3, and 4; custom hardware implementations in regions 5–7 are still unexplored and present an opportunity for a different kind of hardware/computer architecture research. Consequently, the work presented here explores application-specific hardware designs/architectures^{4} for the Bayesian inference framework (i.e., Pearl's Belief Propagation Algorithm for a polytree) within the GHM [16], in regions 5–7, and investigates the hardware performance/price of these architectures.
Note: The analog community has not yet explored Region 8 for GHM, but in the future, with appropriate research, interesting fully analog designs are no doubt possible. It is not in the scope of this chapter to cover fully analog designs, because the focus of this work is to explore designs which can transition to, and utilize CMOL nanogrid structures. In addition, we do not claim that the designs proposed in this work are better than potential fully analog designs.
It should be realized that there are no universal benchmarks or baseline configurations available to allow a comparison of the hardware for the GHM or similar models [3]. In addition, factors that vary across models, such as data precision, vector/matrix sizes, algorithmic configurations, and hardware/circuit options, make such comparisons even more difficult. Because the baseline configuration for any hardware takes into account conservative assumptions and worst-case design considerations, the design tends to be more virtualized and perhaps less optimized; hence, the price (area and power) and performance (speed) are at the lower end, or at the baseline. Consequently, in the future, more sophisticated designs would probably have better performance, but at a higher price.
4.2.3 Hardware Design Space Exploration: An Architecture Assessment Methodology
4.3 A Bayesian Memory (BM) Module
This work is based on the George and Hawkins Model (GHM) [16]. The GHM has several desirable characteristics: the underlying fundamental concepts have a high degree of "interpretability" [32]; it incorporates the perception of time; and it has a basis in some proposed cognitive mechanisms [33, 34]. It has a modular and hierarchical structure, which allows shared and distributed storage, and is scalable [32, 33]. It also has a simpler non-"loopy" probabilistic framework compared to [18]. All these characteristics potentially make the GHM [16] the simplest and most generic Bayesian model currently available in the neuroscience community. For all these reasons, it is a good candidate Bayesian model for investigating the related baseline hardware.
Pearl’s BPA [36] for a polytree is summarized by Eqs. 4.1–4.6, which use a matrix/vector notation. As shown in Fig. 4.3b, the particular module of interest is called BM-y. BM-y has n_{c} child modules, each denoted by BM-x_{i}, where i = 1, …, n_{c}. BM-y has a single parent module denoted by BM-z. BM-y exchanges probabilistic messages/vectors with its parent and children. In 4.1, notation (x_{i},y) denotes that a message goes from BM-x_{i} to BM-y, and notation (y) denotes an internal message. The same style applies for 4.2–4.6. Equations 4.5 and 4.6 are equivalent, but 4.6 is easier to implement. Notation ⊗ denotes a ∏ operation, but for only two vectors. If there are n_{c} child BMs, then 4.6 has to be evaluated n_{c} times, corresponding to each child BM-x_{i}.
The following discussion briefly summarizes the significance of the variables and operations in the equations of the BPA.
–λ(x_{i},y) is the input diagnostic message^{6} coming from child BM-x_{i}, λ(y) is the internal diagnostic message of BM-y, and λ(y,z) is the output diagnostic message going to the parent BM-z.
–π(z,y) is the input causal message coming from parent BM-z, π(y) is the internal causal message of BM-y, and π(y,x_{i}) is the output causal message going to the child BM-x_{i}.
–λ(x_{i},y) represents the belief^{7} that BM-x_{i} has in the CB of BM-y. λ(y,z) represents the belief that BM-y has in the CB of parent BM-z. π(z,y) represents the belief that parent BM-z has in its own CB, while ignoring the diagnostic message λ(y,z). π(y,x_{i}) represents the belief that BM-y has in its own CB, while ignoring the diagnostic message λ(x_{i},y).
–M acts as a bidirectional probabilistic associative memory, which stores the probabilistic relationships between the CB of BM-y and BM-z. M is also called the Conditional Probability Table (CPT). B(y) represents the final belief that BM-y has in its own CB, considering all possible diagnostic and causal messages.
– In 4.1, λ(y) is formed by combining all the input diagnostic messages using the ∏ operation. In 4.2, λ(y,z) is formed using a [matrix × vector] operation that converts the internal diagnostic message into the equivalent output diagnostic message. In 4.3, π(y) is formed using a [vector × matrix] operation that converts the input causal message to the equivalent internal causal message. In 4.4, B(y) is formed by combining internal diagnostic and causal messages using ⊗ operation. In 4.6, π(y,x_{i}) is formed by combining internal causal messages and all input diagnostic messages, except λ(x_{i},y), using a ∏ operation.
In short, the BPA can be summarized as: each BM in a hierarchical system (like the one shown in Fig. 4.3c) first updates its internal belief using the incoming belief messages from its parent and children, and later, sends the updated version of its belief back to its parent and children.
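The update cycle summarized above can be sketched as a minimal functional model in NumPy. This is an illustration of Eqs. 4.1–4.4 and 4.6 only (not of the hardware), assuming M is stored as an [n_CBz × n_CBy] array and all messages are 1-D NumPy vectors:

```python
import numpy as np

def bm_update(lam_in, pi_zy, M):
    """One Belief Propagation update for a Bayesian Memory (BM) module.

    lam_in : list of n_c diagnostic vectors lambda(x_i, y) from the children
    pi_zy  : causal vector pi(z, y) from the parent
    M      : conditional probability table (CPT), shape [n_CBz, n_CBy]
    """
    lam_y = np.prod(lam_in, axis=0)          # Eq. 4.1: POV over all children
    lam_yz = M @ lam_y                       # Eq. 4.2: MCV (matrix x column-vector)
    pi_y = pi_zy @ M                         # Eq. 4.3: RVM (row-vector x matrix)
    B = lam_y * pi_y                         # Eq. 4.4: POV ...
    B = B / B.sum()                          # ... followed by SNL (sum-normalization)
    pi_out = []
    for i in range(len(lam_in)):             # Eq. 4.6, evaluated n_c times
        others = [l for j, l in enumerate(lam_in) if j != i]
        p = pi_y * np.prod(others, axis=0)   # POV excluding lambda(x_i, y)
        pi_out.append(p / p.sum())           # SNL
    return lam_yz, B, pi_out
```

The returned lam_yz is the output diagnostic message to the parent, B is the module's belief in its own CB, and pi_out holds the n_c output causal messages, one per child.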
4.4 Hardware Architectures for Bayesian Memory
4.4.1 Definition of Hardware Architectures for BM
All of the architectures studied here implement Eqs. 4.1–4.4, and 4.6 of the BPA. These equations can be categorized based on the type of matrix/vector operations: "Vector-Matrix Multiplication" (VMM) [37], "Product Of Vectors" (POV, a ∏ operation), and "Sum-Normalization" (SNL). VMM has two subtypes: (Matrix × Column-Vector) (MCV) and (Row-Vector × Matrix) (RVM). Finally, Eqs. 4.1–4.4, and 4.6 correspond to the operations POV, MCV, RVM, (POV & SNL), and (POV & SNL), respectively. Implementing these matrix/vector operations in hardware requires several different circuit components. Hence, the exact definition of any architecture depends on the particular circuit/hardware components that it uses to implement the equations of the BPA.
Definition of hardware architectures for Bayesian memory (BM)
| Component | Technology | Role | Eq. 4.1 | Eq. 4.3 | Eq. 4.4 | Eq. 4.6 | Eq. 4.2 |
|---|---|---|---|---|---|---|---|
| Memory (MEM) | SRAM | S | ♣♥ | ♣♥ | ♣♥ | ♣♥ | ♣♥ |
| Memory (MEM) | CMOL MEM | S | ♦♠ | ♦♠ | ♦♠ | ♦♠ | ♦♠ |
| Conventional arithmetic/logic components | Digital CMOS | O | ♣♦♥♠ | ♣♦♥♠ | ♣♦♥♠ | ♣♦♥♠ | ♣♦♥♠ |
| Conventional arithmetic/logic components | Mixed-signal CMOS | O | | ♥♠ | | | ♥♠ |
| Structure based components (storage & computation) | Mixed-signal CMOS | S, O | | ♥ | | | ♥ |
| Structure based components (storage & computation) | Mixed-signal CMOL nanogrid | S, O | | ♠ | | | ♠ |
| MEM access based communication | SRAM | C | ♣♥ | ♣♥ | ♣♥ | ♣♥ | ♣♥ |
| MEM access based communication | CMOL MEM | C | ♦♠ | ♦♠ | ♦♠ | ♦♠ | ♦♠ |

(♣ = digital CMOS, ♦ = digital CMOL, ♥ = mixed-signal CMOS, ♠ = mixed-signal CMOL architecture; S = storage, O = operation, C = communication)
The mixed-signal architectures are based on the “Internally Analog, Externally Digital” (IAED) Structure for VMM (SVMM), proposed in [37]. This is a mixed-signal array based SVMM, which combines storage, signal-communication, and computation; storage and input/output are digital, and the analog computations occur along the wires as small increments in charge/current [3, 37]. The difference between the digital and mixed-signal architectures is that Eqs. 4.2 and 4.3 are partially implemented using mixed-signal SVMM or structure based components, because these equations are computationally the most intensive; the implementation of the remaining equations is the same as in digital architectures. The mixed-signal architectures also require conventional digital CMOS arithmetic/logic components and conventional mixed-signal CMOS components for post processing in 4.2 and 4.3.
When selecting a mixed-signal architecture, the single most important criterion was to select a mixed-signal CMOS design that could later be easily mapped to CMOL nanogrid structures; the SVMM proposed in [37] was such a match, and in addition, it is close to the future realm of "digitally assisted analog circuits" [38]. Consequently, our mixed-signal architecture judiciously replaces some of the important operations with analog/mixed-signal components, according to the CMOS/CMOL design considerations suggested in [3].
The storage in all architectures is digital/binary-encoded. Consequently, it is easier to implement a memory access (read & write) based communication for all architectures.
4.4.2 General Issues
The general issues and assumptions related to our analysis of the hardware implementation of the BM are now briefly discussed.
Precision/Bits
The data precision (n_{bit} = bits representing a single element) required to implement any computational model is dependent on the algorithm, the structural configurations, the application domain, etc. Studies [39, 40], suggest that neuromorphic hardware generally requires a precision of 4–8 bits, but Bayesian methods, which are higher level, more abstract models, when applied to real-world problems generally require 8–16 bits for sufficient accuracy [21, 41]. In this work we have varied the precision over a range of values to understand its contribution to the overall performance/price trade-off. The precisions used here were: 4–32 bit FXP, 32 bit FLP, and 32 bit LNS. All architectures, even mixed-signal, store/communicate data in digital/binary-encoded format.
Communication
Communication is generally sequential and often involves memory access. Internal communication (for internal variables) is as simple as accessing an internal memory. External communication (between parent/child) requires inter-module buses with data, address, and control signals. Depending on the particular variable being read by BM-y, some of the communication is assumed to be virtualized. To evaluate 4.1, BM-y needs to read λ(x_{i},y) from each of its child BM-x_{i}; for this variable, one communication bus is multiplexed across all n_{c} child BMs. To evaluate 4.3, BM-y needs to read π(z,y) from its parent BM-z; for this variable a separate communication bus is assumed. It should be noted that λ(x_{i},y) is required to evaluate 4.1 and 4.6; also, 4.6 is evaluated n_{c} times. Hence, each λ(x_{i},y) value is used in multiple equations. So instead of reading from the child BM-x_{i} multiple times through the virtualized communication bus, we assume that it is read once to evaluate 4.1, and a local copy^{9} is stored in BM-y that can be used later for 4.6. In fact, we assume this for all variables that are read from parent or child BMs.
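The local-copy policy described above can be sketched as a simple read-through cache. In this sketch, `read_bus` is a hypothetical callback standing in for one transaction on the shared (virtualized) inter-module bus; the class names are our own, not from the chapter:

```python
class BMComm:
    """Local-copy caching of child messages in BM-y: each lambda(x_i, y)
    is read once over the shared bus (for Eq. 4.1) and the local copy is
    reused for the n_c evaluations of Eq. 4.6."""

    def __init__(self, read_bus):
        self.read_bus = read_bus   # hypothetical bus-read callback: i -> vector
        self.cache = {}            # local copies indexed by child i

    def read_child(self, i):
        if i not in self.cache:    # only one bus transaction per child
            self.cache[i] = self.read_bus(i)
        return self.cache[i]
```

A later read of the same child (e.g. while evaluating Eq. 4.6) is then served from local memory instead of re-arbitrating the multiplexed bus.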
Number of Parent and Child BMs, and Code Book (CB) Size
Specifications of a Bayesian memory (BM) module
| Variable | Size |
|---|---|
| n_{bit} | 4–32 bit FXP, 32 bit FLP, and 32 bit LNS |
| n_{c} | 4 |
| n_{CBy} | 4,096 |
| n_{CBz} | 4,096 |
| λ(x_{i},y) | [4,096 × 1] |
| λ(y) | [4,096 × 1] |
| λ(y,z) | [4,096 × 1] |
| π(y,x_{i}) | [1 × 4,096] |
| π(z,y) | [1 × 4,096] |
| π(y) | [1 × 4,096] |
| B(y) | [1 × 4,096] |
| M | [4,096 × 4,096] |
Virtualization
In general, we have the following two ways to virtualize the arithmetic/logic components: virtualize (share resources) these components for similar operations over several BMs, or virtualize different operations/equations within each BM. We chose the latter because it keeps the hardware analysis restricted to one BM, and is more suitable for systems with distributed memory. Most of the designs considered here tend to be more virtualized, which provides a more favorable performance/price. However, as we shall see later, the cost of most BM architectures (except the mixed-signal CMOL architecture) is dominated by memory. Consequently, extreme virtualization of the arithmetic/logic computations may not be a significant advantage; but for the mixed-signal CMOL architecture, virtualization was the only way to make the architecture feasible and scalable.
Hybrid Nanotechnology – CMOL
The nanotechnology components used in this chapter are based on CMOL, which is a hybrid CMOS + nanogrid architecture developed by Likharev [11]. CMOL is mostly used here as digital memory and as an application specific, mixed-signal computational structure for the VMM operation. We use the CMOL/nanogrid modeling described in [3], and follow a similar strategy for the performance/price analysis of the CMOL components. Some of the assumptions for the CMOL analysis are:
–For CMOL memory, read and write can be performed using a common interface/path (row and column decoders, etc.). Only for the variables communicated between parent/child BMs, will we need a simultaneous read and write capability. According to [42], all these assumptions are plausible.
–The area analysis does not consider the CMOS to nanogrid routing (similar to [3]), i.e., the CMOS layer and nanogrid layer are analyzed separately. The total area is taken as the sum of CMOS and nanogrid layers; which is a worst case consideration.
–Our analysis uses a future value of F_{nano} = 3 nm (projected in 2028 [43]). CMOL technology should be mature by then, hence plausible defect rates would be around 2% to 5% [42]; and a recent roadmap suggests defect rates as low as 0.03% [43]. So in the worst case, the hardware overhead for defect-tolerance through redundancy/reconfiguration would be around 5% to 10% of the results shown in this chapter.
4.5 Digital CMOS and CMOL Hardware Architectures for Bayesian Memory (BM)
The all digital architectures considered here assume traditional CMOS arithmetic/logic components. The only difference between digital CMOS and digital CMOL architecture is the storage medium. The digital CMOS architectures use SRAM, while digital CMOL architectures use CMOL memory [3].
4.5.1 Floating-Point (FLP) Architecture
Computations for a Bayesian memory (BM) module
| Equation | Matrix/vector operation | FLP multiplication | FLP addition | FLP division | Memory read | Memory write | FPU used |
|---|---|---|---|---|---|---|---|
| 4.1 | POV | n_{c}⋅n_{CBy} | 0 | 0 | n_{c}⋅n_{CBy} | n_{CBy} | FPU1 |
| 4.2 | MCV | n_{CBz}⋅n_{CBy} | n_{CBz}⋅n_{CBy} | 0 | n_{CBz}⋅n_{CBy} | n_{CBz} | FPU1 |
| 4.3 | RVM | n_{CBy}⋅n_{CBz} | n_{CBy}⋅n_{CBz} | 0 | n_{CBy}⋅n_{CBz} | n_{CBy} | FPU2 |
| 4.4 | POV | 2⋅n_{CBy} | 0 | 0 | 2⋅n_{CBy} | n_{CBy} | FPU1 |
| 4.4 | SNL | n_{CBy} | n_{CBy} | 1 | 2⋅n_{CBy} | n_{CBy} | FPU1 |
| 4.6 | POV^{a} | n_{c}(n_{c}⋅n_{CBy}) | 0 | 0 | n_{c}(n_{c}⋅n_{CBy}) | n_{c}(n_{CBy}) | FPU2 |
| 4.6 | SNL^{a} | n_{c}(n_{CBy}) | n_{c}(n_{CBy}) | n_{c}(1) | n_{c}(2⋅n_{CBy}) | n_{c}(n_{CBy}) | FPU2 |
Even though Table 4.3 specifically shows the FLP operations, it still provides a general overview of the computations within the BM. From Table 4.3, we can infer that 4.2 and 4.3 have a computational complexity of O(N_{1}N_{2}), where N_{1} ∝ n_{CBy} and N_{2} ∝ n_{CBz}; while 4.1, 4.4 and 4.6 have a computational complexity of O(N), where N ∝ n_{CBy} and n_{c} is negligible relative to n_{CBy}.
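As a sanity check on this complexity argument, the multiplication counts from Table 4.3 can be tabulated directly. The helper below is our own illustration (the 4.4 and 4.6 rows lump their POV and SNL multiplies together):

```python
def flp_mul_counts(n_c, n_cby, n_cbz):
    """FLP multiplication count per equation, following Table 4.3."""
    return {
        "4.1": n_c * n_cby,                         # POV: O(N)
        "4.2": n_cbz * n_cby,                       # MCV: O(N1*N2)
        "4.3": n_cby * n_cbz,                       # RVM: O(N1*N2)
        "4.4": 2 * n_cby + n_cby,                   # POV + SNL: O(N)
        "4.6": n_c * (n_c * n_cby) + n_c * n_cby,   # POV + SNL, n_c times: O(N)
    }

# With the Table 4.2 sizes (n_c = 4, n_CBy = n_CBz = 4,096):
counts = flp_mul_counts(4, 4096, 4096)
```

With these sizes, Eqs. 4.2 and 4.3 each require 4,096 × 4,096 ≈ 16.8 million multiplications, roughly 200× more than Eq. 4.6 and three orders of magnitude more than Eq. 4.1, confirming that the VMM operations dominate the computational cost.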
4.5.2 Logarithmic Number System (LNS) Architecture
The LNS architecture is based on a 32 bit single precision LNS data format [46]. The LNS architecture is similar to that shown in Fig. 4.6, except that all FPUs are now replaced by Logarithmic Number system arithmetic Units (LNUs). The earlier discussion on FLP architecture also applies here. Table 4.3 can be used by replacing the FLP operations with LNS operations. Both the FLP and LNS architectures use a 32 bit data format; hence, the datapath structure remains the same, only the requirements with respect to arithmetic operations differ.
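A toy functional model of LNS arithmetic clarifies why an LNU differs from an FPU: multiplication and division reduce to addition and subtraction of the log-domain representations, while addition requires a nonlinear correction term (which real LNUs approximate with lookup tables and interpolation [46]). The sketch below uses floating-point math purely to demonstrate the identities, not the LNU circuit:

```python
import math

def lns_encode(x):
    """Represent a positive number by its base-2 logarithm."""
    return math.log2(x)

def lns_mul(a, b):
    """Multiplication is plain addition in the log domain."""
    return a + b

def lns_add(a, b):
    """Addition needs the correction term log2(1 + 2^(lo - hi)),
    which hardware LNUs approximate via table lookup."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))

x, y = lns_encode(3.0), lns_encode(4.0)
product = 2.0 ** lns_mul(x, y)   # decodes to 12.0
total = 2.0 ** lns_add(x, y)     # decodes to 7.0
```

Since the BPA is dominated by the multiplications of the VMM and POV operations (Table 4.3), trading cheap LNS multiplies for more expensive LNS adds is the central performance/price question for this architecture.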
4.5.3 Fixed-Point (FXP) Architecture
The FXP architecture has to consider the bit size or precision of the intermediate results of all arithmetic operations. For example, multiplying two n bit numbers produces a 2n bit result, and adding m numbers, each of size n bits, can grow the result to n + ceil(log_{2}(m)) bits. If these results are simply truncated back to n bits, then error/noise will enter the system [47]. In order to maintain accuracy with the FXP architecture (for comparison with the FLP and LNS architectures), and for the purposes of the worst case hardware analysis being done here, no intermediate results are truncated; this is a conservative assumption, since real implementations can likely get by with significant precision reduction even with fixed-point representations. Hence, to maintain intermediate results, we require digital components of different bit sizes, depending on the number and type of the operations. In addition, to maintain accuracy and reasonable functional equivalence to the FLP version, the FXP implementations should utilize the full dynamic range of the data [39] as much as possible; one way to accomplish this is range normalization, which we have therefore added to the FXP implementations. Range normalization typically has a hardware/operation overhead consisting of a digital adder/comparator, a multiplier and a divider, as shown in Fig. 4.7 (b type blocks).
In Fig. 4.7, Block B1a computes an intermediate version λ_{tm}(y,z), which is then range normalized by block B1b to obtain λ(y,z). Similarly blocks B2b to B5b, which are attached to the corresponding blocks B2a to B5a, perform range normalization. In general, we cannot^{10} virtualize a single range normalization block across all other blocks, because the sizes of the digital components in the range normalization blocks are dependent on the size of the intermediate results. In blocks B1b to B5b, cl, ch, dh denote current minimum, current maximum, and desired maximum respectively. These correspond to the minimum and maximum values/elements in a vector that is to be range normalized. As in the FLP architecture, blocks B5a and B5b are virtualized for computing variables π(y,x_{1}) to π(y,x_{n_c}). The overhead for the range normalization pushes the FXP architecture analysis towards an extreme worst case; i.e. most other simpler techniques for rounding/truncation will have a smaller overhead.
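The bit-growth rules and the role of the range-normalization blocks can be sketched as follows. Note that the text above does not spell out the exact normalization formula used by blocks B1b–B5b, so the mapping below, (v − cl)·dh/(ch − cl), is one plausible form consistent with the adder/comparator, multiplier, and divider hardware:

```python
import math

def mul_bits(n):
    """Two n-bit operands produce a 2n-bit product."""
    return 2 * n

def sum_bits(n, m):
    """Summing m n-bit terms can grow to n + ceil(log2(m)) bits."""
    return n + math.ceil(math.log2(m))

def range_normalize(v, dh):
    """Map vector v onto [0, dh] using its current min (cl) and max (ch);
    an assumed form of the b-type blocks, using integer division."""
    cl, ch = min(v), max(v)
    return [(e - cl) * dh // (ch - cl) for e in v]
```

For the FXP8 case, for instance, an 8-bit multiply yields a 16-bit product, and accumulating 4,096 such products (one MCV column) grows the result to 28 bits, which is why the b-type blocks cannot share one physical size.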
4.6 Mixed-Signal (MS) CMOS and CMOL Hardware Architectures for Bayesian Memory (BM)
Cauwenberghs [37] proposed an “Internally Analog, Externally Digital” (IAED) mixed-signal array based structure for parallel VMM, for massive (100–10,000) matrix dimensions. The IAED-Structure for VMM (SVMM) effectively combines storage and analog computation, and is better than other analog-only techniques for the VMM computation [37]. Internally, the storage is digital. The array structure allows inherent analog processing [37]. Externally, it provides the convenience and precision of pre/post digital processing [37]. Hence, we use the SVMM as the fundamental operation implemented in our mixed-signal architectures. Our mixed-signal CMOS architecture partly virtualizes SVMM, and our mixed-signal CMOL architecture replaces some of the traditional components of the SVMM with nano components, providing a novel, mixed-signal CMOL nanogrid based SVMM.
The formal equations behind the SVMM operation are presented in more detail in [37], and so are not repeated here. Instead, the operation of these "semi-virtualized" mixed-signal CMOS and CMOL SVMM circuits is presented via a simple example.
4.6.1 Mixed-Signal CMOS Architecture
Consider a VMM operation where q = p = 3 and n_{bitx} = 3. The computation required for the first element Y_{1} is given in 4.9, along with some example numbers for X_{i} and M_{i,1}. Each number is written in conventional binary notation (the leftmost bit is the MSB and the rightmost bit the LSB), for example X_{1} = 011 and M_{3,1} = 110. Figure 4.8a shows a decomposition of the VMM operation into its constituent sub-operations. To illustrate the SVMM function, we first concentrate on step-1 (a, b, and c). In Fig. 4.8a, step-1a shows the partial products resulting from the multiplication of the first bit (LSB) of each X_{i} with the corresponding bits of each M_{i,1}. The equivalent operation in the mixed-signal CMOS SVMM is shown in Fig. 4.8b. The first bit of each X_{i}, i.e. X_{i}^{0} (denoted by X_{i}^{b}, where i = element index, b = bit index, and b = 0 to n_{bitx} − 1), is presented from the left along the horizontal wires. Each cell (denoted by M_{i,1}^{b}) in the analog cell array in Fig. 4.8b is a "CID computational cell with integrated DRAM" [37]. The analog cell array corresponds to the first column of M in 4.7. Each cell can store one binary value, and can compute one binary multiplication. In the compute mode, it contributes some charge/voltage to the vertical wire if its internal (stored) value is 1 and the external input is 1. For example, the cell M_{1,1}^{0} has internal value 1, and its external input X_{1}^{0} is also 1; hence it contributes some charge/voltage (shown by a vertical arrow) to the first vertical wire. Those cells that have an internal value of 0 do not contribute charge/voltage to the vertical wires. The analog cell array in Fig. 4.8b performs the equivalent of step-1a and the initial part of step-1b in Fig. 4.8a.
The total charge/voltage (shown by multiple arrows) on each vertical wire is converted to digital outputs by the ADCs, and these outputs correspond to the partial sums shown in step-1b. The Shift Registers (SRs) and adders following the ADCs then complete step-1b. Step-1c is accomplished by the remaining SR and adder. This concludes step-1 or a single pass through the SVMM. In the next iteration we present the second bit of each X_{i}, i.e.\( {X}_{i}^{1}\), on the horizontal wires of the SVMM, to accomplish step-2 (a, b, and c). Step-3 is completed in a similar manner. In summary, to obtain (one element) Y_{j} we have to iterate n_{bitx} times through the SVMM.
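The bit-serial pass structure described above can be sketched in software. The snippet below is a minimal functional model, not the circuit itself: the analog cell array (or nanogrid) is modeled as stored bit-planes of one column of M, the per-column charge sums play the role of the ADC outputs, and the shift-and-add stages are explicit. The example values for X and the column of M are hypothetical (the chapter's actual example data is in 4.9/Fig. 4.8).

```python
# Bit-serial "semi-virtualized" VMM sketch (q = p = 3, n_bitx = 3).
# Hypothetical example values; only the pass structure follows the text.
n_bitx = 3
X = [3, 5, 2]        # X_1 = 011, X_2 = 101, X_3 = 010
M_col = [5, 1, 6]    # first column of M: M_11 = 101, M_21 = 001, M_31 = 110

def svmm_element(X, M_col, n_bitx):
    """Compute Y_1 = sum_i X_i * M_i,1 via n_bitx passes through the array."""
    y = 0
    for b in range(n_bitx):                      # step-1, step-2, step-3, ...
        # step-*a and initial step-*b: column j accumulates the binary
        # partial products of input bit-plane b -- the "charge" that the
        # ADC on vertical wire j would digitize
        col_sums = [
            sum(((X[i] >> b) & 1) & ((M_col[i] >> j) & 1)
                for i in range(len(X)))
            for j in range(n_bitx)
        ]
        # remaining step-*b: shift-and-add the column sums by weight 2^j
        partial = sum(s << j for j, s in enumerate(col_sums))
        # step-*c: accumulate, weighted by the input bit position 2^b
        y += partial << b
    return y

# Sanity check against a direct inner product
assert svmm_element(X, M_col, n_bitx) == sum(x * m for x, m in zip(X, M_col))
print(svmm_element(X, M_col, n_bitx))   # 3*5 + 5*1 + 2*6 = 32
```

Unrolling the loops shows why this works: y = Σ_b 2^b Σ_j 2^j Σ_i X_i^b · M_i^j = Σ_i X_i · M_i,1, i.e. n_bitx passes recover the full-precision inner product.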
For computing the remaining elements of vector Y, we require similar steps; but for each Y_{j} we require a distinct analog cell array that corresponds to the j^{th} column of M in 4.7. The input X remains the same, and is still presented bit-serially as described earlier. The ADCs and arithmetic/logic units are virtualized across all elements of Y.
In our hardware analysis, we assume a precision of n_{bitx} = 8 for both X and M. This makes this architecture comparable to the digital FXP8 architecture. (The block diagram of mixed-signal CMOS/CMOL will be similar to Fig. 4.7, but blocks B1a and B2a are replaced with equivalent SVMM – Fig. 4.8b/c.) The required ADC resolution is 12 bits, corresponding to the actual size of M (specified in Table 4.2). We virtualize all the addition operations in Fig. 4.8b by using a single physical adder.
4.6.2 Mixed-Signal CMOL Architecture
The mixed-signal CMOL nanogrid SVMM is shown in Fig. 4.8c. This is similar to the mixed-signal CMOS SVMM shown in Fig. 4.8b, except that the CMOS analog cell array is replaced with a CMOL nanogrid, and each cell is replaced with an equivalent nanodevice, which functions as a binary switch. If the nanodevice is ON, and the input \( {X}_{i}^{b}\) on that particular horizontal nanowire is 1, then an "on" current will flow in the corresponding vertical nanowire (shown by a vertical arrow). Each vertical nanowire accumulates some units of "on" current as shown in Fig. 4.8c. The ADC then converts the current (or equivalent voltage) to digital. The specific functionality is the same as discussed earlier for the mixed-signal CMOS SVMM and is not repeated here.
The storage for X requires special consideration; elements of X are written by sequential memory access, but we need to read one bit from each X_{i} in parallel. We do not go into details of how that can be done, but instead use a worst case consideration that storage for X has extra routing/circuit overhead of around 50% for this increased functionality.
4.7 Performance/Price Analysis and Results
4.7.1 Performance/Price Analysis
Major circuit components used in implementing a Bayesian memory (BM) module
| Components | Dig. CMOS FXP | Dig. CMOS FLP | Dig. CMOS LNS | Dig. CMOL FXP | Dig. CMOL FLP | Dig. CMOL LNS | Mixed-signal CMOS | Mixed-signal CMOL |
|---|---|---|---|---|---|---|---|---|
| Digital Adder | Y Y_{R} | | | Y Y_{R} | | | Y Y_{R} | Y Y_{R} |
| Digital Multiplier | Y Y_{R} | | | Y Y_{R} | | | Y Y_{R} | Y_{R} |
| Digital Divider | Y_{R} | | | Y_{R} | | | Y_{R} | Y_{R} |
| Floating-Point Unit^{a} | | Y | | | Y | | | |
| Log. Num. Sys. Unit^{a} | | | Y | | | Y | | |
| ADC | | | | | | | Y | Y |
| Memory – SRAM | Y | Y | Y | | | | Y | |
| Memory – CMOL | | | | Y | Y | Y | | Y |
| Analog CID/DRAM Cell | | | | | | | Y | |
| CMOL nanogrid | | | | | | | | Y |
Performance/price measures for various circuit components
| Component | Area (= A) (mm^{2}) | Power (W) | Time (s) |
|---|---|---|---|
| SRAM | \( 6\times {10}^{-7}{N}_{b}/2.85 \) | \( \alpha \cdot 0.64A \), where α = 0.1 | \( 0.025\times {10}^{-9} \) |
| CMOL MEM | \( 1.803\times {10}^{-9}{N}_{b} \) | \( 8.974\times {10}^{-12}{N}_{b} \) | \( 1.72\times {10}^{-9} \) |
| Dig. Adder | \( \frac{1.2\times {10}^{-2}{a}_{t}^{2}N{\mathrm{log}}_{2}(N)}{32{\mathrm{log}}_{2}(32)} \) | \( 0.04A \), where a_{t} = 22/180 | \( \frac{0.75\times {10}^{-9}{a}_{t}{\mathrm{log}}_{2}(N)}{{\mathrm{log}}_{2}(32)} \) |
| Dig. Multiplier | \( \frac{1.2\times {10}^{-4}N(M+1)}{16(16+1)} \) | \( \frac{1.7\times {10}^{-5}N(M+1)}{16(16+1)} \) | \( \frac{0.2\times {10}^{-9}(N+M)}{(16+16)} \) |
| Dig. Divider | \( \frac{309{a}_{t}^{2}N{\mathrm{log}}_{2}(N)}{55{\mathrm{log}}_{2}(55)} \) | \( 0.04A \), where a_{t} = 22/1200 | \( 160\times {10}^{-9}{a}_{t}N/55 \) |
| Analog CID/DRAM Cell | \( 3.24\times {10}^{-5}{a}_{t}^{2} \) | \( 5\times {10}^{-8}{a}_{t}^{2} \), where a_{t} = 90/500 | \( 1\times {10}^{-5}{a}_{t} \) |
| ADC – 12b | \( 17.22{a}_{t} \) | \( 0.033{a}_{t} \), where a_{t} = 90/1200 | \( 2\times {10}^{-7}{a}_{t} \) |
| FPU | \( 0.258{a}_{t} \) | \( 1.032\times {10}^{-2}{a}_{t} \), where a_{t} = 22/350 | |
| – Adder | | | \( 3{a}_{t}\times 7.692\times {10}^{-9} \) |
| – Multiplier | | | \( 3{a}_{t}\times 7.692\times {10}^{-9} \) |
| – Divider | | | \( 15{a}_{t}\times 7.692\times {10}^{-9} \) |
| LNU | \( 16.6{a}_{t} \) | \( 0.355{a}_{t} \), where a_{t} = 22/1200 | |
| – Adder | | | \( {a}_{t}\times 2.631\times {10}^{-8} \) |
| – Multiplier | | | \( {a}_{t}\times 2.631\times {10}^{-8} \) |
| – Divider | | | \( {a}_{t}\times 2.631\times {10}^{-8} \) |
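As a worked example of how these performance/price formulas are used, the snippet below (a sketch; only the digital adder row is evaluated) computes area, power, and delay for an N = 32 bit adder with the table's technology-scaling factor a_t = 22/180:

```python
import math

# Evaluate the Dig. Adder performance/price formulas from the table
# for an N-bit adder, scaled to 22 nm (a_t = 22/180).
a_t = 22 / 180
N = 32

area  = 1.2e-2 * a_t**2 * N * math.log2(N) / (32 * math.log2(32))  # mm^2
power = 0.04 * area                                                # W
time  = 0.75e-9 * a_t * math.log2(N) / math.log2(32)               # s

print(f"area  = {area:.3e} mm^2")   # ~1.79e-04 mm^2
print(f"power = {power:.3e} W")
print(f"time  = {time:.3e} s")      # ~9.17e-11 s
```

For N = 32 the N·log2(N) normalization cancels, so the area reduces to 1.2 × 10^{-2}·a_t^2; for other word widths the formula scales it up or down accordingly.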
4.7.2 Performance/Price Results and Discussion
Comparison of digital and mixed-signal architectures
Columns 2–8 give the area, power, and update time of a single BM; the remaining columns are for one max-chip^{c}.

| Architecture | Area MEM^{a} (mm^{2}) | Area ARL^{b} (mm^{2}) | Area Total (mm^{2}) | Power MEM (W) | Power ARL (W) | Power Total (W) | Time (s) | No. of BMs | Total Power (W) | Throughput Per Max-chip^{f} (TPM) | Normalized TPM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Digital CMOS^{d} | 28.525 | 0.0048 | 28.529 | 1.82562 | 0.00028 | 1.82590 | 0.00790 | 30 | 54.78 | 3797 | 76 |
| Mixed-Signal CMOS^{e} | 282.06 | 1.5546 | 283.62 | 0.46599 | 0.00324 | 0.46923 | 0.05980 | 3 | 1.41 | 50 | 1 |
| Digital CMOL^{d} | 0.2443 | 0.0048 | 0.2491 | 0.00122 | 0.00028 | 0.00149 | 0.03627 | 3443 | 5.13 | 94848 | 1897 |
| Mixed-Signal CMOL^{e} | 0.0121 | 1.5581 | 1.5703 | 0.02127 | 0.00324 | 0.02450 | 0.00727 | 546 | 13.38 | 75103 | 1505 |
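The max-chip columns of the comparison table follow directly from the single-BM columns: the BM count is (approximately) the number of BM tiles that fit in the 858 mm^{2} max-chip, and TPM is that count divided by the BM update time. A sketch of the derivation is below; the inputs are reproduced from the table, and small rounding differences against the tabulated values remain (e.g. the digital CMOL BM count comes out one higher).

```python
# Reproduce the max-chip columns of the comparison table from the
# single-BM totals. Max-chip area = 858 mm^2.
MAX_CHIP_AREA = 858.0  # mm^2

rows = {  # architecture: (total area mm^2, total power W, update time s)
    "Digital CMOS":      (28.529,  1.82590, 0.00790),
    "Mixed-Signal CMOS": (283.62,  0.46923, 0.05980),
    "Digital CMOL":      (0.2491,  0.00149, 0.03627),
    "Mixed-Signal CMOL": (1.5703,  0.02450, 0.00727),
}

for name, (area, power, time) in rows.items():
    n_bm = int(MAX_CHIP_AREA // area)   # BMs per max-chip (area-limited)
    tpm = n_bm / time                   # BM updates per second per max-chip
    print(f"{name:18s}  BMs={n_bm:5d}  P={n_bm * power:6.2f} W  TPM={tpm:8.0f}")
```

This also makes the normalized-TPM column transparent: each TPM is divided by the MS CMOS value (50), the worst performer.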
In Fig. 4.9, we see that the "Memory components" (MEM) dominate the total area of the BM. This is a major advantage, especially when using CMOL memory, which is 100 times denser than CMOS memory (SRAM). For the "Arithmetic/logic components" (ARL), the FLP and LNS architectures consume less area than the FXP8–FXP32 architectures, since the FXP architectures are not virtualized to the same extent. The FXP architectures also carry the overhead of several range normalization circuits, which were likewise not virtualized. But because memory dominates the area of a BM processor, this overhead does not add much to the total cost. The ARL in the FLP architecture occupies less area than the ARL in the LNS architecture.
From Table 4.6, when comparing digital CMOL to Mixed-Signal (MS) CMOL, we were able to achieve the speed-up (i.e. n_{bitx}⋅t_{SVMM} < p⋅(t_{add} + t_{mult})) using our virtualized SVMM technique. The same is not true for the digital CMOS and MS CMOS architectures. The MS CMOS architecture provides the worst performance at the highest price; only 3 BMs fit on one max-chip, which suggests it is not a scalable architecture. The digital CMOS and MS CMOL architectures have approximately the same performance, but MS CMOL is less expensive, allowing us to fit 18 times more BMs on one max-chip, at one-fourth the total power consumption. Notice that the power density of a max-chip for all architectures is well below the allowed 200 W cm^{−2} (for 22 nm CMOS/3 nm CMOL, according to ITRS [43]).
In summary, the MS CMOL architecture, which consists of CMOL memory and a MS nanogrid implementation of the VMM operation, is clearly the most cost-effective of the architecture options examined here, providing the best performance at a reasonable price.
An important criterion for comparing various hardware architectures is the "performance/price ratio", which measures the usefulness/efficiency of silicon area in solving the problem (or set of problems) at hand. For the hardware implementations of computational models, this criterion is defined as a "Module update rate per 858 mm^{2}" or "Throughput Per Max-chip" (TPM) [3, 54]. Table 4.6 shows that the TPM of the digital CMOL architecture is 25 times the TPM of the digital CMOS architecture; the TPM of the MS CMOL architecture is 20 times the TPM of the digital CMOS architecture. The TPM for a PC MATLAB implementation of the BM is 33, and the TPM for a BM implementation on the Cray XD1 (a supercomputer accelerated with FPGAs) is 2,326 (derived^{14} from the throughputs reported in [21]). Hence, we see that the TPM of the CMOL based custom architectures is 32–40 times better than that of the Cray XD1 multi-processor/multi-FPGA system.
The nanogrids in all CMOL architectures studied here were designed to have a worst case power density of ∼200 W cm^{−2} (allowed by ITRS [43]) at hotspots. Hence, all CMOL performance/price numbers correspond to that density. If the power density budget were increased, then performance could be improved. For example, the time to update a MS CMOL architecture based BM reduces to 0.00421 s, if the power density budget were doubled. So there is a clear trade-off between power density and performance for CMOL nanogrids; the same was also suggested in [3].
4.7.3 Scaling Estimates for BM Based Cortex-Scale System
This chapter has exclusively focused on the hardware implementations of Part-B of the BM. The hardware assessment methodology (given in Fig. 4.2) has also been used to investigate the hardware implementations of Part-A of the BM, which deals with learning/training. The detailed discussion of hardware architectures for Part-A of the BM (with a simplified algorithm for learning spatial CB vectors) is given in [20], and is not repeated here. The results therein conclude that Part-B of the BM dominates the overall hardware requirements and performance/price of the complete BM (including Part-A and Part-B). The results in [20] also indicate that it is not feasible to map analog/non-linear functions such as Euclidean and Gaussian functions onto CMOL-like nanogrid/nanodevices [55]. Consequently, the MS architectures (for Part-A of the BM) have to rely on traditional analog CMOS circuit components to implement such functions [20, 55], which limits the usefulness of CMOL structures for implementing the operations within Part-A of the BM [2]. This contrasts with the hardware analysis results for Part-B of the BM, where MS CMOL structures were particularly cost-effective.
BM based cortex-scale system: final performance/price and scaling estimates
| Proposed design | Part-A of BM | Part-B of BM | Total time^{b} (s) | Total area^{a} (mm^{2}) | Norm. area^{c} | No. of (30 cm) wafers | Perf./Price (s^{−1} mm^{−2}) | Norm. perf./price | Total power (W) | Net power density (W mm^{−2}) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Dig. CMOS | Dig. CMOL | 3.63 × 10^{−2} | 1.67 × 10^{6} | 6.96 | 27 | 1.65 × 10^{−5} | 1.00 | 3.73 × 10^{4} | 2.23 × 10^{−2} |
| 2 | Dig. CMOS | MS CMOL | 7.28 × 10^{−3} | 8.11 × 10^{6} | 33.79 | 128 | 1.69 × 10^{−5} | 1.03 | 1.49 × 10^{5} | 1.84 × 10^{−2} |
| 3 | Dig. CMOL | Dig. CMOL | 3.64 × 10^{−2} | 1.22 × 10^{6} | 5.08 | 20 | 2.25 × 10^{−5} | 1.36 | 8.37 × 10^{3} | 6.85 × 10^{−3} |
| 4 | Dig. CMOL | MS CMOL | 7.45 × 10^{−3} | 7.66 × 10^{6} | 31.92 | 121 | 1.75 × 10^{−5} | 1.06 | 1.21 × 10^{5} | 1.57 × 10^{−2} |
The normalized performance/price ratio is approximately equal for all BMCSS architectures, except the all-digital CMOL design (proposed design no. 3), which has the highest performance/price ratio. However, in terms of silicon area alone, or the number of 30 cm wafers required to implement the BMCSS, the digital architectures are 5× to 7× more compact than the architectures with MS parts in the BM. For the purpose of a crude scaling estimate (in terms of density), the total area of the BMCSS can be compared to the actual area (2.4 × 10^{5} mm^{2} [39]) of cortex, as shown in the column labeled "Norm. area"; the artificial BMCSS is approximately 5× to 33× larger than the actual cortex. Hence, we can optimistically conclude that, with possible advances in nanoelectronics, we are approaching biological densities with manageable power (i.e. power density is within the ITRS allowed limits of 64–200 W cm^{−2} [57]); this is similar to the claims in [43, 54, 57].
4.8 Conclusion, Contribution and Future Work
The results in this chapter suggest that implementation of Bayesian inference engines and/or Bayesian BICMs is going to be significantly memory dominated. We conclude then that any enabling technology for large-scale Bayesian inference engines will need very high density storage (with potential for inherent computations), and that this storage needs to be accessed via high bandwidth, which limits the use of off-chip storage. Hybrid nanotechnologies such as CMOL have emerged as a very useful candidate for building such large-scale Bayesian inference engines.
We have shown how to effectively use the CMOL hybrid nanotechnology as memory, and as a mixed-signal computational structure, for implementing Bayesian inference engines, which form the core of many AI and machine learning techniques.
We have also proposed a novel use of a mixed-signal CMOL structure for Vector-Matrix Multiplication (VMM), VMM being one of the most fundamental operations used in many computational algorithms. This mixed-signal CMOL structure provides speeds comparable to digital CMOS, which is due to the efficient integration of storage and computation at the nanodevice/grid level.
For our particular application, we have shown that silicon real-estate is utilized much more efficiently (at least 32–40 times better TPM ≅ speed per unit-area) in the case of CMOL, as compared to, for example, a multi-processor/multi-FPGA system that uses CMOS technology. When compared to a single general-purpose processor/PC, the TPM of CMOL based architectures is 2,200–2,800 times better. It is obvious that CMOL is denser (and slower when used as memory), but what is not as obvious is that CMOL, when used in mixed-signal mode, can provide reasonable speed-up. However, when using mixed-signal CMOL, the external post-processing CMOS components dominate the area; hence, their virtualization is a crucial constraint for design and scaling.
In addition to the results of the architecture study, another important contribution of this chapter is the development and use of a "hardware design space exploration" methodology for architecting hardware and analyzing its performance/price; this methodology has the potential to be used as an "investigation tool" for various computational models and their implementations. This is particularly true as chip designers start transitioning their existing CMOS solutions to nanotechnology based solutions.
As future work, one needs to explore the remaining design space using both traditional and CMOL technology, and also search for new nanoelectronic circuit candidates that may be better suited for implementing Bayesian inference engines. Another important approach is to investigate approximate inference techniques that trade off accuracy for performance.
4.9 Appendix
Note: The values of the variables (related to Pearl’s algorithm) used in the following equations are given in Table 4.2. The performance/price measures for the various circuit components used in the following equations are given in Table 4.5. Most of the subscripts are self-explanatory, and denote the particular component from Table 4.5. The superscripts denote the number of bits for that particular circuit component from Table 4.5. All timing numbers for digital components first need to be normalized as a multiple of t_{clk} = 9.32 × 10^{−11} s, before being used in these equations. The subscript ‘BPA’ refers to Pearl’s belief propagation Eqs. 4.1–4.4 and 4.6; subscript ‘ARL’ denotes arithmetic/logic components; subscript ‘MEM’ denotes memory.
4.9.1 Digital FLP or LNS Architecture
The following equations are for FLP architecture; for LNS architecture, replace subscript ‘FPU’ with ‘LNU’. Also, the following equations are for digital CMOS architectures; for digital CMOL architectures replace subscript ‘SRAM’ with ‘CMOL-MEM’. Here, n_{bit} = 32, corresponding to single-precision FLP/LNS data.
Time
Area
Power
4.9.2 Digital FXP Architecture
Time
Area
Power
4.9.3 Mixed-Signal CMOS Architecture
For mixed-signal CMOS architectures, each element of the CPT, and external data have (n_{bit} =) 8 bit representation. The mixed-signal CMOS SVMM implements only 4.2 and 4.3 of the BPA; for 4.1, 4.4, and 4.6, use the corresponding performance/price equations from FXP architecture.
Time
Area
Power
4.9.4 Mixed-Signal CMOL Architecture
For mixed-signal CMOL architecture, each element of the CPT, and external data have (n_{bit} =) 8-bit representation. The mixed-signal CMOL-based SVMM implements only 4.2 and 4.3 of the BPA; for 4.1, 4.4, and 4.6, use the corresponding performance/price equations from FXP architecture.
MS CMOL Nanogrid for the SVMM
Where \( {\rho}_{0} = 2\ \mathrm{m}\Omega\,\mathrm{cm} \), \( l = 10\ \mathrm{nm} \), \( {C}_{wire}/L = 0.2\ \mathrm{fF}\,\mu\mathrm{m}^{-1} \) [9]; \( r = 10^{-8}\ \Omega\,\mathrm{cm}^{2} \) [3]; \( {R}_{on} = 100\ \mathrm{M}\Omega \), \( V = 0.3\ \mathrm{V} \), \( a = 0.5(1/M) \), \( b = g = 0.5 \) [58]; and for our implementation N = n_{CBy} for 4.2, N = n_{CBz} for 4.3, and M = n_{bit}.
Time
Area
Power
4.9.5 Example: Use of Architecture Assessment Methodology for Associative Memory Model
We briefly discuss the use of our architecture assessment methodology for implementing an associative memory model (for further details on the model/implementation, refer to [3], [54]). Each step in Fig. 4.2 is briefly summarized below:
Step-1: Assume the (Palm and Willshaw) associative memory model [3].
Steps-2 and 3: Assume that both the input (X) and output (Y) vectors are binary, with fixed numbers of active elements (or '1's), l and k respectively. Assume that the weight matrix is trained using the summation-rule (instead of the simple OR-rule), and hence consists of multi-bit weight values. During recall, input vector X is applied to the network; each intermediate output element is obtained as the inner-product \( {\tilde{y}}_{i}={\displaystyle {\sum }_{j}{w}_{ij}{x}_{j}}\); then a global threshold \( (q)\) is selected such that exactly k output elements are '1', i.e. \( {\forall }_{i}\), set y_{i} = 1 if \( {\tilde{y}}_{i}-q\ge 0\), else set y_{i} = 0.
Step-4: Major equations/operations during recall are: the inner-product for \( {\tilde{y}}_{i}\), and the k-winner take all (k-WTA) to get y_{i}.
Step-5: The inner-product consists of multiplication and addition type computations, and the k-WTA consists of comparison type computations. For the inner-product, multiplication is simple, because input X is binary, hence a multi-input AND-gate can be substituted for a multiplier.
Steps-6 and 7: To accomplish the above computations (and storage), various circuit components (CMOS/Nano, digital/analog) can be used, leading to different design configurations. Some of these proposed configurations are: digital CMOS, digital CMOL, mixed-signal (MS) CMOS, and MS CMOL. (There are other possible configurations, but they are not covered here. In addition, the MS designs proposed here are different from [3].)
Step-8: For a worst case analysis, assume that the weight matrix is full density (not sparse), and that the performance is estimated using sequential operations and/or worst case timing paths.
Step-9: In digital CMOS design, the weight matrix is stored as SRAM/eDRAM, the inner-product uses digital components, and k-WTA uses digital k-WTA circuit component [3]. Digital CMOL design is the same as digital CMOS, except that SRAM is replaced with CMOL-nanogrid based digital memory. In MS CMOS design, the weight matrix can be stored in an analog floating-gate (FG-cell) array [59]; the k-WTA is done using analog k-WTA circuit [3]. In MS CMOL design, the weight matrix can be stored on 1R nanodevices (or nano-memristors [60]) in a nanogrid; the k-WTA is done using analog k-WTA circuit [3]. In both MS CMOS/CMOL designs, for the inner-product, the multiply occurs in each FG-cell/nanodevice, and the addition occurs along the vertical nanowires as (analog) summation of charge/currents [3]. The size (and performance/price) of digital and analog components depend on the size of the input and output vectors, and the number of active ‘1’s, etc., as selected by the user. The size of the CMOL nanogrids also depends indirectly on the size of the input and output vectors.
Step-10: A system-level block diagram needs to be generated for each design, using the constituent circuit components discussed earlier.
Step-11: The performance/price measures of all digital/analog CMOS circuit components can be derived from (example) published sources. For new components such as the CMOL nanogrid structures, the performance/price measures can be modeled using fundamental area/power equations along with Elmore-delay analysis [3]. These measures have to be scaled to the appropriate/desired CMOS/CMOL technology node using VLSI scaling rules [45].
Step-12: The final performance/price for the complete design (for each particular design configuration) needs to be evaluated considering the system-level block diagram, using the individual performance/price measures of the circuit components, and based on the computational requirements of the operations (i.e. according to the size of the input/output vectors, etc.).
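The recall operation in Steps 2–5 above can be sketched in a few lines. The snippet below is a functional model only, with hypothetical sizes, weights, and input; the global threshold is chosen as the k-th largest inner product (ties at the threshold can yield more than k winners).

```python
import random

random.seed(0)

n, m, l, k = 16, 12, 4, 3          # sizes and activity levels (hypothetical)

# Multi-bit weight matrix, as from summation-rule training (hypothetical values)
W = [[random.randint(0, 3) for _ in range(n)] for _ in range(m)]

# Binary input X with exactly l active elements ('1's)
active = set(random.sample(range(n), l))
x = [1 if j in active else 0 for j in range(n)]

# Step-4/5: inner products; since X is binary, each multiply is an AND,
# so y~_i is just the sum of the weights in the active columns
y_tilde = [sum(W[i][j] for j in range(n) if x[j]) for i in range(m)]

# k-WTA: choose a global threshold q so that (about) k outputs are '1'
q = sorted(y_tilde, reverse=True)[k - 1]
y = [1 if v >= q else 0 for v in y_tilde]

print(y_tilde, q, y)
assert sum(y) >= k
```

In the MS CMOS/CMOL designs of Step-9, the `y_tilde` line corresponds to the per-cell multiply plus the analog charge/current summation on the vertical wires, and the thresholding line corresponds to the analog k-WTA circuit.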
Footnotes
- 1.
“Performance overkill” is where the highest-volume segments of the market are no longer performance/clock frequency driven.
- 2.
“Density overkill” is where it is difficult for a design team to effectively design and verify all the transistors available to them on a typical design schedule.
- 3.
We use the term CMOL to describe a specific family of nanogrid structures as developed by Likharev et al.
- 4.
From a traditional computer engineering perspective, the detailed implementation of designs from regions 5–8 could be referred to as micro-architectures, but for simplicity, we refer to them as architectures.
- 5.
This methodology has been used to study digital and mixed-signal designs/architectures in the past; in order to use it for other designs (including analog designs) one may need to make some adjustments.
- 6.
The “diagnostic” message is also called bottom-up evidence, and the “causal” message is also called top-down evidence.
- 7.
Probabilistic belief or Bayesian confidence is simply referred to as “belief”.
- 8.
A viable alternative (not explored here) to SRAM is eDRAM, which is denser, but is significantly slower.
- 9.
Here we are trading-off local computation and storage for longer range communication.
- 10.
Considering that n_{CBy} ≠ n_{CBz}, we cannot virtualize the blocks across 4.2 and 4.3. We could virtualize the range normalization blocks for 4.4 and 4.6, but have not done so here for simplicity’s sake.
- 11.
Where, t_{op} denotes the time required to complete ‘op’ type operation.
- 12.
This may not be always true, because it depends on the size of M.
- 13.
In accordance with our baseline assumption, the timing analysis for the digital architectures assumes that most of the operations within each block are executed sequentially, without any pipelining; however, some of the blocks execute in parallel.
- 14.
This system has 864 AMD processors, and 150 FPGAs; the reported Nodes/s is 2.25 × 10^{6} for large network of BMs. The TPM is derived by using (22/90) technology-scaling factor for 22 nm, and (1,014 × 200/858 mm^{2}) area factor; we assume each processor-chipset or FPGA (with SRAM banks) will have an area of 200 mm^{2}.
- 15.
Such a crude scaling estimate is intended only for guidance purposes. No structural / functional equivalence to the actual cortex and/or human intelligence is intended or claimed.
Notes
Acknowledgment
Useful discussions with many colleagues, including Prof. K.K. Likharev, Dr. Changjian Gao, and Prof. G.G. Lendaris are gratefully acknowledged.
References
- 1. M.S. Zaveri, D. Hammerstrom, CMOL/CMOS implementations of Bayesian polytree inference: digital & mixed-signal architectures and performance/price. IEEE Trans. Nanotechnol. 9(2), 194–211 (2010). DOI: 10.1109/TNANO.2009.2028342
- 2. D. Hammerstrom, M.S. Zaveri, Prospects for building cortex-scale CMOL/CMOS circuits: a design space exploration, in Proceedings of IEEE Norchip Conference (Trondheim, Norway, 2009)
- 3. C. Gao, D. Hammerstrom, Cortical models onto CMOL and CMOS – architectures and performance/price. IEEE Trans. Circ. Syst.-I 54, 2502–2515 (2007)
- 4. S. Borkar, Electronics beyond nano-scale CMOS, in Proceedings of 43rd Annual ACM/IEEE Design Automation Conference (San Francisco, CA, 2006), pp. 807–808
- 5. R.I. Bahar, D. Hammerstrom, J. Harlow, W.H.J. Jr., C. Lau, D. Marculescu, A. Orailoglu, M. Pedram, Architectures for silicon nanoelectronics and beyond. IEEE Computer 40, 25–33 (2007)
- 6. D. Hammerstrom, A survey of bio-inspired and other alternative architectures, in Nanotechnology: Information Technology-II, ed. by R. Waser, vol. 4 (Wiley-VCH Verlag GmbH, Weinheim, Germany, 2008), pp. 251–285
- 7. Intel, 60 years of the transistor: 1947–2007, Intel Corp., Hillsboro, OR (2007), http://www.intel.com/technology/timeline.pdf
- 8. V. Beiu, Grand challenges of nanoelectronics and possible architectural solutions: what do Shannon, von Neumann, Kolmogorov, and Feynman have to do with Moore, in Proceedings of 37th IEEE International Symposium on Multiple-Valued Logic (Oslo, Norway, 2007)
- 9. D.B. Strukov, K.K. Likharev, CMOL FPGA: a reconfigurable architecture for hybrid digital circuits with two-terminal nanodevices. Nanotechnology 16, 888–900 (2005)
- 10. Ö. Türel, J.H. Lee, X. Ma, K.K. Likharev, Architectures for nanoelectronic implementation of artificial neural networks: new results. Neurocomputing 64, 271–283 (2005)
- 11. K.K. Likharev, D.V. Strukov, CMOL: devices, circuits, and architectures, in Introduction to Molecular Electronics, ed. by G. Cuniberti, G. Fagas, K. Richter (Springer, Berlin, 2005), pp. 447–478
- 12. D.B. Strukov, K.K. Likharev, Reconfigurable hybrid CMOS/nanodevice circuits for image processing. IEEE Trans. Nanotechnol. 6, 696–710 (2007)
- 13. G. Snider, R. Williams, Nano/CMOS architectures using a field-programmable nanowire interconnect. Nanotechnology 18, 1–11 (2007)
- 14. NAE, Reverse-engineer the brain, Grand challenges for engineering (The U.S. National Academy of Engineering (NAE) of The National Academies, Washington, DC, [online], 2008), http://www.engineeringchallenges.org. Accessed 15 February 2008
- 15. R. Ananthanarayanan, D.S. Modha, Anatomy of a cortical simulator, in ACM/IEEE Conference on High Performance Networking and Computing: Supercomputing (Reno, NV, 2007)
- 16. D. George, J. Hawkins, A hierarchical Bayesian model of invariant pattern recognition in the visual cortex, in Proceedings of International Joint Conference on Neural Networks (Montreal, Canada, 2005), pp. 1812–1817
- 17. T.S. Lee, D. Mumford, Hierarchical Bayesian inference in the visual cortex. J. Opt. Soc. Am. A: Opt. Image Sci. Vis. 20, 1434–1448 (2003)
- 18. T. Dean, Learning invariant features using inertial priors. Ann. Math. Artif. Intell. 47, 223–250 (2006)
- 19. G.G. Lendaris, On systemness and the problem solver: tutorial comments. IEEE Trans. Syst. Man Cybern. 16, 604–610 (1986)
- 20. M.S. Zaveri, CMOL/CMOS hardware architectures and performance/price for Bayesian memory – the building block of intelligent systems, Ph.D. dissertation, Department of Electrical and Computer Engineering, Portland State University, Portland, OR, October 2009
- 21. K.L. Rice, T.M. Taha, C.N. Vutsinas, Scaling analysis of a neocortex inspired cognitive model on the Cray XD1. J. Supercomput. 47, 21–43 (2009)
- 22. D. George, A mathematical canonical cortical circuit model that can help build future-proof parallel architecture, Workshop on Technology Maturity for Adaptive Massively Parallel Computing (Intel Inc., Portland, OR, March 2009), http://www.technologydashboard.com/adaptivecomputing/Presentations/MPAC%20Portland__Dileep.pdf
- 23. C. Gao, M.S. Zaveri, D. Hammerstrom, CMOS/CMOL architectures for spiking cortical column, in Proceedings of IEEE World Congress on Computational Intelligence – International Joint Conference on Neural Networks (Hong Kong, 2008), pp. 2442–2449
- 24. E. Rechtin, The art of systems architecting. IEEE Spectrum 29, 66–69 (1992)
- 25. D. Hammerstrom, Digital VLSI for neural networks, in The Handbook of Brain Theory and Neural Networks, ed. by M.A. Arbib (MIT Press, Cambridge, MA, 1998), pp. 304–309
- 26. J. Bailey, D. Hammerstrom, Why VLSI implementations of associative VLCNs require connection multiplexing, in Proceedings of IEEE International Conference on Neural Networks (San Diego, CA, 1988), pp. 173–180
- 27. J. Schemmel, J. Fieres, K. Meier, Wafer-scale integration of analog neural networks, in Proceedings of IEEE World Congress on Computational Intelligence – International Joint Conference on Neural Networks (Hong Kong, 2008), pp. 431–438
- 28. K.A. Boahen, Point-to-point connectivity between neuromorphic chips using address events. IEEE Trans. Circ. Syst. II: Anal. Dig. Sig. Process. 47, 416–434 (2000)
- 29. D. George, B. Jaros, The HTM learning algorithms (Numenta Inc., Menlo Park, CA, Whitepaper, March 2007), http://www.numenta.com/for-developers/education/Numenta_HTM_Learning_Algos.pdf
- 30. K.L. Rice, T.M. Taha, C.N. Vutsinas, Hardware acceleration of image recognition through a visual cortex model. Optics Laser Tech. 40, 795–802 (2008)
- 31. C.N. Vutsinas, T.M. Taha, K.L. Rice, A neocortex model implementation on reconfigurable logic with streaming memory, in IEEE International Symposium on Parallel and Distributed Processing (Miami, FL, 2008), pp. 1–8
- 32. R.C. O'Reilly, Y. Munakata, J.L. McClelland, Computational Explorations in Cognitive Neuroscience: Understanding the Mind by Simulating the Brain, 1st edn. (MIT Press, Cambridge, MA, 2000)
- 33. J. Hawkins, D. George, Hierarchical temporal memory: concepts, theory and terminology (Numenta Inc., Menlo Park, CA, Whitepaper, March 2007), http://www.numenta.com/Numenta_HTM_Concepts.pdf
- 34. J. Hawkins, S. Blakeslee, On Intelligence (Times Books, Henry Holt, New York, 2004)
- 35. D. Hammerstrom, M.S. Zaveri, Bayesian memory, a possible hardware building block for intelligent systems, AAAI Fall Symposium Series on Biologically Inspired Cognitive Architectures (Arlington, VA) (AAAI Press, Menlo Park, CA, TR FS-08-04, November 2008), p. 81
- 36. J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann, San Francisco, CA, 1988)
- 37. R. Genov, G. Cauwenberghs, Charge-mode parallel architecture for vector–matrix multiplication. IEEE Trans. Circ. Syst.-II 48, 930–936 (2001)
- 38. B. Murmann, Digitally assisted analog circuits. IEEE Micro 26, 38–47 (2006)
- 39. C. Johansson, A. Lansner, Towards cortex sized artificial neural systems. Neural Networks 20, 48–61 (2007)
- 40. R. Granger, Brain circuit implementation: high-precision computation from low-precision components, in Replacement Parts for the Brain, ed. by T. Berger, D. Glanzman (MIT Press, Cambridge, MA, 2005), pp. 277–294
- 41. S. Minghua, A. Bermak, An efficient digital VLSI implementation of Gaussian mixture models-based classifier. IEEE Trans. VLSI Syst. 14, 962–974 (2006)
- 42. D.B. Strukov, K.K. Likharev, Defect-tolerant architectures for nanoelectronic crossbar memories. J. Nanosci. Nanotechnol. 7, 151–167 (2007)
- 43. K.K. Likharev, D.B. Strukov, Prospects for the development of digital CMOL circuits, in Proceedings of International Symposium on Nanoscale Architectures (San Jose, CA, 2007), pp. 109–116
- 44. J.M. Rabaey, Digital Integrated Circuits: A Design Perspective (Prentice Hall, Upper Saddle River, NJ, 1996)
- 45. N. Weste, D. Harris, CMOS VLSI Design – A Circuits and Systems Perspective, 3rd edn. (Addison Wesley/Pearson, Boston, MA, 2004)
- 46. M. Haselman, M. Beauchamp, A. Wood, S. Hauck, K. Underwood, K.S. Hemmert, A comparison of floating point and logarithmic number systems for FPGAs, in 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Napa, CA, 2005), pp. 181–190
- 47. K.K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation (Wiley, New York, 1999)
- 48. K. Seungchul, L. Yongjoo, J. Wookyeong, L. Yongsurk, Low cost floating point arithmetic unit design, in Proceedings of IEEE Asia-Pacific Conference on ASIC (Taipei, Taiwan, 2002), pp. 217–220
- 49. D.M. Lewis, 114 MFLOPS logarithmic number system arithmetic unit for DSP applications. IEEE J. Solid-St. Circ. 30, 1547–1553 (1995)
- 50. P.C. Yu, H.-S. Lee, A 2.5-V, 12-b, 5-MSample/s pipelined CMOS ADC. IEEE J. Solid-St. Circ. 31, 1854–1861 (1996)
- 51. T.E. Williams, M.A. Horowitz, A zero-overhead self-timed 160-ns 54-b CMOS divider. IEEE J. Solid-St. Circ. 26, 1651–1662 (1991)
- 52. G. Gielen, R. Rutenbar, S. Borkar, R. Brodersen, J.-H. Chern, E. Naviasky, D. Saias, C. Sodini, Tomorrow's analog: just dead or just different? in 43rd ACM/IEEE Design Automation Conference (San Francisco, CA, 2006), pp. 709–710
- 53. J.N. Coleman, E.I. Chester, A 32-bit logarithmic arithmetic unit and its performance compared to floating-point, in Proceedings of 14th IEEE Symposium on Computer Arithmetic (Adelaide, Australia, 1994), pp. 142–151
- 54.C. Gao, Hardware architectures and implementations for associative memories – The building blocks of hierarchically distributed memories, Ph.D. dissertation, Department of Electrical and Computer Engineering, Portland State University, Portland, OR, Nov 2008Google Scholar
- 55.P. Narayanan, T. Wang, M. Leuchtenburg, C.A. Moritz, Comparison of analog and digital nanosystems: Issues for the nano-architect, in Proc. 2nd IEEE International Nanoelectronics Conference (Shanghai, China, 2008), pp. 1003–1008Google Scholar
- 56.D. George, J. Hawkins, Belief propagation and wiring length optimization as organizing principles for cortical microcircuits (Numenta Inc., Menlo Park, CA, 2007), http://www.stanford.edu/~dil/invariance/Download/CorticalCircuits.pdf
- 57.K.K. Likharev, Hybrid CMOS/nanoelectronic circuits: opportunities and challenges. J. Nanoelectron. Optoelectron. 3, 203–230 (2008)Google Scholar
- 58.C. Gao, D. Hammerstrom, CMOL based cortical models, in Emerging brain-inspired nano-architectures, ed. by V. Beiu and U. Rückert (Singapore: World Scientific, 2008 in press)Google Scholar
- 59.M. Holler, S. Tam, H. Castro, R. Benson, An electrically trainable artificial neural network (ETANN) with 10240 “floating gate” synapses, in International Joint Conference on Neural Networks (San Diego, CA, 1989), pp. 191–196Google Scholar
- 60.S.H. Jo, K.-H. Kim, W. Lu, Programmable resistance switching in nanoscale two-terminal devices. Nano Lett. 9, 496–500 (2009)Google Scholar