CMOL/CMOS Implementations of Bayesian Inference Engine: Digital and Mixed-Signal Architectures and Performance/Price – A Hardware Design Space Exploration


Abstract

In this chapter, we focus on aspects of the hardware implementation of the Bayesian inference framework within the George and Hawkins’ computational model of the visual cortex. This framework is based on Judea Pearl’s Belief Propagation. We then present a “hardware design space exploration” methodology for implementing and analyzing the (digital and mixed-signal) hardware for the Bayesian (polytree) inference framework. This particular methodology involves: analyzing the computational/operational cost and the related micro-architecture, exploring candidate hardware components, proposing various custom architectures using both traditional CMOS and hybrid nanotechnology CMOL, and investigating the baseline performance/price of these hardware architectures. The results suggest that hybrid nanotechnology is a promising candidate to implement Bayesian inference. Such implementations utilize the very high density storage/computation benefits of these new nano-scale technologies much more efficiently; for example, the throughput per 858 mm2 (TPM) obtained for CMOL based architectures is 32–40 times better than the TPM for a CMOS based multiprocessor/multi-FPGA system, and almost 2000 times better than the TPM for a single PC implementation. The assessment of such hypothetical hardware architectures provides a baseline for large-scale implementations of Bayesian inference, and in general, will help guide research trends in intelligent computing (including neuro/cognitive Bayesian systems), and the use of radical new device and circuit technology in these systems.

Keywords

Bayesian inference, Pearl's belief propagation, Cortex, CMOS, CMOL, Nanotechnology, Nanogrid, Digital, Mixed-signal, Hardware, Nanoarchitectures, Methodology, Performance/price

4.1 Introduction

The semiconductor/VLSI industry has followed Moore's law for the past several decades and has made tremendous progress, but as it approaches the lower nano-scale regime it faces several challenges, including power density, interconnect reverse scaling, device defects and variability, memory bandwidth limitations, performance overkill, density overkill, and increasing design complexity [1, 2, 3, 4, 5, 6].

The two most important aspects of technology scaling are speed and density. The transistor density available with modern state-of-the-art μ-processors has almost reached one billion transistors per chip [7]. Transistor switching delays are only a few picoseconds, and these processors are now moving to multiple-core architectures, which continue to improve performance [7]. Emerging nanotechnologies will further increase the densities by two to three orders of magnitude [5]. However, it has become increasingly difficult to efficiently use all these transistors [6]. In addition, volume markets, for the most part, are no longer performance driven [6]. If we exclude general-purpose processors/architectures, an important challenge then is: where else can we efficiently utilize silicon technology and its capabilities of speed, density, and multicore concepts?

One potential “architectural solution” to some of the above challenges includes hardware architectures for applications/models in the neuro-inspired and intelligent computing domain [8]. Most of these applications are inherently compute and memory intensive, and also massively parallel; consequently, it is possible that such applications would be able to fully utilize nano-scale silicon technology much more efficiently.

In recent years, nanoelectronics has made tremendous progress, but Borkar (from Intel®) [4] indicates that, as yet, there is no emerging nanoelectronic candidate with the potential for replacing, within the next 10–15 years, Complementary Metal-Oxide-Semiconductor (CMOS) as it is being used today in state of the art μ-processors. However, recent research has shown that certain kinds of nano-scale electronics, when used in hybrid configurations with existing CMOS, could be useful in creating new computing structures, as well as high density memory, and hybrid and neuromorphic architectures [3, 5, 6, 9, 10, 11, 12, 13]. One of the most promising of these hybrid technologies is CMOL, developed by Likharev [11], which is used in this work.

Another challenge is that of intelligent computing [6]. The term "Intelligent Computing" encompasses all the difficult problems that require the computer to find complex structures and relationships through space and time in massive quantities of low precision, ambiguous, noisy data [2]. The term Intelligent Signal Processing (ISP) has been used to describe algorithms and techniques that involve the creation, efficient representation, and effective utilization of large complex models of semantic and syntactic relationships [2, 6]. ISP augments and enhances existing Digital Signal Processing (DSP) by incorporating such contextual and higher level knowledge of the application domain into the data transformation process [2, 6]. Several areas in intelligent computing, such as AI, machine learning, and fuzzy logic, have made significant progress; however, "general-purpose artificial intelligence has remained elusive" [14], and we still do not have robust solutions that even remotely approach the capabilities of biological/cortical systems [6]. Consequently, a number of researchers are returning to the neuro and cognitive sciences to search for new kinds of Biologically Inspired Computational Models (BICMs) and "computing paradigms" [2, 15]. Recently, the neuroscience community has begun developing scalable Bayesian computational models (based on the Bayesian inference framework [16, 17, 18]) inspired by the "systems" level structural and functional properties of the cortex, including the feedforward and feedback interactions observed in the "subsystem" layers of the visual cortex, and integrating them with some higher-level cognitive phenomena [17, 19, 20]. These computational models, also referred to as "Biologically Inspired Computational Models" (BICMs) [2], have the potential to be applied to large-scale applications such as speech recognition, computer vision, image content recognition, robotic control, and making sense of massive quantities of data [6, 21]. Some of these new algorithms are ideal candidates for large-scale hardware investigation (and future implementation), especially if they can leverage the high density processing/storage advantages of hybrid nanoelectronics [3, 5].

The effective design and use of these nanoelectronic structures requires an application-driven total systems solution [5, 6]. Consequently, in this chapter we propose a new family of application-specific hardware architectures, which implement Bayesian (polytree) inference algorithms (inspired by computational neuroscience), using traditional CMOS and hybrid CMOS + nanogrid (CMOL) technologies.

We began this work by choosing the George and Hawkins' Model (GHM) [16] of the visual cortex as a useful family of Bayesian BICMs. Recently, George and Hawkins gave a prognosis about the hardware platforms for their model, indicating that the learning algorithms will initially be implemented in parallel software, partly because they are still evolving, while the inference framework could be implemented in embedded hardware [22]. One of the biggest problems with Bayesian inference in general is its compute intensiveness (it has been shown to be NP-Hard [6]).

Consequently, in this chapter, we exclusively focus on the Bayesian inference functionality of the GHM, and then develop a set of baseline hardware architectures and their relative performance/price measures using both traditional CMOS and a hypothetical CMOL technology. In the larger perspective, we perform a “hardware design space exploration” (in a limited subset of the space), which is a first step in studying radical new computing models and implementation technologies. The architecture evaluation methodology used here (and elsewhere [3, 23]) has the potential of being applied to a broader range of models and implementation technologies.

This chapter is organized as follows: Section 4.2 provides a brief introduction to the more generic concept of the hardware virtualization spectrum, and provides the methodology for "architecting" the hardware [24]. Section 4.3 describes the Bayesian Memory (BM) module and its Bayesian inference framework. Section 4.4 defines various hardware architectures that support the BM. Sections 4.5 and 4.6 discuss CMOS/CMOL digital and mixed-signal hardware architectures for the BM. Finally, we discuss the hardware performance/price results in Section 4.7.

4.2 Hardware for Computational Models

4.2.1 Hardware Virtualization Spectrum

When creating a baseline architecture for implementing any computational model or algorithm (from several domains, such as neural networks, intelligent computing, etc.), the single decision that has the greatest impact on the performance/price of the implementation is the degree of "virtualization" of the computation. This term, incidentally, should not be confused with virtual machines. We use virtualization to mean "the degree of time-multiplexing of the 'components' of computation via hardware resources" [3]. Since some of the computational models consist of fine grained networks, very fine-grained parallel implementation is generally possible, but varying the degree of virtualization allows us to make a wider range of trade-offs for a more efficient implementation.

A typical “hardware virtualization spectrum” is shown in Fig. 4.1. It essentially describes a software and hardware design space for implementing massively parallel algorithms [3]. As we move from left to right, the time-multiplexing of hardware resources decreases and parallelism increases. Likewise, the general flexibility/programmability decreases.
Fig. 4.1

Hardware virtualization spectrum (Adapted from [3], © 2007 IEEE). Numbered boxes correspond to various hardware design options, and are referred to as “region 1” through “region 8”

The right bottom corner represents a design in which the algorithm is fully hardwired into the silicon (for example, analog architectures in region 8); it achieves maximum parallelism, but has little or no virtualization, and is the least flexible [3]. Such designs are generally very fast and compact [25]. However, in general, it is difficult to multiplex analog components [25, 26], so as we move to the left, towards more virtualization, digital design tends to dominate. However, even in the analog domain, purely analog communication is not feasible, so digitally multiplexed communication is used [25, 26, 27, 28]. For some particular computational models, depending on their dynamics, such designs could be an inefficient implementation [3].

Previous work (other than GHM) has shown that, with possible future hardware technologies, semi-virtualized hardware designs (regions 5–7) will scale well, because they allow large-scale integration and component level multiplexing, and are suitable for multiplexed communication schemes [3, 10, 23].

For several application domains, designs from regions 5–7 have not generally been explored, and hence warrant research and investigation.

4.2.2 Existing Hardware Implementations of George and Hawkins’ Model

The work by George and Hawkins [16, 29] (and the related work at Numenta Inc.) mostly falls into region 1, and only a little into region 3. The recent work by Taha [21, 30, 31] explores combined implementations in regions 3 and 4 for the George and Hawkins Model (GHM) [16]. At this time, we are not aware of any work on custom hardware implementation of the GHM [16]. Hence, we can conclude that most of the work on the hardware implementation of the GHM [16] is almost exclusively concentrated in regions 1, 3, and 4, and hence, custom hardware implementations in regions 5–7 are still unexplored and present an opportunity for a different kind of hardware/computer architecture research. Consequently, the work presented here explores application-specific hardware designs/architectures for the Bayesian inference framework (i.e. Pearl's Belief Propagation Algorithm for polytrees) within the GHM [16], in regions 5–7, and investigates the hardware performance/price of these architectures.

Note: The analog community has not yet explored Region 8 for GHM, but in the future, with appropriate research, interesting fully analog designs are no doubt possible. It is not in the scope of this chapter to cover fully analog designs, because the focus of this work is to explore designs which can transition to, and utilize CMOL nanogrid structures. In addition, we do not claim that the designs proposed in this work are better than potential fully analog designs.

It should be realized that there are no universal benchmarks or baseline configurations available to allow a comparison of the hardware for the GHM or similar models [3]. In addition, factors that vary across the various models, such as data precision, vector/matrix sizes, algorithmic configurations, and hardware/circuit options, make these comparisons even more difficult. Because the baseline configuration for any hardware takes into account conservative assumptions and worst-case design considerations, the design tends to be more virtualized and perhaps less optimized; hence, the price (area and power) and performance (speed) are on the lower end, or at the baseline. Consequently, in the future, more sophisticated designs would probably have better performance, but at a higher price.

4.2.3 Hardware Design Space Exploration: An Architecture Assessment Methodology

Figure 4.2 summarizes the "hardware design space exploration" methodology used here for implementing the Bayesian inference framework and investigating the performance/price of the resulting architectures. Our previous work on associative memory (digital and mixed-signal) implementations [3, 23] did not formally summarize/discuss this methodology, but was based on it; a brief example showing the use of this methodology for an associative memory model is given in the Appendix (Section 4.9.5). In this chapter, with respect to the steps in Fig. 4.2, Section 4.3 covers steps 1–4; Section 4.4 covers steps 6–8; and Sections 4.5 to 4.7 cover steps 5 and 9–12.
Fig. 4.2

Hardware design space exploration – methodology

4.3 A Bayesian Memory (BM) Module

This work is based on the George and Hawkins Model (GHM) [16]. The GHM has several desirable characteristics: its underlying fundamental concepts have a high degree of "interpretability" [32], it incorporates the perception of time, and it has a basis in some proposed cognitive mechanisms [33, 34]. It has a modular and hierarchical structure, which allows shared and distributed storage, and it is scalable [32, 33]. It also has a simpler, non-"loopy" probabilistic framework as compared to [18]. All these characteristics potentially make the GHM [16] the simplest and most generic Bayesian model currently available in the neuroscience community. For all these reasons, it is a good candidate Bayesian model for investigating the related baseline hardware.

In this chapter, each individual module of the GHM [16] is referred to as a "Bayesian Memory" (BM) module [35], as shown in Fig. 4.3a. The BM module has two conceptual/functional parts (Part-A and Part-B), as shown in Fig. 4.3a. Part-A is largely involved in learning, consists of a set of quantized vectors, and is referred to as the Code Book (CB) of vectors [35]. During learning, the CB captures and accumulates the data on the inputs that the network sees [16, 35]. The conditional probability relations between these CB vectors are also captured during learning, and are used later for inference [16, 35]. Part-B is the Bayesian inference framework; it performs inference and is based on the bidirectional Belief Propagation Algorithm (BPA), pioneered by Pearl [36]. Figure 4.3b shows the details of Part-B, with a single parent BM and multiple child BMs [35]. A hierarchical system for recognition and related tasks can be built using BM modules, as shown in Fig. 4.3c.
Fig. 4.3

Bayesian Memory (BM) module. (a) Basic parts of a BM. (b) Detailed Part-B of a BM. (c) Hierarchical system of BMs

Pearl's BPA [36] for a polytree is summarized by Eqs. 4.1–4.6, which use a matrix/vector notation. As shown in Fig. 4.3b, the particular module of interest is called BM-y. BM-y has nc child modules, each denoted by BM-xi, where i = 1, …, nc. BM-y has a single parent module denoted by BM-z. BM-y exchanges probabilistic messages/vectors with its parent and children. In 4.1, the notation (xi,y) denotes that a message goes from BM-xi to BM-y, and the notation (y) denotes an internal message. The same style applies for 4.2–4.6. Equations 4.5 and 4.6 are equivalent, but 4.6 is easier to implement. The notation ⊗ denotes the same element-wise ∏ (product) operation, but over only two vectors. If there are nc child BMs, then 4.6 has to be evaluated nc times, once for each child BM-xi.

This chapter focuses almost exclusively on the implementation of the Bayesian inference framework (Part-B) of the GHM [16], and henceforth, the term BM denotes only Part-B, unless explicitly stated otherwise.
$$ \lambda(y) = \prod_{i=1}^{n_c} \lambda(x_i, y) $$
(4.1)
$$ \lambda(y, z) = M \times \lambda(y) $$
(4.2)
$$ \pi(y) = \pi(z, y) \times M $$
(4.3)
$$ B(y) = \alpha\, \pi(y) \otimes \lambda(y)' $$
(4.4)
$$ \pi(y, x_i) = \alpha\, B(y) / \lambda(x_i, y)' $$
(4.5)
$$ \pi(y, x_i) = \alpha\, \pi(y) \otimes \left( \prod_{k=1,\, k \neq i}^{n_c} \lambda(x_k, y)' \right) $$
(4.6)

The following discussion briefly summarizes the significance of the variables and operations in the equations of the BPA.

– λ(xi,y) is the input diagnostic message coming from child BM-xi, λ(y) is the internal diagnostic message of BM-y, and λ(y,z) is the output diagnostic message going to the parent BM-z.

– π(z,y) is the input causal message coming from parent BM-z, π(y) is the internal causal message of BM-y, and π(y,xi) is the output causal message going to the child BM-xi.

– λ(xi,y) represents the belief that BM-xi has in the CB of BM-y. λ(y,z) represents the belief that BM-y has in the CB of parent BM-z. π(z,y) represents the belief that parent BM-z has in its own CB, while ignoring the diagnostic message λ(y,z). π(y,xi) represents the belief that BM-y has in its own CB, while ignoring the diagnostic message λ(xi,y).

– M acts as a bidirectional probabilistic associative memory, which stores the probabilistic relationships between the CBs of BM-y and BM-z. M is also called the Conditional Probability Table (CPT). B(y) represents the final belief that BM-y has in its own CB, considering all possible diagnostic and causal messages.

– In 4.1, λ(y) is formed by combining all the input diagnostic messages using the ∏ operation. In 4.2, λ(y,z) is formed using a [matrix × vector] operation that converts the internal diagnostic message into the equivalent output diagnostic message. In 4.3, π(y) is formed using a [vector × matrix] operation that converts the input causal message to the equivalent internal causal message. In 4.4, B(y) is formed by combining the internal diagnostic and causal messages using the ⊗ operation. In 4.6, π(y,xi) is formed by combining the internal causal message and all input diagnostic messages, except λ(xi,y), using a ∏ operation.

In short, the BPA can be summarized as: each BM in a hierarchical system (like the one shown in Fig. 4.3c) first updates its internal belief using the incoming belief messages from its parent and children, and later, sends the updated version of its belief back to its parent and children.
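For concreteness, the update implied by Eqs. 4.1–4.4 and 4.6 can be written in a few lines of NumPy. The sketch below is only an illustration of the message flow (names such as bm_update and lam_in are ours, and the normalization constant α is realized as an explicit sum-normalization); it is not the hardware mapping discussed later.

```python
import numpy as np

def bm_update(lam_in, pi_parent, M):
    """One Belief Propagation update for BM-y (Eqs. 4.1-4.4 and 4.6).

    lam_in    : (n_c, n_CBy) diagnostic messages lambda(x_i, y) from the children
    pi_parent : (n_CBz,)     causal message pi(z, y) from the parent
    M         : (n_CBz, n_CBy) conditional probability table (CPT)
    """
    lam_y = np.prod(lam_in, axis=0)          # Eq. 4.1: combine child messages (POV)
    lam_yz = M @ lam_y                       # Eq. 4.2: matrix x column-vector (MCV)
    pi_y = pi_parent @ M                     # Eq. 4.3: row-vector x matrix (RVM)
    B_y = pi_y * lam_y                       # Eq. 4.4: element-wise product
    B_y /= B_y.sum()                         # sum-normalization (the alpha constant)
    pi_out = np.empty_like(lam_in)           # Eq. 4.6: one outgoing message per child
    for i in range(lam_in.shape[0]):
        others = np.prod(np.delete(lam_in, i, axis=0), axis=0)
        pi_out[i] = pi_y * others
        pi_out[i] /= pi_out[i].sum()         # sum-normalization
    return lam_yz, B_y, pi_out

# toy sizes for illustration (the chapter assumes n_c = 4 and 4K-entry code books)
rng = np.random.default_rng(0)
lam_in = rng.random((4, 8)); pi_parent = rng.random(16); M = rng.random((16, 8))
lam_yz, B_y, pi_out = bm_update(lam_in, pi_parent, M)
```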

4.4 Hardware Architectures for Bayesian Memory

4.4.1 Definition of Hardware Architectures for BM

The hardware architectures investigated in this chapter are categorized in Fig. 4.4. (The categorization and hardware definitions presented here are partly based on the hardware configurations proposed in [3, 6].) There are two major groups of architectures: digital and mixed-signal. The digital architectures are categorized into two groups: digital CMOS and digital CMOL. Each is further categorized into three basic “computational” groups: Fixed-Point (FXP), Logarithmic Number System (LNS), and Floating-Point (FLP) architectures. FXP architectures are further categorized based on the data precision, i.e. 4 bits, 8 bits, 12 bits, …, 32 bits, and are referred to as FXP4, FXP8, FXP12, …, FXP32 respectively. The mixed-signal architectures are categorized into two groups: mixed-signal CMOS and mixed-signal CMOL architectures.
Fig. 4.4

Categorization of hardware architectures for Bayesian Memory

All of the architectures studied here implement Eqs. 4.1–4.4 and 4.6 of the BPA. These equations can be categorized based on the type of matrix/vector operations: "Vector-Matrix Multiplication" (VMM) [37], Product Of Vectors (POV), and Sum-Normalization (SNL). VMM has two subtypes: (Matrix × Column-Vector) (MCV) and (Row-Vector × Matrix) (RVM). Finally, Eqs. 4.1–4.4 and 4.6 correspond to the operations POV, MCV, RVM, (POV & SNL), and (POV & SNL) respectively. To implement these matrix/vector operations in hardware requires several different circuit components. Hence, the exact definition of any architecture depends on the particular circuit/hardware components that it uses to implement the equations of the BPA.

Table 4.1 summarizes the definitions of the (limited) architecture space explored here; it provides an overview of how the arithmetic operations, storage, and communication for each equation are implemented for those architectures. The digital CMOS and digital CMOL architectures are almost identical, except for the memory components; digital CMOS architectures use SRAM, while digital CMOL architectures use CMOL memory [3].
Table 4.1

Definition of hardware architectures for Bayesian memory (BM)

Component | Storage (S) | Arithmetic Operations (O) | Communication (C)
Memory (MEM) – SRAM | Eqs. 4.1, 4.2, 4.3, 4.4, 4.6: ♣♥ | – | –
Memory (MEM) – CMOL MEM | Eqs. 4.1, 4.2, 4.3, 4.4, 4.6: ♦♠ | – | –
Conventional arithmetic/logic components – Digital CMOS | – | Eqs. 4.1, 4.2, 4.3, 4.4, 4.6: ♣♦♥♠ | –
Conventional arithmetic/logic components – Mixed-signal CMOS | – | Eqs. 4.2, 4.3: ♥♠ | –
Structure based components (storage & computation) – Mixed-signal CMOS | Eqs. 4.2, 4.3: ♥ | Eqs. 4.2, 4.3: ♥ | –
Structure based components (storage & computation) – Mixed-signal CMOL nanogrid | Eqs. 4.2, 4.3: ♠ | Eqs. 4.2, 4.3: ♠ | –
MEM access based communication – SRAM | – | – | Eqs. 4.1, 4.2, 4.3, 4.4, 4.6: ♣♥
MEM access based communication – CMOL MEM | – | – | Eqs. 4.1, 4.2, 4.3, 4.4, 4.6: ♦♠

♣ = Digital CMOS architecture; ♦ = Digital CMOL architecture;

♥ = Mixed-signal CMOS architecture; ♠ = Mixed-signal CMOL architecture

S = storage; O = arithmetic Operations; C = communication

The mixed-signal architectures are based on the “Internally Analog, Externally Digital” (IAED) Structure for VMM (SVMM), proposed in [37]. This is a mixed-signal array based SVMM, which combines storage, signal-communication, and computation; storage and input/output are digital, and the analog computations occur along the wires as small increments in charge/current [3, 37]. The difference between the digital and mixed-signal architectures is that Eqs. 4.2 and 4.3 are partially implemented using mixed-signal SVMM or structure based components, because these equations are computationally the most intensive; the implementation of the remaining equations is the same as in digital architectures. The mixed-signal architectures also require conventional digital CMOS arithmetic/logic components and conventional mixed-signal CMOS components for post processing in 4.2 and 4.3.

When selecting a mixed-signal architecture, the single most important criterion was to select a mixed-signal CMOS design which could later be easily mapped to CMOL nanogrid structures; the SVMM proposed in [37] was such a match. In addition, it retains the precision and convenience of digital input/output, and it is close to the future realm of "digitally assisted analog circuits" [38]. Consequently, our mixed-signal architecture judiciously replaces some of the important operations with analog/mixed-signal components, according to the CMOS/CMOL design considerations suggested in [3].

The storage in all architectures is digital/binary-encoded. Consequently, it is easier to implement a memory access (read & write) based communication for all architectures.

4.4.2 General Issues

The general issues and assumptions related to our analysis of the hardware implementation of the BM are now briefly discussed.

Precision/Bits

The data precision (nbit = bits representing a single element) required to implement any computational model depends on the algorithm, the structural configurations, the application domain, etc. Studies [39, 40] suggest that neuromorphic hardware generally requires a precision of 4–8 bits, but Bayesian methods, which are higher level, more abstract models, generally require 8–16 bits for sufficient accuracy when applied to real-world problems [21, 41]. In this work we have varied the precision over a range of values to understand its contribution to the overall performance/price trade-off. The precisions used here were: 4–32 bit FXP, 32 bit FLP, and 32 bit LNS. All architectures, even mixed-signal, store/communicate data in digital/binary-encoded format.

Communication

Communication is generally sequential and often involves memory access. Internal communication (for internal variables) is as simple as accessing an internal memory. External communication (between parent/child) requires inter-module buses with data, address, and control signals. Depending on the particular variable being read by BM-y, some of the communication is assumed to be virtualized. To evaluate 4.1, BM-y needs to read λ(xi,y) from each of its child BM-xi; for this variable, one communication bus is multiplexed across all nc child BMs. To evaluate 4.3, BM-y needs to read π(z,y) from its parent BM-z; for this variable a separate communication bus is assumed. It should be noted that λ(xi,y) is required to evaluate both 4.1 and 4.6; also, 4.6 is evaluated nc times. Hence, each λ(xi,y) value is used in multiple equations. So instead of reading from the child BM-xi multiple times through the virtualized communication bus, we assume that it is read once to evaluate 4.1, and a local copy is stored in BM-y that can be used later for 4.6. In fact, we assume this for all variables that are read from parent or child BMs.

Number of Parent and Child BMs, and Code Book (CB) Size

The BPA used in the BM is for singly connected polytree structures, which means that each BM has a single parent BM and multiple child BMs. Polytrees allow exact inference to be performed in one iteration of messages [36]. We assume four child BMs (nc = 4) as input to each BM, which will allow the receptive field sizes to increase by 4× in each layer of a typical hierarchy [29]. For the analysis done here, we assume that Part-A of any BM can learn a maximum of 4,096 (or 4K) CB vectors. Hence, the CB size (nCBy) of BM-y and the CB size (nCBz) of parent BM-z are both assumed to be 4K. These CB sizes determine the size of all the variables in Eqs. 4.1–4.6. Table 4.2 summarizes the specifications (variables and their sizes) of a BM; the hardware analysis is based on these specifications. From Table 4.2 it can be seen that the largest storage requirement is for the M matrix. Depending on the particular application, M could be sparse, but here we assume the worst case of a full-density M.
Table 4.2

Specifications of a Bayesian memory (BM) module

Variable | Size
nbit | 4–32 bit FXP, 32 bit FLP, and 32 bit LNS
nc | 4
nCBy | 4,096
nCBz | 4,096
λ(xi,y) | [4,096 × 1]
λ(y) | [4,096 × 1]
λ(y,z) | [4,096 × 1]
π(y,xi) | [1 × 4,096]
π(z,y) | [1 × 4,096]
π(y) | [1 × 4,096]
B(y) | [1 × 4,096]
M | [4,096 × 4,096]
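As a rough, illustrative tally of these sizes (a sketch under the Table 4.2 assumptions; the exact set of locally buffered vectors is our simplification), the storage is dominated by M:

```python
# Storage implied by the BM specification in Table 4.2 (illustrative sketch only).
n_cby = n_cbz = 4096    # code book sizes
n_c = 4                 # number of child BMs
n_bit = 32              # worst-case precision considered in the chapter

# nc local copies of lambda(x_i,y), nc outputs pi(y,x_i), plus lambda(y), lambda(y,z),
# pi(z,y), pi(y), and B(y) -- our simplified accounting of the vector variables
vector_elems = (2 * n_c + 5) * n_cby
matrix_elems = n_cby * n_cbz                       # the CPT M
total_bits = (vector_elems + matrix_elems) * n_bit
print(f"M alone: {matrix_elems * n_bit / 8 / 2**20:.0f} MiB "
      f"of {total_bits / 8 / 2**20:.1f} MiB total")   # ~64 MiB of ~64.2 MiB: M dominates
```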

Virtualization

In general, we have the following two ways to virtualize the arithmetic/logic components: virtualize (share resources) these components for similar operations over several BMs, or virtualize different operations/equations within each BM. We chose the latter because it keeps the hardware analysis restricted to one BM, and is more suitable for systems with distributed memory. Most of the designs considered here tend to be more virtualized, which provides a more favorable performance/price. However, as we shall see later, the cost of most BM architectures (except the mixed-signal CMOL architecture) is dominated by memory. Consequently, extreme virtualization of the arithmetic/logic computations may not be a significant advantage; but for the mixed-signal CMOL architecture, virtualization was the only way to make the architecture feasible and scalable.

Hybrid Nanotechnology – CMOL

The nanotechnology components used in this chapter are based on CMOL, which is a hybrid CMOS + nanogrid architecture developed by Likharev [11]. CMOL is mostly used here as digital memory and as an application specific, mixed-signal computational structure for the VMM operation. We use the CMOL/nanogrid modeling described in [3], and follow a similar strategy for the performance/price analysis of the CMOL components. Some of the assumptions for the CMOL analysis are:

–For CMOL memory, read and write can be performed using a common interface/path (row and column decoders, etc.). Only for the variables communicated between parent/child BMs, will we need a simultaneous read and write capability. According to [42], all these assumptions are plausible.

–The area analysis does not consider the CMOS to nanogrid routing (similar to [3]), i.e., the CMOS layer and nanogrid layer are analyzed separately. The total area is taken as the sum of CMOS and nanogrid layers; which is a worst case consideration.

–Our analysis uses a future value of Fnano = 3 nm (projected in 2028 [43]). CMOL technology should be mature by then, hence plausible defect rates would be around 2% to 5% [42]; and a recent roadmap suggests defect rates as low as 0.03% [43]. So in the worst case, the hardware overhead for defect-tolerance through redundancy/reconfiguration would be around 5% to 10% of the results shown in this chapter.

4.5 Digital CMOS and CMOL Hardware Architectures for Bayesian Memory (BM)

The all-digital architectures considered here assume traditional CMOS arithmetic/logic components. The only difference between the digital CMOS and digital CMOL architectures is the storage medium: the digital CMOS architectures use SRAM, while the digital CMOL architectures use CMOL memory [3].

In our analysis, we assumed a custom [4K × 4K × 32 bit] CMOL memory, instead of using the results from [42]. In the BM application assumed here, all matrix/vector elements are accessed sequentially. Hence, our memory does not require random access decoders. So we were able to use shift register based decoders for sequential access [44], which reduces the total silicon area for the memory. This also decreases memory access time due to the decreased logic-depth of the decoder circuitry. Moreover, defect-tolerance can be achieved simply through reconfiguration to redundant/new nanogrid locations, not requiring special error-correction circuitry. A block diagram of the CMOL memory is shown in Fig. 4.5b. Each block consists of a [256 × 256] CMOL nanogrid, along with the necessary decoders, as shown in Fig. 4.5a. All row decoders connect to horizontal nanowires; all column decoders are assumed to connect to pass transistors/gates. We analyzed the performance/price of this particular CMOL memory, generalizing its performance/price to a single bit, and used this result to evaluate memories of varying sizes. The performance/price analysis assumes that 10% of the total memory dissipates power during memory access [45]; this is a worst case consideration, because ideally, only one block (out of 8,192) would be drawing current.
Fig. 4.5

CMOL memory: (a) One block of CMOL memory with a [256 × 256] nanogrid and decoders, (b) general block diagram of CMOL memory [4K × 4K × 32 Bits]
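For orientation, the block count in Fig. 4.5 follows from simple tiling arithmetic (a sketch assuming one stored bit per nanogrid crosspoint):

```python
# Tiling the [4K x 4K x 32 bit] CMOL memory from [256 x 256] nanogrid blocks (Fig. 4.5).
total_bits = 4096 * 4096 * 32      # capacity of the custom BM memory
bits_per_block = 256 * 256         # one bit per crosspoint nanodevice (assumption)
n_blocks = total_bits // bits_per_block
print(n_blocks)                    # 8192 blocks, matching the "(out of 8,192)" figure above
active_fraction = 0.10             # worst case: 10% of the memory dissipates power per access
```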

4.5.1 Floating-Point (FLP) Architecture

The FLP architecture uses a 32 bit single precision floating-point data format [46]. The block diagram of the FLP architecture is shown in Fig. 4.6. Blocks B1 to B5 perform the operations for Eqs. 4.1–4.4 and 4.6, respectively. Block B6 performs the Sum-Normalization (SNL) operation for 4.4. A similar additional block (not shown) is also needed for the SNL operation in Eq. 4.6. Digital counters (CT) and logical Address Transformations (AT) are used to generate the proper addresses for the SRAMs corresponding to all variables. SRAM is used to store the M matrix and has a dual-ported read capability. We require only two Floating-Point arithmetic Units (FPUs), which are shared by (virtualized over) all blocks in Fig. 4.6. Block B5 is virtualized for computing variables π(y,x1) to π(y,xnc) (i.e. 4.6 is computed nc times, corresponding to each child BM). Table 4.3 shows the arithmetic and memory computations required for each equation in the BM. It also shows the FPU used by each equation.
Fig. 4.6

Block diagram of digital floating-point (FLP) architecture

Table 4.3

Computations for a Bayesian memory (BM) module

Equation | Matrix/vector operation | FLP multiplication | FLP addition | FLP division | Memory read | Memory write | FPU used
4.1 | POV | nc·nCBy | 0 | – | nc·nCBy | nCBy | FPU1
4.2 | MCV | nCBz·nCBy | nCBz·nCBy | 0 | nCBz·nCBy | nCBz | FPU1
4.3 | RVM | nCBy·nCBz | nCBy·nCBz | 0 | nCBy·nCBz | nCBy | FPU2
4.4 | POV | 2·nCBy | 0 | 0 | 2·nCBy | nCBy | FPU1
4.4 | SNL | nCBy | nCBy | 1 | 2·nCBy | nCBy | FPU1
4.6 | POVa | nc·(nc·nCBy) | 0 | 0 | nc·(nc·nCBy) | nc·(nCBy) | FPU2
4.6 | SNLa | nc·(nCBy) | nc·(nCBy) | nc·(1) | nc·(2·nCBy) | nc·(nCBy) | FPU2

aComputations shown are the total for all nc child BMs

Even though Table 4.3 specifically shows the FLP operations, it still provides a general overview of the computations within the BM. From Table 4.3, we can infer that 4.2 and 4.3 have a computational complexity of O(N1N2), where N1 ∝ nCBy and N2 ∝ nCBz; while 4.1, 4.4 and 4.6 have a computational complexity of O(N), where N ∝ nCBy and nc is negligible relative to nCBy.
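The entries of Table 4.3 follow directly from nc, nCBy, and nCBz; the short sketch below (an illustration, not the authors' analysis scripts) tabulates them for the sizes in Table 4.2 and confirms that the two VMM equations dominate the operation count.

```python
n_c, n_cby, n_cbz = 4, 4096, 4096

# (multiplications, additions, divisions, memory reads, memory writes) per Table 4.3
counts = {
    "4.1 POV": (n_c * n_cby,       0,             0,    n_c * n_cby,       n_cby),
    "4.2 MCV": (n_cbz * n_cby,     n_cbz * n_cby, 0,    n_cbz * n_cby,     n_cbz),
    "4.3 RVM": (n_cby * n_cbz,     n_cby * n_cbz, 0,    n_cby * n_cbz,     n_cby),
    "4.4 POV": (2 * n_cby,         0,             0,    2 * n_cby,         n_cby),
    "4.4 SNL": (n_cby,             n_cby,         1,    2 * n_cby,         n_cby),
    "4.6 POV": (n_c * n_c * n_cby, 0,             0,    n_c * n_c * n_cby, n_c * n_cby),
    "4.6 SNL": (n_c * n_cby,       n_c * n_cby,   n_c,  n_c * 2 * n_cby,   n_c * n_cby),
}
total_mults = sum(c[0] for c in counts.values())
print(f"total multiplications per BM update: {total_mults:,}")
# dominated by the two O(n_cby * n_cbz) VMM equations, 4.2 and 4.3
```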

4.5.2 Logarithmic Number System (LNS) Architecture

The LNS architecture is based on a 32 bit single precision LNS data format [46]. The LNS architecture is similar to that shown in Fig. 4.6, except that all FPUs are now replaced by Logarithmic Number system arithmetic Units (LNUs). The earlier discussion on FLP architecture also applies here. Table 4.3 can be used by replacing the FLP operations with LNS operations. Both the FLP and LNS architectures use a 32 bit data format; hence, the datapath structure remains the same, only the requirements with respect to arithmetic operations differ.

4.5.3 Fixed-Point (FXP) Architecture

The FXP architecture also has blocks B1 to B5, which are similar to those in Fig. 4.6. The FXP architecture does not require block B6. In the FXP architecture, each block in Fig. 4.6 is replaced with two blocks (the a & b type blocks, as shown in Fig. 4.7). For example, block B1 in Fig. 4.6 has to be replaced with block B1a and block B1b shown in Fig. 4.7. In the FLP architecture, the FPU in block B1 was able to perform both the add and multiply operations required for 4.2. But in the FXP architecture (as shown in block B1a) we need a separate digital adder and digital multiplier (of particular bit sizes) for 4.2. Using the same idea, each block in Fig. 4.6 has to be replaced by an equivalent block (with separate adders/multipliers) to form the FXP architecture in Fig. 4.7.
Fig. 4.7

Block diagram of digital fixed-point (FXP) architecture

The FXP architecture has to consider the bit size or precision of the intermediate results of all arithmetic operations. For example, multiplying two n bit numbers will lead to a 2n bit result. When m numbers, each of size n bits, are added, the result could grow to n + ceil(log2(m)) bits. If these results are simply truncated back to n bits, then error/noise will enter the system [47]. In order to maintain accuracy with the FXP architecture (to compare to the FLP and LNS architectures), and for the purposes of the worst case hardware analysis being done here, no intermediate results are truncated. Hence, to maintain intermediate results, we require digital components of different bit sizes, depending on the number and type of the operation. To maintain accuracy and reasonable functional equivalence to the FLP version, the FXP implementations should utilize the full dynamic range of the data [39] as much as possible. One way to accomplish this is to use range normalization. This is a conservative assumption, since it is likely that real implementations can get by with significant precision reduction even with fixed-point representations. However, to maintain reasonable functional equivalency, we have added such normalization to the FXP implementations. Range normalization typically has a hardware/operation overhead consisting of a digital adder/comparator, a multiplier and a divider, as shown in Fig. 4.7 (b type blocks).

In Fig. 4.7, Block B1a computes an intermediate version λtm(y,z), which is then range normalized by block B1b to obtain λ(y,z). Similarly, blocks B2b to B5b, which are attached to the corresponding blocks B2a to B5a, perform range normalization. In general, we cannot virtualize a single range normalization block across all other blocks, because the sizes of the digital components in the range normalization blocks depend on the size of the intermediate results. In blocks B1b to B5b, cl, ch, dh denote the current minimum, current maximum, and desired maximum respectively. These correspond to the minimum and maximum values/elements in a vector that is to be range normalized. As in the FLP architecture, blocks B5a and B5b are virtualized for computing variables π(y,x1) to π(y,xnc). The overhead for range normalization pushes the FXP architecture analysis towards an extreme worst case; i.e. most other, simpler techniques for rounding/truncation will have a smaller overhead.
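A minimal sketch of the range-normalization step performed by the b-type blocks is shown below. The linear mapping of [cl, ch] onto [0, dh] is our assumption; the chapter only specifies the cl, ch, dh inputs and the adder/comparator, multiplier, and divider overhead.

```python
def range_normalize(vec, dh):
    """Map an integer vector onto [0, dh] (the b-type blocks of Fig. 4.7).

    cl, ch are the current minimum and maximum of the intermediate result;
    dh is the desired maximum after normalization.
    """
    cl, ch = min(vec), max(vec)
    if ch == cl:                     # degenerate case: a flat vector
        return [dh for _ in vec]
    # one subtract/compare, one multiply, and one (integer) divide per element,
    # matching the adder/comparator, multiplier, and divider overhead described above
    return [((v - cl) * dh) // (ch - cl) for v in vec]

print(range_normalize([3, 70, 1000, 512], dh=255))   # e.g. an 8 bit desired range
```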

4.6 Mixed-Signal (MS) CMOS and CMOL Hardware Architectures for Bayesian Memory (BM)

Consider the Vector Matrix Multiplication (VMM) operation given by 4.7. Its computational complexity is O(pq). Computing one element (i.e. Yj) of the output vector Y roughly requires p (add & multiply) operations, as shown in 4.8. To compute all elements of Y, we need to repeat this process q times; hence the total time required is q⋅p⋅(tadd + tmult) when implemented sequentially. To reduce this time, many of the operations in 4.7 can be performed in parallel. The following discussion investigates the use of hardware components for the parallel VMM operation of 4.2 and 4.3 in the BM.
$$ Y_{[1 \times q]} = X_{[1 \times p]} \times M_{[p \times q]} $$
(4.7)
$$ Y_j = \sum_{i=1}^{p} X_i M_{i,j} $$
(4.8)
$$ Y_1 = X_1 M_{1,1} + X_2 M_{2,1} + X_3 M_{3,1} = [011 \times 100] + [011 \times 111] + [101 \times 110] $$
(4.9)

Cauwenberghs [37] proposed an “Internally Analog, Externally Digital” (IAED) mixed-signal array based structure for parallel VMM, for massive (100–10,000) matrix dimensions. The IAED-Structure for VMM (SVMM) effectively combines storage and analog computation, and is better than other analog-only techniques for the VMM computation [37]. Internally, the storage is digital. The array structure allows inherent analog processing [37]. Externally, it provides the convenience and precision of pre/post digital processing [37]. Hence, we use the SVMM as the fundamental operation implemented in our mixed-signal architectures. Our mixed-signal CMOS architecture partly virtualizes SVMM, and our mixed-signal CMOL architecture replaces some of the traditional components of the SVMM with nano components, providing a novel, mixed-signal CMOL nanogrid based SVMM.

In the traditional SVMM [37], all elements of Y are computed in parallel. Each element is internally computed in a semi-parallel fashion, requiring only a few iterations through the SVMM. The mixed-signal CMOS SVMM is shown in Fig. 4.8b. The number of iterations required is nbitx, which denotes the number of bits/precision used to represent each element Xi of vector X. Let tSVMM denote the time for a single pass through the SVMM. Then the total time required to complete the VMM operation is nbitx⋅tSVMM, which is much less than q⋅p⋅(tadd + tmult) (for sequential operations), because nbitx << q and tSVMM < p⋅(tadd + tmult). Since all the elements of vector Y are being computed in parallel, multiple ADCs and other arithmetic/logic components are required. For a very large M (such as in our case where M is [4K × 4K]), we would require thousands of ADCs and adders, which would be costly in terms of hardware area and power, compromising feasibility and significantly limiting scalability. For this reason, we virtualized the ADCs and arithmetic/logic components over all the elements of Y; so now each Yj is computed sequentially. In this case, the total time required to complete the VMM operation is q⋅nbitx⋅tSVMM. For a virtualized SVMM, we only achieve better performance than the sequential implementation if nbitx⋅tSVMM < p⋅(tadd + tmult); our hardware analysis will address this issue. Such a method of computing each element of Y separately fits well with our BM model, because communication and memory access are sequential in the BM.
Fig. 4.8

Mixed-signal computations and structures: (a) Example of a VMM computation decomposed into sub-operations, (b) mixed-signal CMOS SVMM, and (c) Mixed-signal CMOL nanogrid SVMM
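The break-even condition for the virtualized SVMM can be checked with a few lines; the component latencies below are placeholders chosen for illustration, not the values of Table 4.5.

```python
def vmm_times(p, q, n_bitx, t_add, t_mult, t_svmm):
    """Compare sequential digital VMM time with the virtualized SVMM time."""
    t_sequential = q * p * (t_add + t_mult)   # q output elements, p multiply-accumulates each
    t_virtualized = q * n_bitx * t_svmm       # q output elements, n_bitx passes through the array each
    return t_sequential, t_virtualized

# placeholder latencies for illustration only (not taken from Table 4.5)
t_seq, t_virt = vmm_times(p=4096, q=4096, n_bitx=8,
                          t_add=0.5e-9, t_mult=0.5e-9, t_svmm=200e-9)
print(f"sequential: {t_seq*1e3:.2f} ms, virtualized SVMM: {t_virt*1e3:.2f} ms")
# the virtualized SVMM wins whenever n_bitx * t_svmm < p * (t_add + t_mult)
```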

The formal equations behind the SVMM operation are presented in more detail in [37], and so are not repeated here. Instead, the operation of these "semi-virtualized" mixed-signal CMOS and CMOL SVMM circuits is presented via a simple example.

4.6.1 Mixed-Signal CMOS Architecture

Consider a VMM operation where q = p = 3 and nbitx = 3. The computation required for the first element Y1 is given in 4.9, along with some example numbers for Xi and Mi,1. Each number is written in binary (the leftmost bit is the MSB and the rightmost bit is the LSB), for example X1 = 011 and M3,1 = 110. Figure 4.8a shows a decomposition of the VMM operation into its constituent sub-operations. To illustrate the SVMM function, we first concentrate on step-1 (a, b, and c). In Fig. 4.8a, step-1a shows the partial products resulting from the multiplication of the first bit (LSB) of each Xi with the corresponding bits of each Mi,1. The equivalent operation in the mixed-signal CMOS SVMM is shown in Fig. 4.8b. The first bit of each Xi, i.e. Xi^0 (denoted by Xi^b, where i = element index, b = bit index, and b = 0 to nbitx − 1), is presented from the left along the horizontal wires. Each cell (denoted by Mi,1^b) in the analog cell array in Fig. 4.8b is a "CID computational cell with integrated DRAM" [37]. The analog cell array corresponds to the first column of M in 4.7. Each cell can store one binary value and can compute one binary multiplication. In the compute mode, it contributes some charge/voltage to the vertical wire if its internal (stored) value is 1 and the external input is 1. For example, the cell M1,1^0 has internal value 1, and its external input X1^0 is also 1; hence it contributes some charge/voltage (shown by a vertical arrow) to the first vertical wire. Those cells that have an internal value of 0 do not contribute charge/voltage to the vertical wires. The analog cell array in Fig. 4.8b performs the equivalent of step-1a and the initial part of step-1b in Fig. 4.8a. The total charge/voltage (shown by multiple arrows) on each vertical wire is converted to digital outputs by the ADCs, and these outputs correspond to the partial sums shown in step-1b. The Shift Registers (SRs) and adders following the ADCs then complete step-1b. Step-1c is accomplished by the remaining SR and adder. This concludes step-1, or a single pass through the SVMM. In the next iteration we present the second bit of each Xi, i.e. Xi^1, on the horizontal wires of the SVMM, to accomplish step-2 (a, b, and c). Step-3 is completed in a similar manner. In summary, to obtain (one element) Yj we have to iterate nbitx times through the SVMM.
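The bit-serial decomposition of Fig. 4.8a can be traced in software. The sketch below reproduces the Y1 example of Eq. 4.9 (X = [011, 011, 101] and the first column of M = [100, 111, 110]), processing one bit of X per SVMM pass; it is a functional trace, not a circuit model.

```python
# Bit-serial SVMM accumulation for one output element (the Y1 example of Eq. 4.9 / Fig. 4.8a).
X = [0b011, 0b011, 0b101]        # X1, X2, X3 (decimal 3, 3, 5)
M_col1 = [0b100, 0b111, 0b110]   # M11, M21, M31 (decimal 4, 7, 6)
n_bitx = n_bitm = 3

y1 = 0
for b in range(n_bitx):                                   # one SVMM pass per bit of X (steps 1-3)
    x_bits = [(x >> b) & 1 for x in X]                     # bit b of every X_i, presented on the rows
    # each vertical wire j accumulates charge where the X bit AND the stored M bit are both 1
    column_sums = [sum(x_bits[i] * ((M_col1[i] >> j) & 1) for i in range(len(X)))
                   for j in range(n_bitm)]
    # shift registers and adders weight the column outputs by their bit position (rest of step "b")
    pass_result = sum(s << j for j, s in enumerate(column_sums))
    y1 += pass_result << b                                 # step "c": weight by the X bit position
print(y1)   # 63 = 3*4 + 3*7 + 5*6
```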

For computing the remaining elements of vector Y, we require similar steps; but for each Yj we require a distinct analog cell array that corresponds to the jth column of M in 4.7. The input X remains the same, and is still presented bit-serially as described earlier. The ADCs and arithmetic/logic units are virtualized across all elements of Y.

In our hardware analysis, we assume a precision of nbitx = 8 for both X and M. This makes this architecture comparable to the digital FXP8 architecture. (The block diagram of mixed-signal CMOS/CMOL will be similar to Fig. 4.7, but blocks B1a and B2a are replaced with equivalent SVMM – Fig. 4.8b/c.) The required ADC resolution is 12 bits, corresponding to the actual size of M (specified in Table 4.2). We virtualize all the addition operations in Fig. 4.8b by using a single physical adder.

4.6.2 Mixed-Signal CMOL Architecture

The mixed-signal CMOL nanogrid SVMM is shown in Fig. 4.8c. This is similar to the mixed-signal CMOS SVMM shown in Fig. 4.8b, except that the CMOS analog cell array is replaced with a CMOL nanogrid, and each cell is replaced with an equivalent nanodevice, which functions as a binary switch. If the nanodevice is ON, and the input Xi^b on that particular horizontal nanowire is 1, then an "on" current will flow in the corresponding vertical nanowire (shown by a vertical arrow). Each vertical nanowire accumulates some units of "on" current as shown in Fig. 4.8c. The ADC then converts the current (or equivalent voltage) to digital. The specific functionality is the same as discussed earlier for the mixed-signal CMOS SVMM and is not repeated here.

The storage for X requires special consideration; elements of X are written by sequential memory access, but we need to read one bit from each Xi in parallel. We do not go into details of how that can be done, but instead use a worst case consideration that storage for X has extra routing/circuit overhead of around 50% for this increased functionality.

4.7 Performance/Price Analysis and Results

4.7.1 Performance/Price Analysis

The circuit components used by the various architectures analyzed here are summarized in Table 4.4. For each architecture, a basic dataflow block diagram (Figs. 4.6 and 4.7) is generated from these components. The performance (speed) and price (area and power) are derived using the computational requirements of the BM (Table 4.3), and the performance/price measures corresponding to the circuit components that accomplish these computations. The detailed equations for deriving the performance/price of all architectures are given in the Appendix (Sections 4.9.1 to 4.9.4).
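In outline, the derivation multiplies the operation counts of Table 4.3 by per-component latencies and adds up memory and arithmetic/logic areas; the schematic sketch below uses placeholder component figures (not the Table 4.5 values) purely to show the shape of the calculation, whose exact form is given in the Appendix.

```python
def bm_performance_price(op_counts, t_op, mem_bits, area_per_bit, area_arl, max_chip_mm2=858):
    """Roll operation counts and component measures into update time, area, and TPM.

    op_counts : dict of operation -> count per BM update (e.g. from Table 4.3)
    t_op      : dict of operation -> seconds per operation for the chosen components
    """
    t_update = sum(op_counts[k] * t_op[k] for k in op_counts)   # fully serialized worst case
    area_bm = mem_bits * area_per_bit + area_arl                # memory dominates most designs
    bms_per_chip = int(max_chip_mm2 // area_bm)
    tpm = bms_per_chip / t_update             # module updates per second per max-chip
    return t_update, area_bm, tpm

# placeholder figures for illustration only (not the Table 4.5 values)
t, a, tpm = bm_performance_price(
    op_counts={"mult": 33.6e6, "add": 33.6e6, "mem_access": 67e6},
    t_op={"mult": 1e-9, "add": 1e-9, "mem_access": 2e-9},
    mem_bits=4096 * 4096 * 32, area_per_bit=1.8e-9, area_arl=0.005)
print(f"update time {t*1e3:.0f} ms, area {a:.2f} mm^2, TPM {tpm:.0f}")
```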
Table 4.4

Major circuit components used in implementing a Bayesian memory (BM) module

Component | Digital CMOS (FXP / FLP / LNS) | Digital CMOL (FXP / FLP / LNS) | Mixed-signal CMOS | Mixed-signal CMOL
Digital Adder | Y YR / – / – | Y YR / – / – | Y YR | Y YR
Digital Multiplier | Y YR / – / – | Y YR / – / – | Y YR | YR
Digital Divider | YR / – / – | YR / – / – | YR | YR
Floating-Point Unita | – / Y / – | – / Y / – | – | –
Log. Num. Sys. Unita | – / – / Y | – / – / Y | – | –
ADC | – | – | Y | Y
Memory – SRAM | Y / Y / Y | – | Y | –
Memory – CMOL | – | Y / Y / Y | – | Y
Analog CID/DRAM Cell | – | – | Y | –
CMOL nanogrid | – | – | – | Y

aFPU and LNU internally comprise traditional digital arithmetic/logic and memory components

Y = component utilized for major operations

YR = component utilized for range normalization operations

The performance/price values for traditional circuit components are adapted from Table 4.3 in [3]. The performance/price measures for the remaining components were obtained from the literature: FPU [48], LNU [49], ADC [50], divider [51], and analog CID/DRAM cell [37]. All digital components are scaled to a hypothetical 22 nm technology (as done in [3]), using first-order constant field scaling rules [45]. To make the comparisons as realistic as possible, we chose 22 nm as a likely process technology for when nanogrids become commercially available. Design of analog circuits below 90 nm is challenging [52], hence we conservatively scale analog circuits to 90 nm. The performance/price for CMOL components is based on the nanogrid analysis presented in [3]; the CMOL is scaled to Fnano = 3 nm (with the underlying CMOS at 22 nm). Performance/price measures of the various components are summarized in Table 4.5.
Table 4.5

Performance/price measures for various circuit components

Component | Area (= A) (mm2) | Power (W) | Time (s)
SRAM | 6 × 10−7 · Nb / 2.85 | α · 0.64 · A, where α = 0.1 | 0.025 × 10−9
CMOL MEM | 1.803 × 10−9 · Nb | 8.974 × 10−12 · Nb | 1.72 × 10−9
Dig. Adder | 1.2 × 10−2 · at² · N·log2(N) / (32·log2(32)) | 0.04 · A, where at = 22/180 | 0.75 × 10−9 · at · log2(N) / log2(32)
Dig. Multiplier | 1.2 × 10−4 · N·(M+1) / (16·(16+1)) | 1.7 × 10−5 · N·(M+1) / (16·(16+1)) | 0.2 × 10−9 · (N+M) / (16+16)
Dig. Divider | 309 · at² · N·log2(N) / (55·log2(55)) | 0.04 · A, where at = 22/1200 | 160 × 10−9 · at · N / 55
Analog CID/DRAM Cell | 3.24 × 10−5 · at² | 5 × 10−8 · at², where at = 90/500 | 1 × 10−5 · at
ADC – 12b | 17.22 · at | 0.033 · at, where at = 90/1200 | 2 × 10−7 · at
FPU | 0.258 · at | 1.032 × 10−2 · at, where at = 22/350 | adder: 3 · at · 7.692 × 10−9; multiplier: 3 · at · 7.692 × 10−9; divider: 15 · at · 7.692 × 10−9
LNU | 16.6 · at | 0.355 · at, where at = 22/1200 | adder: at · 2.631 × 10−8; multiplier: at · 2.631 × 10−8; divider: at · 2.631 × 10−8

N = number of bits in operand-1;

M = number of bits in operand-2, and N ≥ M;

Nb = total number of bits in memory;

at denotes the technology scaling factor

4.7.2 Performance/Price Results and Discussion

The hardware analysis results for the digital architectures are summarized in Figs. 4.9 to 4.13, and for the mixed-signal architectures in Table 4.6.
Fig. 4.9

Area occupied by single BM: digital architectures

Table 4.6

Comparison of digital and mixed-signal architectures

Architecture | Single BM: Area (mm2) MEMa / ARLb / Total | Single BM: Power (W) MEM / ARL / Total | Single BM: Time (s) | Max-chipc: No. of BMs | Max-chipc: Total power (W) | Throughput Per Max-chipf (TPM) | Normalized TPM
Digital CMOSd | 28.525 / 0.0048 / 28.529 | 1.82562 / 0.00028 / 1.82590 | 0.00790 | 30 | 54.78 | 3797 | 76
Mixed-Signal CMOSe | 282.06 / 1.5546 / 283.62 | 0.46599 / 0.00324 / 0.46923 | 0.05980 | 3 | 1.41 | 50 | 1
Digital CMOLd | 0.2443 / 0.0048 / 0.2491 | 0.00122 / 0.00028 / 0.00149 | 0.03627 | 3443 | 5.13 | 94848 | 1897
Mixed-Signal CMOLe | 0.0121 / 1.5581 / 1.5703 | 0.02127 / 0.00324 / 0.02450 | 0.00727 | 546 | 13.38 | 75103 | 1505

aMEM denotes memory components

bARL denotes arithmetic/logic components

cMax-chip denotes a maximum reticle field chip size of 858 mm2

dPerformance/price corresponds to the digital CMOS/CMOL FXP8 architecture in Figs. 4.9 to 4.13

eMS architectures externally represent data digitally, with 8 bit FXP precision

fThroughput Per Max-chip (TPM) is measured as (modules updated per second) per 858 mm2 area. In this table, the TPM is derived by dividing the "No. of BMs" column by the "Time" column

In Fig. 4.9, we see that the "Memory components" (MEM) dominate the total area of the BM. This is a major advantage, especially when using CMOL memory, which is 100 times denser than CMOS memory (SRAM). For the "Arithmetic/logic components" (ARL) we see that the FLP and LNS architectures consume less area compared to the FXP8–FXP32 architectures, since the FXP architectures are not virtualized to the same extent. In addition, the FXP architectures also have the overhead of several range normalization circuits, which are likewise not virtualized. But because memory dominates the area of a BM processor, this overhead does not add much to the total cost. The ARL in the FLP architecture occupies less area than the ARL in the LNS architecture.

In Fig. 4.10, we again see that MEM dominates the total power consumption of the BM. CMOL MEM consumes 1,000 times less power than CMOS MEM, but this is somewhat misleading, since the power density of the hotspots in CMOL nanogrids could be as high as 200 W cm−2 (the maximum allowed by the ITRS [43]). The SRAM has a much lower power density for obvious reasons. The ARL in the FXP architectures consumes more power than the ARL in the FLP and LNS architectures, for the reasons stated earlier, but again this is not a big issue. The ARL in the FLP architecture consumes less power than the ARL in the LNS architecture.
Fig. 4.10

Power consumed by single BM: digital architectures

Figure 4.11 shows the time required to update (execute the BPA for) a single BM. Compared to the FLP and LNS architectures, the FXP architectures have better performance. We assumed the range normalization overhead to obtain the maximum accuracy in a system with FXP precision (as compared to a system with FLP or LNS precision); this assumption did not affect the performance of the system. The LNS architecture is faster than the FLP architecture, because the specific LNU used here is three times faster than the FPU; in general, depending on the ratio of multiplication to addition operations, LNU based architectures can perform 1.5–2.6 times faster than FPU based architectures [49, 53]. All CMOL architectures are slower than the corresponding CMOS architectures by around 30 ms; this was expected because CMOL nanogrids are slow [3]. For digital architectures in general, timing can be easily reduced by adding more ARL components for semi-parallel functioning, and by using multi-ported memories.
Fig. 4.11

Time to update single BM: digital architectures

Figure 4.12 shows how many BMs we can fit in a maximum size chip of 858 mm2 (i.e. a max-chip), which corresponds to the maximum reticle lithographic field size at 22 nm [3]. Because CMOL architectures are denser than CMOS architectures, we can fit around 100 times more BMs in the same area.
Fig. 4.12

BMs implemented on one max-chip: digital architectures

Figure 4.13 corresponds to Fig. 4.12, and shows the power consumed by all BMs residing in a max-chip; it assumes that all of these are concurrently active. The CMOL chip consumes about one-tenth the power of the CMOS chip, even though the CMOL nanogrids have a higher power density.
Fig. 4.13

Total power consumed by BMs on one max-chip: digital architectures

From Table 4.6, when comparing digital CMOL to Mixed-Signal (MS) CMOL, we were able to achieve the speed-up (i.e. nbitx⋅tSVMM < p⋅(tadd + tmult)) using our virtualized SVMM technique. But the same is not true for the digital CMOS and MS CMOS architectures. The MS CMOS architecture provides the worst performance, at the highest price; only 3 BMs fit on one max-chip, which suggests it is not a scalable architecture. The digital CMOS and MS CMOL architectures have approximately the same performance, but MS CMOL is less expensive, allowing us to fit 18 times more BMs in one max-chip, with one-fourth the total power consumption. Notice that the power density of a max-chip for all architectures is well below the allowed 200 W cm−2 (for 22 nm CMOS/3 nm CMOL, according to the ITRS [43]).

In summary, the MS CMOL architecture, which consists of CMOL memory and a MS nanogrid implementation of the VMM operation, is clearly the most cost-effective of the architecture options examined here, providing the best performance at a reasonable price.

An important criterion for comparing various hardware architectures is the "performance/price ratio", which measures the usefulness/efficiency of silicon area in solving the problem (or set of problems) at hand. For the hardware implementations of computational models, this criterion is defined as a "Module update rate per 858 mm2" or "Throughput Per Max-chip" (TPM) [3, 54]. Table 4.6 shows that the TPM for the digital CMOL architecture is 25 times the TPM of the digital CMOS architecture; the TPM for the MS CMOL architecture is 20 times the TPM for the digital CMOS architecture. The TPM for a PC MATLAB implementation of the BM is 33, and the TPM for a BM implementation on the Cray XD1 (a supercomputer accelerated with FPGAs) is 2,326 (derived from the throughputs reported in [21]). Hence, we see that the TPM of the CMOL based custom architectures is 32–40 times better than that of the Cray XD1 multiprocessor/multi-FPGA system.
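The TPM figures in Table 4.6, and the ratios quoted above, follow directly from the "No. of BMs" and "Time" columns; a short check:

```python
# Throughput Per Max-chip (TPM) = (BMs per 858 mm^2 chip) / (update time), per Table 4.6.
designs = {                      # (BMs per max-chip, update time in s)
    "digital CMOS": (30,   0.00790),
    "MS CMOS":      (3,    0.05980),
    "digital CMOL": (3443, 0.03627),
    "MS CMOL":      (546,  0.00727),
}
tpm = {name: n / t for name, (n, t) in designs.items()}
cray_xd1_tpm = 2326              # derived in the text from the throughputs reported in [21]
for name, value in tpm.items():
    print(f"{name:>12}: TPM = {value:7.0f}  ({value / cray_xd1_tpm:4.1f}x Cray XD1)")
# digital CMOL and MS CMOL come out roughly 40x and 32x the Cray XD1 TPM, as quoted above
```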

The nanogrids in all CMOL architectures studied here were designed to have a worst-case power density of ∼200 W cm−2 (allowed by ITRS [43]) at hotspots; hence, all CMOL performance/price numbers correspond to that density. If the power density budget were increased, performance could be improved; for example, the time to update an MS CMOL architecture based BM reduces to 0.00421 s if the power density budget is doubled. So there is a clear trade-off between power density and performance for CMOL nanogrids; the same was also suggested in [3].

4.7.3 Scaling Estimates for BM Based Cortex-Scale System

This chapter has exclusively focused on the hardware implementations of Part-B of the BM. The hardware assessment methodology (given in Fig. 4.2) has also been used to investigate hardware implementations of Part-A of the BM, which deals with learning/training. A detailed discussion of the hardware architectures for Part-A of the BM (with a simplified algorithm for learning spatial CB vectors) is given in [20], and is not repeated here. The results therein conclude that Part-B of the BM dominates the overall hardware requirements and performance/price of the complete BM (including Part-A and Part-B). The results in [20] also indicate that it is not feasible to map analog/non-linear functions, such as the Euclidean distance and Gaussian functions, onto CMOL-like nanogrids/nanodevices [55]. Consequently, the MS architectures (for Part-A of the BM) have to depend on traditional analog CMOS circuit components to implement such functions [20, 55], and this limits the usefulness of CMOL structures for implementing the operations within Part-A of the BM [2]. This is in contrast to the hardware analysis results for Part-B of the BM, which show that MS CMOL structures are particularly cost-effective.

Table 4.7 presents four prospective hardware designs/architectures for the BM, and the scaling estimates for a BM based (human) Cortex-Scale System (BMCSS). A single BM specified by Table 4.2 emulates approximately 32.8 × 106 synapses, according to the artificial neural network equivalent for the BPA proposed in [56]. Consequently, 4.87 × 106 BMs operating in parallel are required to scale15 up to the 1.6 × 1014 synapses in the BMCSS. Table 4.7 proposes four different designs by selecting the design type for Part-A and Part-B of the BM. We used the two designs (Digital CMOL and MS CMOL) with the best performance/price ratio (≅ TPM) for Part-B of the BM, and the two designs (Digital CMOS and Digital CMOL) with the best performance/price ratio (≅ TPM) for Part-A of the BM.
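The BM count above follows directly from the synapse scaling:
$$ N_{\rm{BM}}=\frac{1.6\times 10^{14}\ \rm{synapses}}{32.8\times 10^{6}\ \rm{synapses/BM}}\approx 4.87\times 10^{6}\ \rm{BMs}$$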
Table 4.7 BM based cortex-scale system: final performance/price and scaling estimates

| Proposed design | Part-A of BM | Part-B of BM | Total time^b (s) | Total area^a (mm2) | Norm. area^c | No. of (30 cm) wafers | Perf./price (s−1 mm−2) | Norm. perf./price | Total power (W) | Net power density (W mm−2) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Dig. CMOS | Dig. CMOL | 3.63 × 10−2 | 1.67 × 106 | 6.96 | 27 | 1.65 × 10−5 | 1.00 | 3.73 × 104 | 2.23 × 10−2 |
| 2 | Dig. CMOS | MS CMOL | 7.28 × 10−3 | 8.11 × 106 | 33.79 | 128 | 1.69 × 10−5 | 1.03 | 1.49 × 105 | 1.84 × 10−2 |
| 3 | Dig. CMOL | Dig. CMOL | 3.64 × 10−2 | 1.22 × 106 | 5.08 | 20 | 2.25 × 10−5 | 1.36 | 8.37 × 103 | 6.85 × 10−3 |
| 4 | Dig. CMOL | MS CMOL | 7.45 × 10−3 | 7.66 × 106 | 31.92 | 121 | 1.75 × 10−5 | 1.06 | 1.21 × 105 | 1.57 × 10−2 |

a Approximately 4.876 × 106 BMs are required to reach cortex-scale
b Assumes that all BMs are operating in parallel
c Total area normalized by the actual area of the cortex (2.4 × 105 mm2)

The normalized performance/price ratio for the BMCSS is approximately equal across all architectures, except for the all-digital CMOL design (proposed design no. 3), which has the highest performance/price ratio. However, in terms of silicon area alone, or the number of 30 cm wafers required to implement the BMCSS, the all-digital architectures are 5–7 times more compact than the architectures with MS parts in the BM. For the purpose of a crude scaling estimate (in terms of density), the total area of the BMCSS can be compared to the actual area (2.4 × 105 mm2 [39]) of the cortex, as shown in the column labeled "Norm. area"; the artificial BMCSS is approximately 5–33 times larger than the actual cortex. Hence, we can optimistically conclude that, with possible advances in nanoelectronics, we are approaching biological densities with manageable power (i.e. the power density is within the ITRS-allowed limits of 64–200 W cm−2 [57]); this is similar to the claims in [43, 54, 57].
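As a consistency check, the "Norm. area" column is simply the BMCSS silicon area divided by the cortical area; for example, for designs 3 and 2:
$$ \frac{1.22\times 10^{6}\ \rm{mm^2}}{2.4\times 10^{5}\ \rm{mm^2}}\approx 5.08,\qquad \frac{8.11\times 10^{6}\ \rm{mm^2}}{2.4\times 10^{5}\ \rm{mm^2}}\approx 33.8$$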

4.8 Conclusion, Contribution and Future Work

The results in this chapter suggest that implementations of Bayesian inference engines and/or Bayesian BICMs are going to be significantly memory dominated. We conclude, then, that any enabling technology for large-scale Bayesian inference engines will need very high density storage (with the potential for inherent computation), and that this storage needs to be accessed via high bandwidth, which limits the use of off-chip storage. Hybrid nanotechnologies such as CMOL have emerged as very useful candidates for building such large-scale Bayesian inference engines.

We have shown how to effectively use the CMOL hybrid nanotechnology as memory, and as a mixed-signal computational structure, for implementing Bayesian inference engines, which form the core of many AI and machine learning techniques.

We have also proposed a novel use of a mixed-signal CMOL structure for Vector-Matrix Multiplication (VMM), VMM being one of the most fundamental operations used in many computational algorithms. This mixed-signal CMOL structure provides speeds comparable to digital CMOS, which is due to the efficient integration of storage and computation at the nanodevice/grid level.

For our particular application, we have shown that silicon real-estate is utilized much more efficiently (at least 32–40 times better TPM, where TPM ≅ speed per unit area) in the case of CMOL than, for example, in a multi-processor/multi-FPGA system that uses CMOS technology. When compared to a single general-purpose processor/PC, the TPM of CMOL based architectures is 2,200–2,800 times better. It is obvious that CMOL is denser (and slower when used as memory), but what is not as obvious is that CMOL, when used in mixed-signal mode, can provide a reasonable speed-up. However, when using mixed-signal CMOL, the external post-processing CMOS components dominate the area; hence, their virtualization is a crucial constraint for design and scaling.

In addition to the results of the architecture study, another important contribution of this chapter is the development and use of a "hardware design space exploration" methodology for architecting hardware and analyzing its performance/price; this methodology has the potential to be used as an "investigation tool" for various computational models and their implementations. This is particularly true when chip designers start transitioning their existing CMOS solutions to nanotechnology based solutions.

As future work, one needs to explore the remaining design space using both traditional and CMOL technology, and also search for new nanoelectronic circuit candidates that may be better suited for implementing Bayesian inference engines. Another important approach is to investigate the possibility of more approximate inference techniques that trade off accuracy for performance.

4.9 Appendix

Note: The values of the variables (related to Pearl's algorithm) used in the following equations are given in Table 4.2. The performance/price measures for the various circuit components used in the following equations are given in Table 4.5. Most of the subscripts are self-explanatory, and denote the particular component from Table 4.5. The superscripts denote the number of bits for that particular circuit component from Table 4.5. All timing numbers for digital components first need to be normalized to a multiple of tclk = 9.32 × 10−11 s before being used in these equations. The subscript 'BPA' refers to Pearl's belief propagation Eqs. 4.1–4.4 and 4.6; the subscript 'ARL' denotes arithmetic/logic components; the subscript 'MEM' denotes memory.
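As a small illustration of this normalization (our own helper, not part of the original analysis), a raw component delay can be rounded up to a whole number of clock cycles; we assume rounding up to the next cycle.

```python
import math

T_CLK = 9.32e-11  # seconds, from the note above

def normalize_to_clock(t_component_s, t_clk=T_CLK):
    """Round a component delay up to a whole number of clock cycles,
    and return the normalized delay in seconds."""
    cycles = math.ceil(t_component_s / t_clk)
    return cycles * t_clk

# Example: a 0.35 ns adder becomes 4 clock cycles at ~0.0932 ns per cycle.
print(normalize_to_clock(0.35e-9) / T_CLK)  # -> 4.0
```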

4.9.1 Digital FLP or LNS Architecture

The following equations are for FLP architecture; for LNS architecture, replace subscript ‘FPU’ with ‘LNU’. Also, the following equations are for digital CMOS architectures; for digital CMOL architectures replace subscript ‘SRAM’ with ‘CMOL-MEM’. Here, nbit = 32, corresponding to single-precision FLP/LNS data.

Time

$$ {t}_{\rm{BPA1}}=\left({n}_{c} {n}_{CBy} ({t}_{\rm{FPU - mul}}+{t}_{\rm{SRAM}})\right) +({n}_{CBy} {t}_{\rm{SRAM}})$$
$$ {t}_{\rm{BPA2}}=\left({n}_{CBz} \ {n}_{CBy} \ ({t}_{\rm{FPU - add}}+{t}_{\rm{FPU - mul}}+{t}_{\rm{SRAM}})\right)+({n}_{CBz} \ {t}_{\rm{SRAM}})$$
$$ {t}_{\rm{BPA3}}=\left({n}_{CBy} {n}_{CBz} ({t}_{\rm{FPU - add}}+{t}_{\rm{FPU - mul}}+{t}_{\rm{SRAM}})\right) +({n}_{CBy} {t}_{\rm{SRAM}})$$
$$ \begin{array}{l}{t}_{\rm{BPA4}}=\left(2{n}_{CBy} ({t}_{\rm{FPU - mul}}+{t}_{\rm{SRAM}})+({n}_{CBy} {t}_{\rm{SRAM}})\right)+\dots \\ \left({n}_{CBy} ({t}_{\rm{FPU - mul}}+{t}_{\rm{FPU - add}})+(3{n}_{CBy} {t}_{\rm{SRAM}})\right)+{t}_{\rm{FPU - div}}\end{array}$$
$$ \begin{array}{l}{t}_{\rm{BPA6}}=\left({n}_{c} {n}_{c} {n}_{CBy} ({t}_{\rm{FPU - mul}}+{t}_{\rm{SRAM}})+({n}_{c} {n}_{CBy} {t}_{\rm{SRAM}})\right)+\dots \\ \left({n}_{c} {n}_{CBy} ({t}_{\rm{FPU - mul}}+{t}_{\rm{FPU - add}})+(3{n}_{c} {n}_{CBy} {t}_{\rm{SRAM}})\right)+({n}_{c} {t}_{\rm{FPU - div}})\end{array}$$
$$ {t}_{\rm{final}}=\mathrm{max}\left(({t}_{\rm{BPA1}}+{t}_{\rm{BPA2}}+{t}_{\rm{BPA4}}),({t}_{\rm{BPA3}}+{t}_{\rm{BPA6}})\right)$$
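For readers who prefer code, the timing equations above can be evaluated with a minimal sketch such as the following; the component delays (tFPU-add, tFPU-mul, tFPU-div, tSRAM) are placeholders for the Table 4.5 values (already normalized to multiples of tclk), and the example sizes for nc, nCBy, nCBz are hypothetical stand-ins for Table 4.2.

```python
def flp_update_time(n_c, n_cby, n_cbz, t_mul, t_add, t_div, t_mem):
    """Evaluate the FLP-architecture BPA timing equations of Sect. 4.9.1.
    All component delays are assumed to be pre-normalized to multiples of t_clk."""
    t_bpa1 = n_c * n_cby * (t_mul + t_mem) + n_cby * t_mem
    t_bpa2 = n_cbz * n_cby * (t_add + t_mul + t_mem) + n_cbz * t_mem
    t_bpa3 = n_cby * n_cbz * (t_add + t_mul + t_mem) + n_cby * t_mem
    t_bpa4 = (2 * n_cby * (t_mul + t_mem) + n_cby * t_mem) \
             + (n_cby * (t_mul + t_add) + 3 * n_cby * t_mem) + t_div
    t_bpa6 = (n_c * n_c * n_cby * (t_mul + t_mem) + n_c * n_cby * t_mem) \
             + (n_c * n_cby * (t_mul + t_add) + 3 * n_c * n_cby * t_mem) \
             + n_c * t_div
    # Eqs. 4.1, 4.2, 4.4 run on one datapath; Eqs. 4.3 and 4.6 on the other.
    return max(t_bpa1 + t_bpa2 + t_bpa4, t_bpa3 + t_bpa6)

# Hypothetical example (delays in seconds, sizes not taken from Table 4.2):
print(flp_update_time(n_c=4, n_cby=256, n_cbz=64,
                      t_mul=3e-9, t_add=2e-9, t_div=2e-8, t_mem=1e-9))
```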

Area

$$ {n}_{membits}={n}_{bit} (4{n}_{CBy}+{n}_{c} {n}_{CBy}+{n}_{CBz}+{n}_{CBz} {n}_{CBy})$$
$$ {A}_{\rm{MEM}}={A}_{SRAM}^{({n}_{membits})}$$
$$ {A}_{\rm{ARL}}=2{A}_{\rm{FPU}}$$

Power

$$ {P}_{\rm{MEM}}={P}_{SRAM}^{({n}_{membits})}$$
$$ {P}_{\rm{ARL}}=2{P}_{\rm{FPU}}$$

4.9.2 Digital FXP Architecture

Here, nbit is varied over the range [4, 8, …, 32] to obtain the performance/price of the FXP4, FXP8, …, FXP32 architectures, respectively.
$$ {n}_{b1}=({n}_{c}-1) {n}_{bit}$$
$$ {n}_{b3}=ceil({\mathrm{log}}_{2}({n}_{CBy}))+2{n}_{bit}$$
$$ {n}_{b4}=ceil({\mathrm{log}}_{2}({n}_{CBz}))+2{n}_{bit}$$
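These intermediate bit-width calculations can be transcribed directly; the example values for nc, nCBy, and nCBz below are hypothetical.

```python
import math

def fxp_bit_widths(n_bit, n_c, n_cby, n_cbz):
    """Intermediate word widths for the FXP architecture (Sect. 4.9.2)."""
    n_b1 = (n_c - 1) * n_bit
    n_b3 = math.ceil(math.log2(n_cby)) + 2 * n_bit
    n_b4 = math.ceil(math.log2(n_cbz)) + 2 * n_bit
    return n_b1, n_b3, n_b4

# Example: 8-bit data, with hypothetical n_c = 4, n_CBy = 256, n_CBz = 64.
print(fxp_bit_widths(n_bit=8, n_c=4, n_cby=256, n_cbz=64))  # -> (24, 24, 22)
```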

Time

$$ \begin{array}{l}{t}_{\rm{BPA1}}=\left({n}_{c} {n}_{CBy} ({t}_{\rm{FXP - mul}}^{({n}_{b1},{n}_{bit})}+{t}_{\rm{SRAM}})\right)+({n}_{CBy} {t}_{\rm{SRAM}})+\dots \\ \left({n}_{CBy} ({t}_{\rm{FXP - add}}^{({n}_{b1}+{n}_{bit})}+{t}_{\rm{FXP - mul}}^{({n}_{b1}+{n}_{bit},{n}_{bit})}+{t}_{\rm{FXP - div}}^{({n}_{b1}+2{n}_{bit})}+{t}_{\rm{SRAM}})\right)\end{array}$$
$$ \begin{array}{l}{t}_{\rm{BPA2}}=\left({n}_{CBz} {n}_{CBy} ({t}_{\rm{FXP - add}}^{({n}_{b3})}+{t}_{\rm{FXP - mul}}^{({n}_{bit},{n}_{bit})}+{t}_{\rm{SRAM}})\right)+({n}_{CBz} {t}_{\rm{SRAM}})+\dots \\ \left({n}_{CBz} ({t}_{\rm{FXP - add}}^{({n}_{b3})}+{t}_{\rm{FXP - mul}}^{({n}_{b3},{n}_{bit})}+{t}_{\rm{FXP - div}}^{({n}_{b3}+{n}_{bit})}+{t}_{\rm{SRAM}})\right)\end{array}$$
$$ \begin{array}{l}{t}_{\rm{BPA3}}=\left({n}_{CBy} {n}_{CBz} ({t}_{\rm{FXP - add}}^{({n}_{b4})}+{t}_{\rm{FXP - mul}}^{({n}_{bit},{n}_{bit})}+{t}_{\rm{SRAM}})\right)+({n}_{CBy} {t}_{\rm{SRAM}})+\dots \\ \left({n}_{CBy} ({t}_{\rm{FXP - add}}^{({n}_{b4})}+{t}_{\rm{FXP - mul}}^{({n}_{b4},{n}_{bit})}+{t}_{\rm{FXP - div}}^{({n}_{b4}+{n}_{bit})}+{t}_{\rm{SRAM}})\right)\end{array}$$
$$ \begin{array}{l}{t}_{\rm{BPA4}}=\left(2{n}_{CBy} ({t}_{\rm{FXP - mul}}^{({n}_{bit},{n}_{bit})}+{t}_{\rm{SRAM}})+({n}_{CBy} {t}_{\rm{SRAM}})\right)+\dots \\ \left({n}_{CBy} ({t}_{\rm{FXP - add}}^{(2{n}_{bit})}+{t}_{\rm{FXP - mul}}^{(2{n}_{bit},{n}_{bit})}+{t}_{\rm{FXP - div}}^{(3{n}_{bit})}+{t}_{\rm{SRAM}})\right)\end{array}$$
$$ \begin{array}{l}{t}_{\rm{BPA6}}={n}_{c} \left({n}_{CBy} ({t}_{\rm{FXP - mul}}^{({n}_{b1},{n}_{bit})}+{t}_{\rm{SRAM}})+({n}_{CBy} {t}_{\rm{SRAM}})\right)+\dots \\ {n}_{c} \left({n}_{CBy} ({t}_{\rm{FXP - add}}^{({n}_{b1}+{n}_{bit})}+{t}_{\rm{FXP - mul}}^{({n}_{b1}+{n}_{bit},{n}_{bit})}+{t}_{\rm{FXP - div}}^{({n}_{b1}+2{n}_{bit})}+{t}_{\rm{SRAM}})\right)\end{array}$$
$$ {t}_{\rm{final}}=\mathrm{max}\left(({t}_{\rm{BPA1}}+{t}_{\rm{BPA2}}+{t}_{\rm{BPA4}}),({t}_{\rm{BPA3}}+{t}_{\rm{BPA6}})\right)$$

Area

$$ {A}_{\rm{BPA1 - ARL}}={A}_{\rm{FXP - mul}}^{({n}_{b1},{n}_{bit})}+\left({A}_{\rm{FXP - add}}^{({n}_{b1}+{n}_{bit})}+{A}_{\rm{FXP - mul}}^{({n}_{b1}+{n}_{bit},{n}_{bit})}+{A}_{\rm{FXP - div}}^{({n}_{b1}+2{n}_{bit})}\right)$$
$$ {A}_{\rm{BPA2 - ARL}}=\left({A}_{\rm{FXP - add}}^{({n}_{b3})}+{A}_{\rm{FXP - mul}}^{({n}_{bit},{n}_{bit})}\right)+\left({A}_{\rm{FXP - add}}^{({n}_{b3})}+{A}_{\rm{FXP - mul}}^{({n}_{b3},{n}_{bit})}+{A}_{\rm{FXP - div}}^{({n}_{b3}+{n}_{bit})}\right)$$
$$ {A}_{\rm{BPA3 - ARL}}=\left({A}_{\rm{FXP - add}}^{({n}_{b4})}+{A}_{\rm{FXP - mul}}^{({n}_{bit},{n}_{bit})}\right)+\left({A}_{\rm{FXP - add}}^{({n}_{b4})}+{A}_{\rm{FXP - mul}}^{({n}_{b4},{n}_{bit})}+{A}_{\rm{FXP - div}}^{({n}_{b4}+{n}_{bit})}\right)$$
$$ {A}_{\rm{BPA4 - ARL}}={A}_{\rm{FXP - mul}}^{({n}_{bit},{n}_{bit})}+\left({A}_{\rm{FXP - add}}^{(2{n}_{bit})}+{A}_{\rm{FXP - mul}}^{(2{n}_{bit},{n}_{bit})}+{A}_{\rm{FXP - div}}^{(3{n}_{bit})}\right)$$
$$ {A}_{\rm{BPA6 - ARL}}={A}_{\rm{FXP - mul}}^{({n}_{b1},{n}_{bit})}+\left({A}_{\rm{FXP - add}}^{({n}_{b1}+{n}_{bit})}+{A}_{\rm{FXP - mul}}^{({n}_{b1}+{n}_{bit},{n}_{bit})}+{A}_{\rm{FXP - div}}^{({n}_{b1}+2{n}_{bit})}\right)$$
$$ {A}_{\rm{ARL}}={A}_{\rm{BPA1 - ARL}}+{A}_{\rm{BPA2 - ARL}}+{A}_{\rm{BPA3 - ARL}}+{A}_{\rm{BPA4 - ARL}}+{A}_{\rm{BPA6 - ARL}}$$
$$ \begin{array}{l}{n}_{membits}={n}_{bit} (4{n}_{CBy}+{n}_{c} {n}_{CBy}+{n}_{CBz}+{n}_{CBz} {n}_{CBy})+\dots \\ \left(({n}_{c}+1){n}_{c}{n}_{bit}{n}_{CBy}+{n}_{b3}{n}_{CBz}+{n}_{b4}{n}_{CBy}+2{n}_{bit}{n}_{CBy}\right)\end{array}$$
$$ {A}_{\rm{MEM}}={A}_{\rm{SRAM}}^{({n}_{membits})}$$

Power

$$ {P}_{\rm{BPA1 - ARL}}={P}_{\rm{FXP - mul}}^{({n}_{b1},{n}_{bit})}+\left({P}_{\rm{FXP - add}}^{({n}_{b1}+{n}_{bit})}+{P}_{\rm{FXP - mul}}^{({n}_{b1}+{n}_{bit},{n}_{bit})}+{P}_{\rm{FXP - div}}^{({n}_{b1}+2{n}_{bit})}\right)$$
$$ {P}_{\rm{BPA2 - ARL}}=\left({P}_{\rm{FXP - add}}^{({n}_{b3})}+{P}_{\rm{FXP - mul}}^{({n}_{bit},{n}_{bit})}\right)+\left({P}_{\rm{FXP - add}}^{({n}_{b3})}+{P}_{\rm{FXP - mul}}^{({n}_{b3},{n}_{bit})}+{P}_{\rm{FXP - div}}^{({n}_{b3}+{n}_{bit})}\right)$$
$$ {P}_{\rm{BPA3 - ARL}}=\left({P}_{\rm{FXP - add}}^{({n}_{b4})}+{P}_{\rm{FXP - mul}}^{({n}_{bit},{n}_{bit})}\right)+\left({P}_{\rm{FXP - add}}^{({n}_{b4})}+{P}_{\rm{FXP - mul}}^{({n}_{b4},{n}_{bit})}+{P}_{\rm{FXP - div}}^{({n}_{b4}+{n}_{bit})}\right)$$
$$ {P}_{\rm{BPA4 - ARL}}={P}_{\rm{FXP - mul}}^{({n}_{bit},{n}_{bit})}+\left({P}_{\rm{FXP - add}}^{(2{n}_{bit})}+{P}_{\rm{FXP - mul}}^{(2{n}_{bit},{n}_{bit})}+{P}_{\rm{FXP - div}}^{(3{n}_{bit})}\right)$$
$$ {P}_{\rm{BPA6 - ARL}}={P}_{\rm{FXP - mul}}^{({n}_{b1},{n}_{bit})}+\left({P}_{\rm{FXP - add}}^{({n}_{b1}+{n}_{bit})}+{P}_{\rm{FXP - mul}}^{({n}_{b1}+{n}_{bit},{n}_{bit})}+{P}_{\rm{FXP - div}}^{({n}_{b1}+2{n}_{bit})}\right)$$
$$ {P}_{\rm{ARL}}={P}_{\rm{BPA1 - ARL}}+{P}_{\rm{BPA2 - ARL}}+{P}_{\rm{BPA3 - ARL}}+{P}_{\rm{BPA4 - ARL}}+{P}_{\rm{BPA6 - ARL}}$$
$$ {P}_{\rm{MEM}}={P}_{\rm{SRAM}}^{({n}_{membits})}$$

4.9.3 Mixed-Signal CMOS Architecture

For the mixed-signal CMOS architecture, each element of the CPT and the external data have an (nbit =) 8-bit representation. The mixed-signal CMOS SVMM implements only Eqs. 4.2 and 4.3 of the BPA; for Eqs. 4.1, 4.4, and 4.6, use the corresponding performance/price equations from the FXP architecture.

Time

$$ \begin{array}{l}{t}_{\rm{BPA2}}=\left({n}_{CBz} {n}_{bit} ({t}_{\rm{CID}}+{n}_{bit} {t}_{\rm{FXP - add}}^{({n}_{b3})}+{t}_{\rm{ADC}})\right)+\dots \\ \left({n}_{CBz} ({t}_{\rm{FXP - add}}^{({n}_{b3})}+{t}_{\rm{FXP - mul}}^{({n}_{b3},{n}_{bit})}+{t}_{\rm{FXP - div}}^{({n}_{b3}+{n}_{bit})}+{t}_{\rm{SRAM}})\right)\end{array}$$
$$ \begin{array}{l}{t}_{\rm{BPA3}}=\left({n}_{CBy} {n}_{bit} ({t}_{\rm{CID}}+{n}_{bit} {t}_{\rm{FXP - add}}^{({n}_{b4})}+{t}_{\rm{ADC}})\right)+\dots \\ \left({n}_{CBy} ({t}_{\rm{FXP - add}}^{({n}_{b4})}+{t}_{\rm{FXP - mul}}^{({n}_{b4},{n}_{bit})}+{t}_{\rm{FXP - div}}^{({n}_{b4}+{n}_{bit})}+{t}_{\rm{SRAM}})\right)\end{array}$$
$$ {t}_{\rm{final}}=\mathrm{max}\left(({t}_{\rm{BPA1}}+{t}_{\rm{BPA2}}+{t}_{\rm{BPA4}}),({t}_{\rm{BPA3}}+{t}_{\rm{BPA6}})\right)$$

Area

$$ {A}_{\rm{BPA2 - ARL}}=\left({A}_{\rm{FXP - add}}^{({n}_{b3})}+{n}_{bit} {A}_{\rm{ADC}}\right)+\left({A}_{\rm{FXP - add}}^{({n}_{b3})}+{A}_{\rm{FXP - mul}}^{({n}_{b3},{n}_{bit})}+{A}_{\rm{FXP - div}}^{({n}_{b3}+{n}_{bit})}\right)$$
$$ {A}_{\rm{BPA2 - MEM}}={n}_{CBz} {n}_{CBy} {n}_{bit} {A}_{\rm{CID}}$$
$$ {A}_{\rm{BPA3 - ARL}}=\left({A}_{\rm{FXP - add}}^{({n}_{b4})}+{n}_{bit} {A}_{\rm{ADC}}\right)+\left({A}_{\rm{FXP - add}}^{({n}_{b4})}+{A}_{\rm{FXP - mul}}^{({n}_{b4},{n}_{bit})}+{A}_{\rm{FXP - div}}^{({n}_{b4}+{n}_{bit})}\right)$$
$$ {A}_{\rm{BPA3 - MEM}}={n}_{CBy} {n}_{CBz} {n}_{bit} {A}_{\rm{CID}}$$
$$ \begin{array}{l}{n}_{membits}={n}_{bit} (4.5{n}_{CBy}+{n}_{c} {n}_{CBy}+1.5{n}_{CBz})+\dots \\ \left(({n}_{c}+1){n}_{c}{n}_{bit}{n}_{CBy}+{n}_{b3}{n}_{CBz}+{n}_{b4}{n}_{CBy}+2{n}_{bit}{n}_{CBy}\right)\end{array}$$
$$ {A}_{\rm{ARL}}={A}_{\rm{BPA1 - ARL}}+{A}_{\rm{BPA2 - ARL}}+{A}_{\rm{BPA3 - ARL}}+{A}_{\rm{BPA4 - ARL}}+{A}_{\rm{BPA6 - ARL}}$$
$$ {A}_{\rm{MEM}}={A}_{\rm{SRAM}}^{({n}_{membits})}+{A}_{\rm{BPA2 - MEM}}+{A}_{\rm{BPA3 - MEM}}$$

Power

$$ {P}_{\rm{BPA2 - ARL}}=\left({P}_{\rm{FXP - add}}^{({n}_{b3})}+{n}_{bit} {P}_{\rm{ADC}}\right)+\left({P}_{\rm{FXP - add}}^{({n}_{b3})}+{P}_{\rm{FXP - mul}}^{({n}_{b3},{n}_{bit})}+{P}_{\rm{FXP - div}}^{({n}_{b3}+{n}_{bit})}\right)$$
$$ {P}_{\rm{BPA2 - MEM}}={n}_{CBz} {n}_{CBy} {n}_{bit} {P}_{\rm{CID}}$$
$$ {P}_{\rm{BPA3 - ARL}}=\left({P}_{\rm{FXP - add}}^{({n}_{b4})}+{n}_{bit} {P}_{\rm{ADC}}\right)+\left({P}_{\rm{FXP - add}}^{({n}_{b4})}+{P}_{\rm{FXP - mul}}^{({n}_{b4},{n}_{bit})}+{P}_{\rm{FXP - div}}^{({n}_{b4}+{n}_{bit})}\right)$$
$$ {P}_{\rm{BPA3 - MEM}}={n}_{CBy} {n}_{CBz} {n}_{bit} {P}_{\rm{CID}}$$
$$ {P}_{\rm{ARL}}={P}_{\rm{BPA1 - ARL}}+{P}_{\rm{BPA2 - ARL}}+{P}_{\rm{BPA3 - ARL}}+{P}_{\rm{BPA4 - ARL}}+{P}_{\rm{BPA6 - ARL}}$$
$$ {P}_{\rm{MEM}}={P}_{\rm{SRAM}}^{({n}_{membits})}+{P}_{\rm{BPA2 - MEM}}+{P}_{\rm{BPA3 - MEM}}$$

4.9.4 Mixed-Signal CMOL Architecture

For the mixed-signal CMOL architecture, each element of the CPT and the external data have an (nbit =) 8-bit representation. The mixed-signal CMOL-based SVMM implements only Eqs. 4.2 and 4.3 of the BPA; for Eqs. 4.1, 4.4, and 4.6, use the corresponding performance/price equations from the FXP architecture.

MS CMOL Nanogrid for the SVMM

The following equations for CMOL nanogrid modeling are adapted from [3] and [58] (more details can be found there).
$$ D=18/{(8\rm{nm}/{F}_{\rm{nano}})}^{2}$$
$$ {R}_{off}=4000{R}_{on}$$
$$ {R}_{con}=r/{F}_{\rm{nano}}^{2}$$
$$ L=2(N+M){F}_{\rm{nano}}$$
$$ {R}_{wire}=L{r}_{0}(1+(\ell /{F}_{\rm{nano}}))/{F}_{\rm{nano}}^{2}$$
$$ t=(2{R}_{con}+1.5{R}_{wire}+({R}_{on}/D))\cdot {C}_{wire}$$
$$ {P}_{on}=abNM{V}^{2}/(2{R}_{con}+{R}_{wire}+({R}_{on}/D))$$
$$ {P}_{leak}=a(1-b)NM{V}^{2}/(2{R}_{con}+{R}_{wire}+({R}_{off}/D))$$
$$ {P}_{dyn}=ag(N+M){C}_{wire}{V}^{2}/t$$
$$ {P}_{grid}={P}_{on}+{P}_{leak}+{P}_{dyn}$$
$$ {A}_{grid}=4NM{F}_{nano}^{2}$$

Where \( {r}_{0}=2\ \rm{m}\Omega\,\rm{cm}\), \( \ell =10\ \rm{nm}\), \( {C}_{wire}/L=0.2\ \rm{fF}\,\mu {\rm{m}}^{-1}\) [9]; \( r=1{0}^{-8}\ \Omega\,\rm{c}{\rm{m}}^{2}\) [3]; \( {R}_{on}=100\ \rm{M}\Omega \), \( V=0.3\ \rm{V}\), \( a=0.5(1/M)\), \( b=g=0.5\) [58]; and for our implementation N = nCBy for Eq. 4.2, N = nCBz for Eq. 4.3, and M = nbit.
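A minimal sketch that transcribes the nanogrid model above is given below; the unit conventions (cm-based resistivities, nm-based geometry) and the example grid size are our assumptions, and the constants are those quoted in the text.

```python
def nanogrid_model(N, M, F_nano_nm):
    """Delay (s), power (W), and area (nm^2) of one N x M CMOL nanogrid,
    transcribing the equations above; constants as quoted in the text."""
    r0 = 2e-3             # wire resistivity, Ohm*cm (as given above)
    l_over_F = 10.0 / F_nano_nm   # l = 10 nm
    c_per_um = 0.2e-15    # wire capacitance per unit length, F/um
    r_contact = 1e-8      # specific contact resistance, Ohm*cm^2
    R_on = 100e6          # nanodevice ON resistance, Ohm
    R_off = 4000.0 * R_on
    V = 0.3               # volts
    a, b, g = 0.5 / M, 0.5, 0.5

    F_cm = F_nano_nm * 1e-7                 # half-pitch in cm
    D = 18.0 / (8.0 / F_nano_nm) ** 2       # effective device parallelism
    R_con = r_contact / F_cm ** 2
    L_cm = 2.0 * (N + M) * F_cm             # wire length, cm
    R_wire = L_cm * r0 * (1.0 + l_over_F) / F_cm ** 2
    C_wire = c_per_um * (L_cm * 1e4)        # total wire capacitance, F

    t = (2 * R_con + 1.5 * R_wire + R_on / D) * C_wire
    P_on = a * b * N * M * V**2 / (2 * R_con + R_wire + R_on / D)
    P_leak = a * (1 - b) * N * M * V**2 / (2 * R_con + R_wire + R_off / D)
    P_dyn = a * g * (N + M) * C_wire * V**2 / t
    P_grid = P_on + P_leak + P_dyn
    A_grid_nm2 = 4.0 * N * M * F_nano_nm**2
    return t, P_grid, A_grid_nm2

# Example: an n_CBy x n_bit grid at a 3 nm nanowire half-pitch (sizes hypothetical).
print(nanogrid_model(N=256, M=8, F_nano_nm=3.0))
```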

Time

$$ \begin{array}{l}{t}_{\rm{BPA2}}=\left({n}_{CBz} {n}_{bit} (t+{n}_{bit} {t}_{\rm{FXP - add}}^{({n}_{b3})}+{t}_{\rm{ADC}})\right)+\dots \\ \left({n}_{CBz} ({t}_{\rm{FXP - add}}^{({n}_{b3})}+{t}_{\rm{FXP - mul}}^{({n}_{b3},{n}_{bit})}+{t}_{\rm{FXP - div}}^{({n}_{b3}+{n}_{bit})}+{t}_{\rm{CMOL - MEM}})\right)\end{array}$$
$$ \begin{array}{l}{t}_{\rm{BPA3}}=\left({n}_{CBy} {n}_{bit} (t+{n}_{bit} {t}_{\rm{FXP - add}}^{({n}_{b4})}+{t}_{\rm{ADC}})\right)+\dots \\ \left({n}_{CBy} ({t}_{\rm{FXP - add}}^{({n}_{b4})}+{t}_{\rm{FXP - mul}}^{({n}_{b4},{n}_{bit})}+{t}_{\rm{FXP - div}}^{({n}_{b4}+{n}_{bit})}+{t}_{\rm{CMOL - MEM}})\right)\end{array}$$
$$ {t}_{\rm{final}}=\mathrm{max}\left(({t}_{\rm{BPA1}}+{t}_{\rm{BPA2}}+{t}_{\rm{BPA4}}),({t}_{\rm{BPA3}}+{t}_{\rm{BPA6}})\right)$$

Area

$$ {A}_{\rm{BPA2 - ARL}}=\left({A}_{\rm{FXP - add}}^{({n}_{b3})}+{n}_{bit} {A}_{\rm{ADC}}\right)+\left({A}_{\rm{FXP - add}}^{({n}_{b3})}+{A}_{\rm{FXP - mul}}^{({n}_{b3},{n}_{bit})}+{A}_{\rm{FXP - div}}^{({n}_{b3}+{n}_{bit})}\right)$$
$$ {A}_{\rm{BPA2 - MEM}}={n}_{CBz} {A}_{grid}$$
$$ {A}_{\rm{BPA3 - ARL}}=\left({A}_{\rm{FXP - add}}^{({n}_{b4})}+{n}_{bit} {A}_{\rm{ADC}}\right)+\left({A}_{\rm{FXP - add}}^{({n}_{b4})}+{A}_{\rm{FXP - mul}}^{({n}_{b4},{n}_{bit})}+{A}_{\rm{FXP - div}}^{({n}_{b4}+{n}_{bit})}\right)$$
$$ {A}_{\rm{BPA3 - MEM}}={n}_{CBy} {A}_{grid}$$
$$ \begin{array}{l}{n}_{membits}={n}_{bit} (4.5{n}_{CBy}+{n}_{c} {n}_{CBy}+1.5{n}_{CBz})+\dots \\ \left(({n}_{c}+1){n}_{c}{n}_{bit}{n}_{CBy}+{n}_{b3}{n}_{CBz}+{n}_{b4}{n}_{CBy}+2{n}_{bit}{n}_{CBy}\right)\end{array}$$
$$ {A}_{\rm{ARL}}={A}_{\rm{BPA1 - ARL}}+{A}_{\rm{BPA2 - ARL}}+{A}_{\rm{BPA3 - ARL}}+{A}_{\rm{BPA4 - ARL}}+{A}_{\rm{BPA6 - ARL}}$$
$$ {A}_{\rm{MEM}}={A}_{\rm{CMOL - MEM}}^{({n}_{membits})}+{A}_{\rm{BPA2 - MEM}}+{A}_{\rm{BPA3 - MEM}}$$

Power

$$ {P}_{\rm{BPA2 - ARL}}=\left({P}_{\rm{FXP - add}}^{({n}_{b3})}+{n}_{bit} {P}_{\rm{ADC}}\right)+\left({P}_{\rm{FXP - add}}^{({n}_{b3})}+{P}_{\rm{FXP - mul}}^{({n}_{b3},{n}_{bit})}+{P}_{\rm{FXP - div}}^{({n}_{b3}+{n}_{bit})}\right)$$
$$ {P}_{\rm{BPA2 - MEM}}={n}_{CBz} {P}_{grid}$$
$$ {P}_{\rm{BPA3 - ARL}}=\left({P}_{\rm{FXP - add}}^{({n}_{b4})}+{n}_{bit} {P}_{\rm{ADC}}\right)+\left({P}_{\rm{FXP - add}}^{({n}_{b4})}+{P}_{\rm{FXP - mul}}^{({n}_{b4},{n}_{bit})}+{P}_{\rm{FXP - div}}^{({n}_{b4}+{n}_{bit})}\right)$$
$$ {P}_{\rm{BPA3 - MEM}}={n}_{CBy} {P}_{grid}$$
$$ {P}_{\rm{ARL}}={P}_{\rm{BPA1 - ARL}}+{P}_{\rm{BPA2 - ARL}}+{P}_{\rm{BPA3 - ARL}}+{P}_{\rm{BPA4 - ARL}}+{P}_{\rm{BPA6 - ARL}}$$
$$ {P}_{\rm{MEM}}={P}_{\rm{CMOL - MEM}}^{({n}_{membits})}+{P}_{\rm{BPA2 - MEM}}+{P}_{\rm{BPA3 - MEM}}$$

4.9.5 Example: Use of Architecture Assessment Methodology for Associative Memory Model

We briefly discuss the use of our architecture assessment methodology for implementing an associative memory model (for further details on the model/implementation, refer to [3], [54]). Each step in Fig. 4.2 is briefly summarized below:

Step-1: Assume the (Palm and Willshaw) associative memory model [3].

Steps-2 and 3: Assume that both the input (X) and output (Y) vectors are binary, with a fixed number of active elements (or '1's), l and k respectively. Assume that the weight matrix is trained using the summation-rule (instead of the simple OR-rule), and hence consists of multi-bit weight values. During recall, the input vector X is applied to the network; each intermediate output element is obtained as the inner product \( {\tilde{y}}_{i}={\sum }_{j}{w}_{ij}{x}_{j}\); then a global threshold \( q\) is selected such that exactly k output elements are '1', i.e. \( {\forall }_{i}\), set yi = 1 if \( {\tilde{y}}_{i}\ge q\), else set yi = 0.
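A minimal algorithmic sketch of this recall step (inner product followed by a global-threshold k-WTA) is shown below; it illustrates the computation only, not a hardware model, and the weights and input are hypothetical.

```python
import numpy as np

def palm_recall(W, x, k):
    """Recall step of a Palm/Willshaw-style associative memory.
    W: integer weight matrix (summation-rule trained), x: binary input vector,
    k: number of active outputs to keep."""
    y_tilde = W @ x                      # inner products, one per output element
    # Global threshold chosen so that exactly k outputs fire
    # (equivalent to picking the k largest sums; ties broken arbitrarily).
    winners = np.argsort(y_tilde)[-k:]
    y = np.zeros_like(y_tilde, dtype=np.int8)
    y[winners] = 1
    return y

# Tiny hypothetical example: 4 outputs, 6 inputs, keep k = 2 winners.
W = np.array([[1, 0, 2, 1, 0, 0],
              [0, 3, 0, 1, 1, 0],
              [2, 1, 1, 0, 0, 1],
              [0, 0, 0, 1, 0, 2]])
x = np.array([1, 0, 1, 1, 0, 0])
print(palm_recall(W, x, k=2))  # -> [1 0 1 0]
```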

Step-4: Major equations/operations during recall are: the inner-product for \( {\tilde{y}}_{i}\), and the k-winner take all (k-WTA) to get yi.

Step-5: The inner product consists of multiplication- and addition-type computations, and the k-WTA consists of comparison-type computations. For the inner product, the multiplication is simple because the input X is binary; hence a multi-input AND-gate can be substituted for a multiplier.

Steps-6 and 7: To accomplish the above computations (and storage), various circuit components (CMOS/nano, digital/analog) can be used, leading to different design configurations. Some of the proposed configurations are: digital CMOS, digital CMOL, mixed-signal (MS) CMOS, and MS CMOL. (There are other possible configurations, but they are not covered here. In addition, the MS designs proposed here are different from [3].)

Step-8: For a worst case analysis, assume that the weight matrix is full density (not sparse), and that the performance is estimated using sequential operations and/or worst case timing paths.

Step-9: In the digital CMOS design, the weight matrix is stored as SRAM/eDRAM, the inner product uses digital components, and the k-WTA uses a digital k-WTA circuit component [3]. The digital CMOL design is the same as digital CMOS, except that the SRAM is replaced with CMOL-nanogrid based digital memory. In the MS CMOS design, the weight matrix can be stored in an analog floating-gate (FG-cell) array [59], and the k-WTA is done using an analog k-WTA circuit [3]. In the MS CMOL design, the weight matrix can be stored on 1R nanodevices (or nano-memristors [60]) in a nanogrid, and the k-WTA is again done using an analog k-WTA circuit [3]. In both the MS CMOS and MS CMOL designs, for the inner product, the multiplication occurs in each FG-cell/nanodevice, and the addition occurs along the vertical nanowires as an (analog) summation of charge/currents [3]. The size (and performance/price) of the digital and analog components depends on the size of the input and output vectors, the number of active '1's, etc., as selected by the user. The size of the CMOL nanogrids also depends indirectly on the size of the input and output vectors.

Step-10: A system-level block diagram needs to be generated for each design, using the constituent circuit components discussed earlier.

Step-11: The performance/price measures of all digital/analog CMOS circuit components can be derived from published sources. For new components, such as the CMOL nanogrid structures, the performance/price measures can be modeled using fundamental area/power equations along with Elmore-delay analysis [3]. These measures have to be scaled to the appropriate/desired CMOS/CMOL technology node using VLSI scaling rules [45].

Step-12: The final performance/price for the complete design (for each particular design configuration) needs to be evaluated considering the system-level block diagram, using the individual performance/price measures of the circuit components, and based on the computational requirements of the operations (i.e. according to the size of the input/output vectors, etc.).
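As a purely illustrative sketch of Step-12 (not the actual Table 4.5/4.6 data), per-component measures from Step-11 might be combined into a system-level estimate as follows; all component names and numbers are hypothetical.

```python
# Hypothetical per-component measures: (count, area_mm2, power_W, delay_s_per_op).
components = {
    "weight_memory": (1, 0.50, 0.020, 2.0e-9),
    "adder_tree":    (1, 0.05, 0.005, 1.0e-9),
    "k_wta":         (1, 0.02, 0.002, 5.0e-9),
}
# Hypothetical operation counts for one recall, from the model's vector sizes.
ops_per_recall = {"weight_memory": 4096, "adder_tree": 4096, "k_wta": 1}

area = sum(cnt * a for cnt, a, _, _ in components.values())          # mm^2
power = sum(cnt * p for cnt, _, p, _ in components.values())          # W
time = sum(ops_per_recall[name] * d
           for name, (_, _, _, d) in components.items())              # s per recall
tpm = (858.0 / area) / time   # throughput per 858 mm^2 max-chip
print(area, power, time, tpm)
```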

Footnotes

1. "Performance overkill" is where the highest-volume segments of the market are no longer performance/clock frequency driven.

2. "Density overkill" is where it is difficult for a design team to effectively design and verify all the transistors available to them on a typical design schedule.

3. We use the term CMOL to describe a specific family of nanogrid structures as developed by Likharev et al.

4. From a traditional computer engineering perspective, the detailed implementation of designs from regions 5–8 could be referred to as micro-architectures, but for simplicity, we refer to them as architectures.

5. This methodology has been used to study digital and mixed-signal designs/architectures in the past; in order to use it for other designs (including analog designs), one may need to make some adjustments.

6. The "diagnostic" message is also called bottom-up evidence, and the "causal" message is also called top-down evidence.

7. Probabilistic belief or Bayesian confidence is simply referred to as "belief".

8. A viable alternative (not explored here) to SRAM is eDRAM, which is denser, but significantly slower.

9. Here we are trading off local computation and storage for longer-range communication.

10. Considering that nCBy ≠ nCBz, we cannot virtualize the blocks across Eqs. 4.2 and 4.3. We could virtualize the range normalization blocks for Eqs. 4.4 and 4.6, but have not done so here for simplicity's sake.

11. Here, t_op denotes the time required to complete an 'op'-type operation.

12. This may not always be true, because it depends on the size of M.

13. In accordance with our baseline assumption, the timing analysis for the digital architectures assumes that most of the operations within each block are executed sequentially, without any pipelining; however, some of the blocks execute in parallel.

14. This system has 864 AMD processors and 150 FPGAs; the reported Nodes/s is 2.25 × 106 for a large network of BMs. The TPM is derived by using a (22/90) technology-scaling factor for 22 nm, and a (1,014 × 200/858 mm2) area factor; we assume each processor-chipset or FPGA (with SRAM banks) will have an area of 200 mm2.

15. Such a crude scaling estimate is intended only for guidance purposes. No structural/functional equivalence to the actual cortex and/or human intelligence is intended or claimed.

Notes

Acknowledgment

Useful discussions with many colleagues, including Prof. K.K. Likharev, Dr. Changjian Gao, and Prof. G.G. Lendaris are gratefully acknowledged.

References

1. M.S. Zaveri, D. Hammerstrom, CMOL/CMOS implementations of Bayesian polytree inference: digital & mixed-signal architectures and performance/price. IEEE Trans. Nanotechnol. 9(2), 194–211 (2010). DOI: 10.1109/TNANO.2009.2028342
2. D. Hammerstrom, M.S. Zaveri, Prospects for building cortex-scale CMOL/CMOS circuits: a design space exploration, in Proceedings of IEEE Norchip Conference (Trondheim, Norway, 2009)
3. C. Gao, D. Hammerstrom, Cortical models onto CMOL and CMOS – architectures and performance/price. IEEE Trans. Circ. Syst.-I 54, 2502–2515 (2007)
4. S. Borkar, Electronics beyond nano-scale CMOS, in Proceedings of 43rd Annual ACM/IEEE Design Automation Conference (San Francisco, CA, 2006), pp. 807–808
5. R.I. Bahar, D. Hammerstrom, J. Harlow, W.H.J. Jr., C. Lau, D. Marculescu, A. Orailoglu, M. Pedram, Architectures for silicon nanoelectronics and beyond. IEEE Computer 40, 25–33 (2007)
6. D. Hammerstrom, A survey of bio-inspired and other alternative architectures, in Nanotechnology: Information Technology-II, ed. by R. Waser, vol. 4 (Wiley-VCH Verlag GmbH, Weinheim, Germany, 2008), pp. 251–285
7. Intel, 60 years of the transistor: 1947–2007, Intel Corp., Hillsboro, OR (2007), http://www.intel.com/technology/timeline.pdf
8. V. Beiu, Grand challenges of nanoelectronics and possible architectural solutions: what do Shannon, von Neumann, Kolmogorov, and Feynman have to do with Moore, in Proceedings of 37th IEEE International Symposium on Multiple-Valued Logic (Oslo, Norway, 2007)
9. D.B. Strukov, K.K. Likharev, CMOL FPGA: a reconfigurable architecture for hybrid digital circuits with two-terminal nanodevices. Nanotechnology 16, 888–900 (2005)
10. Ö. Türel, J.H. Lee, X. Ma, K.K. Likharev, Architectures for nanoelectronic implementation of artificial neural networks: new results. Neurocomputing 64, 271–283 (2005)
11. K.K. Likharev, D.V. Strukov, CMOL: devices, circuits, and architectures, in Introduction to Molecular Electronics, ed. by G. Cuniberti, G. Fagas, K. Richter (Springer, Berlin, 2005), pp. 447–478
12. D.B. Strukov, K.K. Likharev, Reconfigurable hybrid CMOS/nanodevice circuits for image processing. IEEE Trans. Nanotechnol. 6, 696–710 (2007)
13. G. Snider, R. Williams, Nano/CMOS architectures using a field-programmable nanowire interconnect. Nanotechnology 18, 1–11 (2007)
14. NAE, Reverse-engineer the brain, Grand challenges for engineering (The U.S. National Academy of Engineering (NAE) of The National Academies, Washington, DC, [online], 2008), http://www.engineeringchallenges.org. Accessed 15 February 2008
15. R. Ananthanarayanan, D.S. Modha, Anatomy of a cortical simulator, in ACM/IEEE Conference on High Performance Networking and Computing: Supercomputing (Reno, NV, 2007)
16. D. George, J. Hawkins, A hierarchical Bayesian model of invariant pattern recognition in the visual cortex, in Proceedings of International Joint Conference on Neural Networks (Montreal, Canada, 2005), pp. 1812–1817
17. T.S. Lee, D. Mumford, Hierarchical Bayesian inference in the visual cortex. J. Opt. Soc. Am. A Opt. Image Sci. Vis. 20, 1434–1448 (2003)
18. T. Dean, Learning invariant features using inertial priors. Ann. Math. Artif. Intell. 47, 223–250 (2006)
19. G.G. Lendaris, On systemness and the problem solver: tutorial comments. IEEE Trans. Syst. Man Cybern. 16, 604–610 (1986)
20. M.S. Zaveri, CMOL/CMOS hardware architectures and performance/price for Bayesian memory – The building block of intelligent systems, Ph.D. dissertation, Department of Electrical and Computer Engineering, Portland State University, Portland, OR, October 2009
21. K.L. Rice, T.M. Taha, C.N. Vutsinas, Scaling analysis of a neocortex inspired cognitive model on the Cray XD1. J. Supercomput. 47, 21–43 (2009)
22. D. George, A mathematical canonical cortical circuit model that can help build future-proof parallel architecture, Workshop on Technology Maturity for Adaptive Massively Parallel Computing (Intel Inc., Portland, OR, March 2009), http://www.technologydashboard.com/adaptivecomputing/Presentations/MPAC%20Portland__Dileep.pdf
23. C. Gao, M.S. Zaveri, D. Hammerstrom, CMOS/CMOL architectures for spiking cortical column, in Proceedings of IEEE World Congress on Computational Intelligence – International Joint Conference on Neural Networks (Hong Kong, 2008), pp. 2442–2449
24. E. Rechtin, The art of systems architecting. IEEE Spectrum 29, 66–69 (1992)
25. D. Hammerstrom, Digital VLSI for neural networks, in The Handbook of Brain Theory and Neural Networks, ed. by M.A. Arbib (MIT Press, Cambridge, MA, 1998), pp. 304–309
26. J. Bailey, D. Hammerstrom, Why VLSI implementations of associative VLCNs require connection multiplexing, in Proceedings of IEEE International Conference on Neural Networks (San Diego, CA, 1988), pp. 173–180
27. J. Schemmel, J. Fieres, K. Meier, Wafer-scale integration of analog neural networks, in Proceedings of IEEE World Congress on Computational Intelligence – International Joint Conference on Neural Networks (Hong Kong, 2008), pp. 431–438
28. K.A. Boahen, Point-to-point connectivity between neuromorphic chips using address events. IEEE Trans. Circ. Syst. II: Anal. Dig. Sig. Process. 47, 416–434 (2000)
29. D. George, B. Jaros, The HTM learning algorithms (Numenta Inc., Menlo Park, CA, Whitepaper, March 2007), http://www.numenta.com/for-developers/education/Numenta_HTM_Learning_Algos.pdf
30. K.L. Rice, T.M. Taha, C.N. Vutsinas, Hardware acceleration of image recognition through a visual cortex model. Optics Laser Tech. 40, 795–802 (2008)
31. C.N. Vutsinas, T.M. Taha, K.L. Rice, A neocortex model implementation on reconfigurable logic with streaming memory, in IEEE International Symposium on Parallel and Distributed Processing (Miami, FL, 2008), pp. 1–8
32. R.C. O'Reilly, Y. Munakata, J.L. McClelland, Computational Explorations in Cognitive Neuroscience: Understanding the Mind by Simulating the Brain, 1st edn. (MIT Press, Cambridge, MA, 2000)
33. J. Hawkins, D. George, Hierarchical temporal memory: Concepts, theory and terminology (Numenta Inc., Menlo Park, CA, Whitepaper, March 2007), http://www.numenta.com/Numenta_HTM_Concepts.pdf
34. J. Hawkins, S. Blakeslee, On Intelligence (Times Books, Henry Holt, New York, 2004)
35. D. Hammerstrom, M.S. Zaveri, Bayesian memory, a possible hardware building block for intelligent systems, AAAI Fall Symposium Series on Biologically Inspired Cognitive Architectures (Arlington, VA) (AAAI Press, Menlo Park, CA, TR FS-08-04, Nov. 2008), p. 81
36. J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann, San Francisco, CA, 1988)
37. R. Genov, G. Cauwenberghs, Charge-mode parallel architecture for vector–matrix multiplication. IEEE Trans. Circ. Syst.-II 48, 930–936 (2001)
38. B. Murmann, Digitally assisted analog circuits. IEEE Micro 26, 38–47 (2006)
39. C. Johansson, A. Lansner, Towards cortex sized artificial neural systems. Neural Networks 20, 48–61 (2007)
40. R. Granger, Brain circuit implementation: high-precision computation from low-precision components, in Replacement Parts for the Brain, ed. by T. Berger, D. Glanzman (MIT Press, Cambridge, MA, 2005), pp. 277–294
41. S. Minghua, A. Bermak, An efficient digital VLSI implementation of Gaussian mixture models-based classifier. IEEE Trans. VLSI Syst. 14, 962–974 (2006)
42. D.B. Strukov, K.K. Likharev, Defect-tolerant architectures for nanoelectronic crossbar memories. J. Nanosci. Nanotechnol. 7, 151–167 (2007)
43. K.K. Likharev, D.B. Strukov, Prospects for the development of digital CMOL circuits, in Proceedings of International Symposium on Nanoscale Architectures (San Jose, CA, 2007), pp. 109–116
44. J.M. Rabaey, Digital Integrated Circuits: A Design Perspective (Prentice Hall, Upper Saddle River, NJ, 1996)
45. N. Weste, D. Harris, CMOS VLSI Design – A Circuits and Systems Perspective, 3rd edn. (Addison Wesley/Pearson, Boston, MA, 2004)
46. M. Haselman, M. Beauchamp, A. Wood, S. Hauck, K. Underwood, K.S. Hemmert, A comparison of floating point and logarithmic number systems for FPGAs, in 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Napa, CA, 2005), pp. 181–190
47. K.K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation (Wiley, New York, 1999)
48. K. Seungchul, L. Yongjoo, J. Wookyeong, L. Yongsurk, Low cost floating point arithmetic unit design, in Proceedings of IEEE Asia-Pacific Conference on ASIC (Taipei, Taiwan, 2002), pp. 217–220
49. D.M. Lewis, 114 MFLOPS logarithmic number system arithmetic unit for DSP applications. IEEE J. Solid-St. Circ. 30, 1547–1553 (1995)
50. P.C. Yu, H.-S. Lee, A 2.5-V, 12-b, 5-MSample/s pipelined CMOS ADC. IEEE J. Solid-St. Circ. 31, 1854–1861 (1996)
51. T.E. Williams, M.A. Horowitz, A zero-overhead self-timed 160-ns 54-b CMOS divider. IEEE J. Solid-St. Circ. 26, 1651–1662 (1991)
52. G. Gielen, R. Rutenbar, S. Borkar, R. Brodersen, J.-H. Chern, E. Naviasky, D. Saias, C. Sodini, Tomorrow's analog: just dead or just different? in 43rd ACM/IEEE Design Automation Conference (San Francisco, CA, 2006), pp. 709–710
53. J.N. Coleman, E.I. Chester, A 32-bit logarithmic arithmetic unit and its performance compared to floating-point, in Proceedings of 14th IEEE Symposium on Computer Arithmetic (Adelaide, Australia, 1994), pp. 142–151
54. C. Gao, Hardware architectures and implementations for associative memories – The building blocks of hierarchically distributed memories, Ph.D. dissertation, Department of Electrical and Computer Engineering, Portland State University, Portland, OR, Nov 2008
55. P. Narayanan, T. Wang, M. Leuchtenburg, C.A. Moritz, Comparison of analog and digital nanosystems: Issues for the nano-architect, in Proceedings of 2nd IEEE International Nanoelectronics Conference (Shanghai, China, 2008), pp. 1003–1008
56. D. George, J. Hawkins, Belief propagation and wiring length optimization as organizing principles for cortical microcircuits (Numenta Inc., Menlo Park, CA, 2007), http://www.stanford.edu/~dil/invariance/Download/CorticalCircuits.pdf
57. K.K. Likharev, Hybrid CMOS/nanoelectronic circuits: opportunities and challenges. J. Nanoelectron. Optoelectron. 3, 203–230 (2008)
58. C. Gao, D. Hammerstrom, CMOL based cortical models, in Emerging Brain-Inspired Nano-Architectures, ed. by V. Beiu, U. Rückert (World Scientific, Singapore, 2008, in press)
59. M. Holler, S. Tam, H. Castro, R. Benson, An electrically trainable artificial neural network (ETANN) with 10240 "floating gate" synapses, in International Joint Conference on Neural Networks (San Diego, CA, 1989), pp. 191–196
60. S.H. Jo, K.-H. Kim, W. Lu, Programmable resistance switching in nanoscale two-terminal devices. Nano Lett. 9, 496–500 (2009)

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

1. Department of Electrical and Computer Engineering, Associate Dean – Maseeh College of Engineering and Computer Science, Portland State University, Portland, USA
2. Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, India
