The term hierarchical temporal memory (HTM) describes a specific realization of the Thousand Brains Theory. HTM builds models of physical objects and conceptual ideas to make predictions, and it generates motor commands to interact with the surroundings and test the predictions. The continuous testing allows HTM to update the predictive models and, thus, its knowledge, leading to intelligent behavior in an ever-changing world [25,26,27, 30].
This section first explains how HTM’s general structure models the neocortex. It then discusses each HTM part in more detail. To provide a compact, understandable description of HTM, we simplify the neuroscience, keeping only essential information.
General structure
We first consider the building blocks of the neocortex, the neurons. Many neurons in the neocortex are excitatory, while others are inhibitory. When an excitatory neuron fires, it causes other neurons to fire. If an inhibitory neuron fires, it prevents other neurons from firing. The HTM model includes a mechanism to inhibit neurons, but it focuses on the excitatory neurons’ functionality since about 80% of the neurons in the neocortex are excitatory [23, Ch. 3]. Because the pyramidal neurons [31] constitute the majority of the excitatory neurons, HTM contains abstract pyramidal neurons called HTM neurons.
The HTM model consists of regions of HTM neurons. The regions are divided into vertical cortical columns [29], as shown in Fig. 2. All cortical columns have the same laminar structure with six horizontal layers on top of each other (Fig. 3). Five of the layers contain mini-columns [32] of HTM neurons. A neuron in a mini-column connects to many other neurons in complicated ways (not shown in Fig. 3). A mini-column in the neocortex can span multiple layers. All cortical columns run essentially the same learning algorithm, the previously mentioned common cortical algorithm, based on their common circuitry.
The HTM regions connect in approximate hierarchies. Figure 4 illustrates two imperfect hierarchies of vertically connected regions with horizontal connections between the hierarchies. All regions in a hierarchy integrate sensory and motor information. Information flows up and down in a hierarchy and both ways between hierarchies.
Building on the general structure of HTM, the rest of the section explains how the HTM parts fulfill the previously listed data structure properties (sparse data representations, realistic neuron model, and reference frames) and the architectural properties (continuous online learning, sensorimotor integration, and single general-purpose algorithm). It also describes how the HTM parts depend on each other.
Sparse distributed representations (SDRs)
Empirical evidence shows that every region of the neocortex represents information using sparse activity patterns made up of a small percentage of active neurons, with the remaining neurons being inactive. An SDR is a set of binary vectors in which the few 1s represent active neurons and the 0s represent inactive neurons. The small percentage of 1s, denoted the sparsity, varies from less than one percent to several percent. SDRs are the primary data structure used in the neocortex and are used everywhere in HTM systems. HTM does not use a single type of SDR but distinct types for various purposes.
While a bit position in a dense representation like ASCII has no semantic meaning, each bit position in an SDR represents a particular property. The semantic meaning depends on what the input data represents. Some bits may represent edges or big patches of color; others might correspond to different musical notes. Figure 5 shows a somewhat contrived but illustrative example of an SDR representing parts of a zebra. If we flip a single bit in a vector from a dense representation, the vector may take an entirely different value. In an SDR, nearby bit positions represent similar properties, so if we invert a bit, the description changes, but not radically.
The mathematical foundation for SDRs and their relationship to the HTM model is described in [33,34,35]. SDRs are crucial to HTM. Unlike dense representations, SDRs are robust to large amounts of noise. SDRs allow HTM neurons to store and recognize a dynamic set of patterns from noisy data. Taking unions of sparse vectors in SDRs makes it possible to perform multiple simultaneous predictions reliably. The properties described in [33, 34] also determine what parameter values to use in HTM software. Under the right set of parameters, SDRs enable a massive capacity to learn temporal sequences and form highly robust classification systems.
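These properties can be illustrated with a short sketch. The vector length and number of active bits below are illustrative choices (2048 bits with roughly 2% active is a configuration often cited in the HTM literature, but nothing here depends on the exact values):

```python
import numpy as np

rng = np.random.default_rng(0)
N, W = 2048, 40          # vector length and number of active bits (~2% sparsity)

def random_sdr():
    """Return a binary vector with exactly W active bits."""
    v = np.zeros(N, dtype=np.int8)
    v[rng.choice(N, size=W, replace=False)] = 1
    return v

a, b = random_sdr(), random_sdr()

# Overlap (the number of shared active bits) measures similarity; two
# random SDRs almost never overlap much, so false matches are rare.
overlap = int(a @ b)

# A union of several SDRs stays sparse enough that membership of any
# stored pattern can still be tested reliably, which is the basis for
# representing multiple simultaneous predictions.
stored = [random_sdr() for _ in range(10)]
union = np.bitwise_or.reduce(stored)
is_member = int(stored[3] @ union) == W   # all W bits of a stored pattern survive
print(overlap, is_member)
```

The high dimensionality is what makes the union trick work: with 2048 bits and 2% sparsity, even a union of ten patterns remains sparse enough that an unrelated pattern is very unlikely to match it by chance.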
Every HTM system needs the equivalent of human sensory organs. We call them “encoders.” A set of encoders allows implementations to encode data types such as dates, times, and numbers, including coordinates, into SDRs [36]. These encoders enable HTM-based systems to operate on data other than what humans receive through their senses, opening up the possibility of intelligence in areas human senses do not cover. One example is small intelligent machines operating at the molecular level; another is intelligent machines operating in environments toxic to humans.
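As an illustration of the encoder idea, here is a minimal scalar encoder sketch in the spirit of [36]: a number maps to a contiguous run of active bits, so nearby values share many active bits and distant values share none. The parameter values are our own illustrative choices, not those of any particular implementation:

```python
import numpy as np

def scalar_encode(value, min_val=0.0, max_val=100.0, n_bits=120, w=21):
    """Encode a scalar as an SDR: a run of w active bits whose position
    reflects the value, so semantically similar inputs overlap."""
    value = min(max(value, min_val), max_val)          # clip to the valid range
    span = n_bits - w                                  # possible start positions
    start = int(round((value - min_val) / (max_val - min_val) * span))
    sdr = np.zeros(n_bits, dtype=np.int8)
    sdr[start:start + w] = 1
    return sdr

# Nearby values overlap heavily; distant values share no bits.
a, b, c = scalar_encode(40.0), scalar_encode(42.0), scalar_encode(90.0)
print(int(a @ b), int(a @ c))   # -> 19 0
```

The key design point is that the encoding preserves semantic similarity in bit overlap, which dense codes like ASCII or raw floating-point representations do not.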
HTM neurons
Biological neurons are pattern recognition systems that receive a constant stream of sparse inputs and send outputs to other neurons as electrical spikes known as action potentials [37]. Pyramidal neurons, the most common neurons in the neocortex, are quite different from the typical neurons modeled in deep learning systems. Deep learning uses so-called point neurons, which compute a weighted sum of scalar inputs and send out scalar outputs, as shown in Fig. 6a. Pyramidal neurons are significantly more complicated. They contain separate and independent zones that receive diverse information and have various spatiotemporal properties [38].
Figure 6b illustrates how the HTM neuron models the structure of pyramidal neurons (the synapses and details about the signal processing are left out). The HTM neuron receives sparse input from artificial dendrites, segregated into areas called the apical (feedback signals from upper layers), basal (signals within the layer), and proximal (feedforward input) integration zones. Each dendrite in the apical and basal integration zones is an independent processing entity capable of recognizing different patterns [37].
In pyramidal neurons, when one or more dendrites in the apical or basal zone detect a pattern, they generate a voltage spike that travels to the pyramidal cell body or soma. These dendritic spikes do not directly create an action potential but instead cause a temporary increase in the cell body’s voltage, making it primed to respond quickly to subsequent feedforward input. The HTM neuron models these changes using one of three states: active, predictive, or inactive. In the active state, the neuron outputs a 1 on the artificial axon analogous to an action potential; in the other states, it outputs a 0. Patterns detected on the proximal dendrite drive the HTM neuron into the active state, representing a natural pyramidal neuron firing. Pattern matching on the basal or apical dendrites moves the neuron into the predictive state, representing a primed cell body that is not yet firing.
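The three-state logic can be sketched as follows. Each dendrite segment is simplified to a set of input-bit indices, and the detection threshold is an arbitrary illustrative value:

```python
def segment_matches(segment, active_inputs, threshold):
    """A dendrite segment 'recognizes' a pattern when enough of the
    input bits it connects to are currently active."""
    return len(segment & active_inputs) >= threshold

def neuron_state(proximal, basal_segments, apical_segments,
                 feedforward, context, threshold=8):
    """Return 'active', 'predictive', or 'inactive' for one HTM neuron.
    Feedforward input on the proximal dendrite drives activation;
    basal/apical matches only prime the neuron (predictive state)."""
    if segment_matches(proximal, feedforward, threshold):
        return "active"                     # analogous to firing an action potential
    if any(segment_matches(s, context, threshold)
           for s in basal_segments + apical_segments):
        return "predictive"                 # depolarized: primed but silent
    return "inactive"

# Toy usage: the proximal dendrite sees too few active bits to fire,
# but one basal segment recognizes the current context.
proximal = set(range(20))
basal = [set(range(100, 110))]
state = neuron_state(proximal, basal, [], feedforward={1, 2, 3},
                     context=set(range(100, 110)))
print(state)   # -> predictive
```

Note how this mirrors the biology described above: only a proximal (feedforward) match produces output; dendritic spikes on basal or apical segments change internal state without producing a 1 on the axon.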
Continuous online learning
To understand how HTM learns time-based sequences or patterns, we consider how the network operates before describing the connection-based learning. Consider a layer in a cortical column. Figure 7 depicts a layer of mini-columns with interconnected HTM neurons (connections not shown). This network receives a part of a noisy sequence at each time instance, given by a sparse vector from an SDR encoder. All HTM neurons in a mini-column receive the same subset of bits on their proximal dendrites, but different mini-columns receive different subsets of the vector. The mini-columns activate neurons, generating 1s, to represent the sequence part. (Here, we assume that the network has learned a consistent representation that removes noise and other ambiguities.)
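The columnar competition can be sketched as follows. Each mini-column samples a fixed random subset of the input bits with its proximal dendrites, and fast inhibition keeps only the best-matching columns active. The pool sizes and number of winners are illustrative choices; real implementations add mechanisms such as boosting and local inhibition radii that we omit:

```python
import numpy as np

rng = np.random.default_rng(1)
n_input, n_columns, k_active = 200, 64, 4

# Each mini-column connects its proximal dendrites to a fixed random
# subset of the input bits (its "potential pool").
pools = [rng.choice(n_input, size=30, replace=False) for _ in range(n_columns)]

def active_columns(input_sdr):
    """Score each mini-column by its overlap with the input, then let
    inhibition keep only the top-k columns active, yielding a sparse
    columnar representation of the current input."""
    scores = np.array([input_sdr[pool].sum() for pool in pools])
    winners = np.argsort(scores)[-k_active:]
    return set(winners.tolist())

x = np.zeros(n_input, dtype=np.int8)
x[rng.choice(n_input, size=10, replace=False)] = 1
cols = active_columns(x)
print(sorted(cols))   # a sparse set of k_active winning mini-columns
```

The output is itself sparse, so the same input at the next time instance reliably re-activates the same small set of columns, which is what lets the layer form a consistent, noise-tolerant representation.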
Each HTM neuron predicts its activation, i.e., moves into the predictive state in various contexts by matching different patterns on its basal or apical dendrites. Figure 8 depicts dark gray neurons in the predictive state. According to this prediction, input on the feedforward dendrites will move the neuron into the active state in the following time instance. To illustrate how predictions occur, let a network’s context be the states of its neurons at time \(t-1\). Some neurons move to the predictive state at time \(t\) based on the previous context (Fig. 8a). If the context contains feedback projections from higher levels, the network forms predictions based on high-level expectations (Fig. 8b). In both cases, the network makes temporal predictions.
Learning occurs by reinforcing those connections that are consistent with the predictions and penalizing connections that are inconsistent. HTM creates new connections when predictions are mistaken [37]. When HTM first creates a network of mini-columns, it randomly generates potential connections between the neurons. HTM assigns a scalar value called “permanence” to each connection. The permanence takes on values from zero to one. The value represents the longevity of a connection as illustrated in Fig. 9.
If the permanence value is close to zero, there is a potential for a connection, but it is not operative. If the permanence value exceeds a threshold, such as 0.3, then the connection becomes operative, but it could quickly disappear. A value close to one represents an operative connection that will last for a while. A (Hebbian-like) rule, using only information local to a neuron, increases and decreases the permanence value. Note that while neural networks with point neurons have fixed connections with real weights, the operative connections in HTM networks have weight one, and the inoperative connections have weight zero.
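The permanence mechanism can be written down in a few lines. The increment, decrement, and threshold values below are illustrative; the source only fixes the range [0, 1] and an example threshold of 0.3:

```python
CONNECT_THRESHOLD = 0.3   # above this, a potential connection is operative

def hebbian_update(permanence, presynaptic_active, correctly_predicted,
                   inc=0.05, dec=0.02):
    """Hebbian-like, purely local permanence update: reinforce a
    connection whose presynaptic neuron contributed to a correct
    prediction, weaken it otherwise. Values are clipped to [0, 1]."""
    if presynaptic_active and correctly_predicted:
        permanence += inc
    elif presynaptic_active:
        permanence -= dec
    return min(1.0, max(0.0, permanence))

def weight(permanence):
    """Operative connections have weight one, inoperative ones weight zero."""
    return 1 if permanence >= CONNECT_THRESHOLD else 0

p = 0.28                          # potential connection, not yet operative
p = hebbian_update(p, True, True) # consistent with a correct prediction
print(round(p, 2), weight(p))     # -> 0.33 1
```

The binary weights are the point of the design: learning changes which connections exist (via permanence) rather than tuning real-valued weights, in contrast to point-neuron networks.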
The whole HTM network operates and learns as follows. It receives a new consecutive part of a noisy sequence at each time instance. Each mini-column with HTM neurons models a competitive process in which neurons in the predictive state emit spikes sooner than inactive neurons. HTM then deploys fast local inhibition to prevent the inactive neurons from firing, biasing the network toward the predictions. The permanence values are then updated before the process repeats at the next time instance.
In short, a cyclic sequence of activations, leading to predictions, followed by activations again, forms the basis of HTM’s sequence memory. HTM continually learns sequences with structure by verifying its predictions. When the structure changes, the memory forgets the old structure and learns the new one. Since the system learns by confirming predictions, it does not require any explicit teacher labels. HTM keeps track of multiple candidate sequences with common subsequences until further input identifies a single sequence. The use of SDRs makes networks robust to noisy input, natural variations, and neuron failures [35, 37].
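The activate → predict → learn cycle can be sketched end to end. Real HTM operates on SDRs with per-cell context, which is what lets it keep multiple candidate sequences with shared subsequences apart; this deliberately simplified first-order sketch over symbols only illustrates the cycle itself, with permanence values we chose for the example:

```python
THRESHOLD = 0.3   # permanence above which a transition is operative

class TinySequenceMemory:
    def __init__(self, inc=0.1, dec=0.05):
        self.perm = {}            # (prev, nxt) -> permanence
        self.inc, self.dec = inc, dec

    def predict(self, current):
        """Symbols whose connection from `current` is operative."""
        return {nxt for (prev, nxt), p in self.perm.items()
                if prev == current and p >= THRESHOLD}

    def learn(self, prev, nxt):
        """Reinforce the verified transition; decay competing ones."""
        self.perm[(prev, nxt)] = min(1.0, self.perm.get((prev, nxt), 0.2) + self.inc)
        for key in self.perm:
            if key[0] == prev and key[1] != nxt:
                self.perm[key] = max(0.0, self.perm[key] - self.dec)

mem = TinySequenceMemory()
for _ in range(3):                          # stream the sequence repeatedly
    for prev, nxt in zip("ABCD", "BCDA"):
        mem.learn(prev, nxt)
print(mem.predict("A"))   # -> {'B'}
```

No labels appear anywhere: the only teaching signal is whether the prediction made at the previous step was confirmed by the next input, and if the sequence changes, the decay term gradually forgets the old structure.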
Reference frames
We have outlined how networks of mini-columns learn predictive models of changing sequences. Here, we start to address how HTM learns predictive models of static objects, where the input changes due to sensor movements. We consider how HTM uses allocentric reference frames, i.e., frames anchored to objects [24, 39]. Sensory features are related to locations in this object-centric reference frame. Changes due to movement are then mapped to changes of locations in the frame, enabling predictions of what sensations to expect when sensors move over an object.
To understand the distinction between typical deep learning representations and a reference frame-based representation, consider JPEG images of a coffee mug vs. a 3D CAD model of the mug. On the one hand, an AI system that uses just images would need to store hundreds of pictures taken at every possible orientation and distance to make detailed predictions. It would need to memorize the impact of movements and other changes for each object separately. Deep learning systems today use such brute-force image-based representations.
On the other hand, an AI system that uses a 3D CAD representation could make detailed predictions once it has inferred the orientation and distance. It only needs to learn the impact of movements once, which applies to all reference frames. Such a system could then efficiently predict what would happen when a machine’s sensor, such as an artificial fingertip, moves from one point on the mug to another, independent of how the mug is oriented relative to the machine.
Every cortical column in HTM maintains allocentric reference frame models of the objects it senses. Sparse vectors represent locations in the reference frames. A network of HTM neurons uses these sparse location signals as context vectors to make detailed sensory predictions (Fig. 8c). Movements change the location signal, which in turn leads to new predictions.
HTM generates the location signal by modeling grid cells, first discovered in the entorhinal cortex, the primary interface between the hippocampus and neocortex [40]. Animals use grid cells for navigation; the cells represent the body’s location in an allocentric reference frame, namely that of the external environment. As an animal moves around, the cells use internal motion signals to update the location signal. HTM proposes that every cortical column in the neocortex contains cells analogous to entorhinal grid cells. Instead of representing the body’s location in the reference frame of the environment, these cortical grid cells represent a sensor’s location in the object’s reference frame. The activity of the grid cells constitutes a sparse location signal used to predict sensory input (Fig. 8c).
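The essential computation of a grid-cell module is path integration: the location estimate is updated from motion alone, without new sensory input. A common simplification, used here, models a module as a 2-D phase on a torus that movement shifts; the module scale is an illustrative parameter:

```python
import numpy as np

# One grid-cell-like module, sketched as a 2-D phase on a torus: the
# module tiles space with a fixed period, and movement shifts the phase,
# so location can be tracked from motion signals alone (path integration).
MODULE_SCALE = 10.0     # spatial period of this module (illustrative)

def path_integrate(phase, movement):
    """Update the module's phase from a 2-D movement vector."""
    return (phase + np.asarray(movement) / MODULE_SCALE) % 1.0

phase = np.array([0.0, 0.0])
for step in [(3.0, 0.0), (4.0, 2.0), (3.0, -2.0)]:   # moves summing to (10, 0)
    phase = path_integrate(phase, step)
print(np.round(phase, 6))   # one full period along x brings the phase back
```

Because a single module is periodic, one phase is ambiguous about absolute location; biological systems (and HTM models of them) combine several modules with different scales so that the joint activity pattern identifies a unique location.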
Sensorimotor integration
Sensorimotor integration occurs in every cortical column of the neocortex. Each cortical column receives sensory input and sends out motor commands [41]. In HTM, sensorimotor integration allows every cortical column to build reference-frame models of objects [24, 39, 42]. Each cortical column in HTM contains two layers called the sensory input layer and the output layer, as shown in Fig. 10. The sensory input layer receives direct sensory input and contains mini-columns of HTM neurons, while the output layer contains HTM neurons that represent the sensed object. The sensory input layer learns specific object feature/location combinations, while the output layer learns representations corresponding to objects (see [42] for details).
During inference, the sensory input layer of each cortical column in Fig. 10 receives two sparse vector signals: first, a location signal computed by grid cells in the lower half of the cortical column (not shown) [24, 30, 39, 43]; second, feedforward sensory input from a unique sensor array, such as an area of an artificial retina. The input layer combines sensory input and location input to form sparse representations that correspond to features at specific locations on the object. Thus, the cortical column knows both what features it senses and where the sensor is on the object.
The output layer receives feedforward inputs from the sensory input layer and converges to a stable pattern representing the object. The output layer achieves convergence in two ways: (1) by integrating information over time as the sensors move relative to the object and (2) spatially via lateral (sideways) connections between columns that simultaneously sense different locations on the same object. The lateral connections across the output layers permit HTM to quickly resolve ambiguity and deduce objects based on adjacent columns’ partial knowledge. Finally, feedback from the output layer to the sensory input layer allows the input layer to predict more precisely what feature will be present after the sensor’s subsequent movement.
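The spatial route to convergence can be sketched as a voting process: each column maintains a set of objects consistent with its own feature-at-location evidence, and intersecting these sets across laterally connected columns resolves ambiguity. The objects and evidence below are, of course, invented for illustration:

```python
def resolve(column_candidates):
    """Intersect per-column candidate sets; ideally one object remains.
    Each set holds the objects consistent with that column's own
    feature-at-location evidence."""
    return set.intersection(*column_candidates)

# Three columns touch different spots on the same object; each column
# alone is ambiguous, but their lateral "votes" agree on one object.
votes = [
    {"mug", "bowl", "vase"},   # column 1 senses a curved surface
    {"mug", "vase"},           # column 2 senses a rim
    {"mug", "bowl"},           # column 3 senses a flat bottom
]
print(resolve(votes))   # -> {'mug'}
```

This is why multi-column inference is fast: no single column needs enough evidence to identify the object on its own, since the intersection shrinks rapidly as columns contribute partial knowledge.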
The resulting object models are stable and invariant to a sensor’s position relative to the object, or equivalently, the object’s position relative to the sensor. All cortical columns in any region of HTM, even columns in the low-level regions (Fig. 4), can learn representations of complete objects through sensors’ movement. Simulations show that a single column can learn to recognize hundreds of 3D objects, with each object containing tens of features [42]. The invariant models enable HTM to learn with very few examples since the system does not need to sense every object in every possible configuration.
The need for cortical hierarchies
Because the spatial extent of the lateral connections between a region’s columnar output layers limits the ability to learn expansive objects, HTM uses hierarchies of regions to represent large objects or combine information from multiple senses. To illustrate, when a person sees and touches a boat, many cortical columns in the visual and somatosensory hierarchies, as illustrated in Fig. 4, observe different parts of the boat simultaneously. All cortical columns in each of the two hierarchies learn models of the boat by integrating over movements of sensors. Due to the non-hierarchical connections, represented by the horizontal connections in Fig. 4, inference occurs with the movement of the different types of sensors, leading to rapid model building. Observe that the boat models in the cortical columns differ because the columns receive different information depending on their location in the hierarchy, the information processing in earlier regions, and the signals on lateral connections.
Whenever HTM learns a new object, a type of neuron, called a displacement cell, enables HTM to represent the object as a composition of previously learned objects [24]. For example, a coffee cup is a composition of a cylinder and a handle arranged in a particular way. This object compositionality is fundamental because it allows HTM to learn new physical and abstract objects efficiently without continually learning from scratch. Many objects exhibit behaviors. HTM discovers an object’s behavior by learning the sequence of movements tracked by displacement cells. Note that the resulting behavioral models are not, first and foremost, internal representations of an external world but rather tools used by the neocortex to predict and experience the world.
Toward a common cortical algorithm
Cortical columns in the neocortex, regardless of their sensory modality or position in a hierarchy, contain almost identical biological circuitry and perform the same basic set of functions. The circuitry in all layers of a column defines the common cortical algorithm. Although the complete common cortical algorithm is unknown, the current version of HTM models fundamental parts of this algorithm.
Section 5.4 describes how a single layer in a cortical column learns temporal sequences. Different layers use this learning technique with varying amounts of neurons and different learning parameters. Section 5.5 introduces reference frames, and Sect. 5.6 outlines how a cortical column uses reference frames. Finally, Sect. 5.7 describes how columns cooperate to infer quickly and create composite models of objects by combining previously learned models. A reader wanting more details about the common cortical algorithm’s data formats, architecture, and learning techniques should study references [24, 30, 33,34,35,36,37, 39, 42, 44, 45].
Although HTM does not provide a complete description of the common cortical algorithm, we can summarize its novel ideas, focusing on how grid and displacement cells allow the algorithm to create predictive models of the world. The new aspects of the model building are [24, 30]:
- Cortical grid cells provide every cortical column with a location signal needed to build object models.
- Since every cortical column can learn complete models of objects, an HTM system creates thousands of models simultaneously.
- A new class of neurons, called displacement cells, enables HTM to learn how objects are composed of other objects.
- HTM learns the behavior of an object by learning the sequence of movements tracked by displacement cells.
- Since all cortical columns run the same algorithm, HTM learns conceptual ideas the same way it learns physical objects: by creating reference frames.
Each cortical column in HTM builds models independently, as if it were a brain in itself. The name Thousand Brains Theory refers to the many independent models of an object at all levels of the region hierarchies and the extensive integration between the columns [24, 30].