# Random indexing of multidimensional data

## Abstract

Random indexing (RI) is a lightweight dimension reduction method, which is used, for example, to approximate vector semantic relationships in online natural language processing systems. Here we generalise RI to multidimensional arrays and therefore enable approximation of higher-order statistical relationships in data. The generalised method is a sparse implementation of random projections, which is the theoretical basis also for ordinary RI and other randomisation approaches to dimensionality reduction and data representation. We present numerical experiments which demonstrate that a multidimensional generalisation of RI is feasible, including comparisons with ordinary RI and principal component analysis. The RI method is well suited for online processing of data streams because relationship weights can be updated incrementally in a fixed-size distributed representation, and inner products can be approximated on the fly at low computational cost. An open source implementation of generalised RI is provided.

### Keywords

Data mining, Random embeddings, Dimensionality reduction, Sparse coding, Semantic similarity, Streaming algorithm, Natural language processing

## 1 Introduction

There is a rapid increase in the annual amount of data that is produced in almost all domains of science, industry, economy, medicine and even everyday life. We have surpassed a critical point where more data are generated than we can physically store. Choosing which data to archive and process and which to discard is necessary in data-intensive applications. That trend motivates the development of new methods for data representation and analysis [2, 23, 49].

One interesting approach to analyse large data sets is to search for outstanding relationships between “features” in the data. Problems of that type naturally appear in the form of context- or time-dependent relationships. For example, the co-occurrence of words in articles, blogs and so on is one type of relationship that carries information about language use and evolution over time [46]. Similarly, co-occurrence analysis can be used to investigate the general opinion about things, for example public events or politicians, and how the opinion changes over time [51]. The analysis requires averaging over many instances of relationships in order to identify frequent or otherwise significant patterns in a noise-like background. That is a non-trivial problem because the number of possible relationships between elements of the sets \(A_i\) scales like \(\mathcal{O}(\Pi _i |A_i|)\), where \(|A_i|\) denotes the cardinality of the set \(A_i\). In the example of online text analysis \(|A|\sim 10^5\), which implies \(\sim \!\!10^{10}\) co-occurrence weights that evolve over time and typically depend on additional context variables of interest. Therefore, the number of relationship weights that need to be stored and updated in such applications can be astronomical, and the analysis prohibitive given the large size of the data representation.

This is the motivation of random indexing (RI) [31], which is a random-projection method that solves such problems by incrementally generating distributional representations that approximate similarities in sets of co-occurrence weights. For example, in the context of natural language processing the RI method is used to compress large word–document or word–context co-occurrence matrices [52]. This is done by associating each document or context with a sparse random ternary vector of high dimensionality [29, 30], a so-called index vector. Each word is also represented by a high-dimensional vector of integers, a so-called distributional vector. These distributional vectors are initially set to zero, and for each appearance of a particular word in a context, the index vector of that context is added to the distributional vector of the word. The result of this incremental process is that words that appear in similar contexts get similar distributional vectors, indicating that they are semantically related [44]. Therefore, the analysis of semantic similarity can be performed by comparing the compressed distributional vectors in terms of inner products, instead of analysing and storing the full co-occurrence matrix. The distributional vectors can be updated on the fly in streaming applications by adding the appropriate sparse index vectors and co-occurrence weights to the distributional vectors. See Sahlgren [42, 43] for further details.

LSA [14] and HAL [36] are two other prominent examples of vector-space models [52] used for semantic analysis of text. In these methods, a co-occurrence matrix is explicitly constructed, and then singular value decomposition (SVD) is used to identify the semantic relationships between terms (see [8] for recent examples). This process requires significant storage space for the full co-occurrence matrix, and it is computationally costly. The SVD can be calculated using parallel and iterative methods optimised for sparse matrices [4], but the computational cost still prevents the processing of large and streaming data sets [11]. In contrast, RI easily scales to large corpora such as the MEDLINE collection of approximately 9 million abstracts [10]. Another approach known as locality-sensitive hashing (LSH) [7] is compared with RI on a distributional similarity task by Gorman and Curran [21], showing that RI outperforms LSH in terms of efficiency and accuracy when the problem size increases. RI requires a fraction of the memory and processing power of LSA and HAL [11], but is comparable with models based on SVD in terms of accuracy. For example, the accuracy of RI is comparable to SVD-based methods in a TOEFL synonym identification task [31], and that result has been further improved in the case of RI [45]. RI of co-occurrence matrices for semantic analysis works surprisingly well [11, 30, 43, 52], and the method has been adopted in other applications, such as indexing of literature databases [54], event detection in blogs [26], web user clustering and page prefetching [57], graph searching for the semantic web [12], diagnosis code assignment to patients [24], predicting speculation in biomedical literature [55] and failure prediction [19]. In general, there is an increasing interest in randomisation in information processing because it enables the use of simple algorithms, which can be organised to exploit parallel computation in an efficient way [6, 22].

The practical usefulness of RI is also demonstrated by several implementations in public software packages such as the S-Space Package [27] and the Semantic Vectors Package [58], and extensions of the basic method to new domains and problems [26, 54]. Therefore, it is natural to ask whether the RI algorithm can be generalised to higher-order relationships and distributional arrays.

In the next section we generalise RI of vectors to RI of multidimensional data in the form of matrices and higher-order arrays. Subsequently, we present results of simulation experiments of ordinary and generalised RI demonstrating some properties of the generalised method, including a comparison with principal component analysis (PCA). PCA and similar approximation methods for higher-order arrays such as Tucker decomposition [34] are expected to result in higher signal-to-noise ratio (SNR) than RI when applicable because the dimension reduction is optimised to minimise the residual variance. However, such methods are more complex and target another application domain. We conclude that the possibility to incrementally encode and analyse general co-occurrence relationships at low computational cost using a distributed representation of approximately fixed size makes generalised RI interesting for online processing of data streams.

## 2 Method

In (1), *c* is the number of items surrounding a word that defines the context window, \(w(x_j)\) is a weight function that quantifies the importance of a context item \(x_j\), and \(\pi _j\) is an optional permutation operator that makes the context window word-order dependent [45]. In this way, RI can, for example, be used to identify words that appear in similar contexts by analysing the inner products of the distributional vectors, \(\mathbf {s}\), thereby greatly simplifying the co-occurrence analysis problem outlined above.
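
The context-window accumulation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the alternating choice of trit signs and the default parameters are hypothetical, and the optional permutation \(\pi _j\) is omitted.

```python
import random


def make_index_vector(length, nonzeros, rng):
    """Return a sparse ternary index vector as (position, sign) pairs."""
    positions = rng.sample(range(length), nonzeros)
    # Assumption: the nonzero trits alternate between +1 and -1.
    return [(pos, 1 if k % 2 == 0 else -1) for k, pos in enumerate(positions)]


def train_one_way_ri(tokens, length=1000, nonzeros=8, window=2,
                     weight=lambda word: 1, seed=0):
    """Accumulate distributional vectors from a token sequence."""
    rng = random.Random(seed)
    index = {}  # word -> sparse index vector
    dist = {}   # word -> dense distributional vector, initially zero
    for word in tokens:
        if word not in index:
            index[word] = make_index_vector(length, nonzeros, rng)
            dist[word] = [0] * length
    for i, focus in enumerate(tokens):
        for j in range(-window, window + 1):
            if j == 0 or not 0 <= i + j < len(tokens):
                continue
            context = tokens[i + j]
            # Add the weighted index vector of the context item.
            for pos, sign in index[context]:
                dist[focus][pos] += weight(context) * sign
    return index, dist


index, dist = train_one_way_ri("the cat sat on the mat".split())
print(sum(1 for v in dist["cat"] if v != 0))  # number of touched states
```

Only the sparse index vectors and the fixed-length distributional vectors are kept in memory; the full word-by-word co-occurrence matrix is never built.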

In the following we refer to RI of vectors as one-way RI and generalise one-way RI to *n*-way RI of arrays \(a_{i_1, i_2, i_3, \ldots , i_\mathcal{N}}\) of arbitrary order, meaning that there are *n* sets of index vectors, one associated with each dimension of the array. We focus on the core RI mechanism and omit extensions like the word-order-dependent permutation introduced above in order to make the presentation more accessible. Array elements are denoted with \(a_{i_1, i_2, i_3, \ldots , i_\mathcal{N}}\), or \(a_{\bar{i}}\) for short, and the indices \(\{i_1, i_2, i_3,\ldots , i_\mathcal{N}\}\) are used in array element space. The array elements are encoded in a distributed fashion in *states* that are denoted with \(s_{\alpha _1, \alpha _2, \alpha _3, \ldots , \alpha _\mathcal{N}}\), or \(s_{\bar{\alpha }}\) for short. The indices \(\{\alpha _1, \alpha _2, \alpha _3, \ldots , \alpha _\mathcal{N}\}\) are used in state space. We use the notation \(i_\mathcal{D}\) when referring to indices of the array space and \(\alpha _\mathcal{D}\) when referring to indices of the state space, where \(\mathcal{D}\) is the dimension index. For vectors \(\mathcal{D}=1\), for matrices \(\mathcal{D} \in \{1,2\}\) and in general \(\mathcal{D} \in [1,\mathcal{N}]\). When necessary we use one additional index, \(j_\mathcal{D}\), in array element space. Similarly, one additional state-space index, \(\beta _\mathcal{D}\), is used when necessary.

States have physical representations that are stored in memory, but they are only accessed using particular decoder and encoder functions (introduced below) which generalise (1) for vectors to arrays of arbitrary order. The array elements are related to the states by a random projection [56] mechanism and constitute the input to the encoder function and the output from the decoder function, respectively. The order of the state array, \(\mathcal N\), is equal to that of the array. The core idea is that the state array can be of significantly smaller size than the array itself and that approximate vector semantic analysis can be performed in state space at low computational cost. Similarly, the set of distributional vectors, \(\mathbf {s}\), in (1) has few elements compared to the full co-occurrence matrix. This possibility follows from the Johnson–Lindenstrauss lemma [25], which describes the mapping of points in a high-dimensional space to a space of lower dimension so that the distances between points are approximately preserved. Further developments of the Johnson–Lindenstrauss lemma and related applications can be found, for example, in [1, 13, 18, 28, 38].
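
For orientation, the Johnson–Lindenstrauss lemma can be written in its standard textbook form as follows. The symbols are chosen here for illustration and are not taken from the paper: \(X\) is a set of \(m\) points in the high-dimensional space, \(f\) is a linear map to a space of dimension \(L\), and \(0< \delta < 1\) is the tolerated relative distortion of squared distances.

```latex
(1-\delta)\,\lVert u-v\rVert^{2} \;\le\; \lVert f(u)-f(v)\rVert^{2} \;\le\; (1+\delta)\,\lVert u-v\rVert^{2}
\quad \text{for all } u, v \in X, \qquad L = \mathcal{O}\!\left(\delta^{-2}\log m\right).
```

The required dimension \(L\) grows only logarithmically with the number of points, which is why a comparatively small state space can suffice.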

### 2.1 Random indexing

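Each index \(i_\mathcal{D}\) of the array is associated with an *index vector* in dimension \(\mathcal{D}\): a sparse ternary vector of length \(L_\mathcal{D}\) in which \(\chi _\mathcal{D}\) elements are nonzero. The elements of an index vector are called *trits*. This definition simplifies to ordinary RI, \(\mathbf {r}(x_j)\) in (1), in the case of \(\mathcal{N}=1\).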
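
A sparse ternary index vector can be generated and stored compactly as a mapping from the positions of its nonzero trits to their signs. The sketch below is only an illustration: the helper names are hypothetical, and the balanced split between \(+1\) and \(-1\) trits is an assumption rather than something stated in this section.

```python
import random


def make_index_vector(length, nonzeros, rng):
    """Sparse ternary vector: {position: sign} for the nonzero trits."""
    positions = rng.sample(range(length), nonzeros)
    # Assumption: equally many +1 and -1 trits (balanced index vector).
    signs = [1] * (nonzeros // 2) + [-1] * (nonzeros - nonzeros // 2)
    rng.shuffle(signs)
    return dict(zip(positions, signs))


def sparse_dot(u, v):
    """Inner product of two sparse ternary vectors."""
    return sum(sign * v[pos] for pos, sign in u.items() if pos in v)


rng = random.Random(1)
r1 = make_index_vector(964, 8, rng)
r2 = make_index_vector(964, 8, rng)
print(sparse_dot(r1, r1))  # squared norm equals the number of nonzero trits
print(sparse_dot(r1, r2))  # small compared with the squared norm
```

Storing only the nonzero positions keeps the memory per index vector proportional to \(\chi _\mathcal{D}\) rather than \(L_\mathcal{D}\).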

Summary of parameters

Expression | Description |
---|---|
\(a_{i_1, i_2, i_3, \ldots , i_\mathcal{N}}\), \(a_{\bar{i}}\) | Array elements |
\(s_{\alpha _1, \alpha _2, \alpha _3, \ldots , \alpha _\mathcal{N}}\), \(s_{\bar{\alpha }}\) | State array, accessed by encoder/decoder functions |
\(\mathcal{N}\) | Dimensionality of array |
\(\mathcal{D}\) | Dimension index, \(1 \le \mathcal{D} \le \mathcal{N}\) |
\(N_\mathcal{D}\) | Number of index vectors in dimension \(\mathcal{D}\), \(i_\mathcal{D} \in [1,N_\mathcal{D}]\) |
\(L_\mathcal{D}\) | Length of index vectors in dimension \(\mathcal{D}\), \(\alpha _\mathcal{D}\in [1,L_\mathcal{D}]\) |
\(\chi _\mathcal{D}\) | Number of nonzero trits in index vectors of dimension \(\mathcal{D}\) |
\(S_e=\prod _\mathcal{D}\chi _\mathcal{D}\) | Number of states that encode one array element |
\(S_s\propto \prod _\mathcal{D} L_\mathcal{D}\) | Disk/memory space required to store the state array |
\(S_r\propto \sum _\mathcal{D} N_\mathcal{D}\chi _\mathcal{D}\) | Disk/memory space required to store index vectors |

The notation and definitions introduced above are a direct generalisation of ordinary RI to arrays of arbitrary order. In particular, ordinary RI is defined by \(\mathcal{N}=1\). Note that the states defined here correspond to the elements of the distributional vectors in ordinary RI, which are the hard storage locations where the distributional statistics are stored. Next we present the corresponding generalised encoding algorithm and generalised method for vector semantic analysis in terms of inner products.
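
As a worked example of the size expressions in the parameter summary, the following sketch evaluates \(S_e\), \(S_s\) and \(S_r\) for a two-way representation. The values \(\chi _\mathcal{D}=8\) and \(L_\mathcal{D}=964\) are those used in the experiments below, whereas \(N_\mathcal{D}=5000\) is a hypothetical value chosen only for illustration. \(S_s\) and \(S_r\) are given as counts of states and stored trits, that is, up to the constant of proportionality.

```python
from math import prod


def ri_sizes(num_index_vectors, lengths, nonzeros):
    """Return (S_e, S_s, S_r) as counts, one list entry per dimension."""
    s_e = prod(nonzeros)                  # states touched per array element
    s_s = prod(lengths)                   # number of states in the state array
    s_r = sum(n * c for n, c in zip(num_index_vectors, nonzeros))  # stored trits
    return s_e, s_s, s_r


s_e, s_s, s_r = ri_sizes([5000, 5000], [964, 964], [8, 8])
print(s_e, s_s, s_r)  # 64 929296 80000
```

With these values each array element is spread over 64 states, and the index vectors need far less storage than the state array.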

### 2.2 Encoding algorithm

With the encoding defined in (3), the state array can be *incrementally* updated, without modifying the random projections or recalculating substantial parts of the state array. Furthermore, this definition implies that the indices of an array element are used to select a particular set of index vectors, forming an outer product of “nearly orthogonal”, or so-called *indifferent*, index vectors in state space (see “Appendix” for further details).

The computational complexity of encoding *n* weights is \(\mathcal{O}(n)\) for RI of any order, which is lower than the complexity of streaming PCA [39]. Furthermore, new array elements (relationship weights) can be added to the representation with low impact on the representation size; see the discussion in Sect. 2.1. These two properties make the generalised RI algorithm interesting for streaming data applications.

Subtraction of \(w_{\bar{i}}\) is defined by the replacement \(w_{\bar{i}} \rightarrow -w_{\bar{i}}\) in (3). Assignment of array elements is not defined because of the distributional nature of the representation of array elements.
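
The encoding step can be illustrated as follows: the weight of an array element is multiplied by the outer product of the index vectors selected by its indices and added to the states. This is a sketch under assumptions, not the reference implementation [47]: the function name is hypothetical, the index vectors are written by hand, and a dictionary stands in for the fixed-size state array.

```python
from collections import defaultdict
from itertools import product


def encode(states, index_vectors, weight):
    """Add weight times the outer product of sparse index vectors to the states.

    states: dict mapping a state-index tuple to its accumulated value
    index_vectors: one list of (position, sign) pairs per array dimension
    """
    for combo in product(*index_vectors):
        alpha = tuple(pos for pos, _ in combo)
        sign = 1
        for _, trit in combo:
            sign *= trit
        states[alpha] += weight * sign


states = defaultdict(int)
r_row = [(0, 1), (3, -1)]  # hypothetical index vector selected by i_1
r_col = [(1, 1), (2, -1)]  # hypothetical index vector selected by i_2
encode(states, [r_row, r_col], 5)
print(dict(states))  # four states are touched, one per pair of nonzero trits
encode(states, [r_row, r_col], -5)  # subtraction: same update, negated weight
print(dict(states))  # all touched states are back to zero
```

The number of touched states per update is the product of the numbers of nonzero trits, independent of the size of the array, which is what keeps the cost per encoded weight constant.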

### 2.3 Decoding algorithm

The encoding procedure (3) is a sequence of outer products of indifferent index vectors, and the decoding procedure is the corresponding sequence of inner products. It follows from (3) and (5) that the decoded value is an exact reconstruction of the accumulated encoded weight if all index vectors are orthogonal. However, that process would be useless in the context considered here because no dimension reduction is achieved in that case. For index vectors of length \(L_\mathcal{D}\), at most \(L_\mathcal{D}\) linearly independent vectors can be constructed (a set of basis vectors). For high values of \(L_\mathcal{D}\) there are many more vectors that are approximately orthogonal (see “Appendix” for details). This makes it possible to encode and decode approximate array elements in a small state space, provided that the data are sufficiently sparse (see Sect. 3 for details).
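
The difference between orthogonal and merely overlapping index vectors can be seen in a small two-way example. The sketch below is illustrative only: the function names and the hand-written index vectors are hypothetical, and the normalisation by the product of the numbers of nonzero trits is an assumption that follows from each index vector having squared norm \(\chi _\mathcal{D}\).

```python
def outer_update(state, r1, r2, weight):
    """Encode: add weight times the outer product of r1 and r2 to the state."""
    for p1, s1 in r1:
        for p2, s2 in r2:
            state[p1][p2] += weight * s1 * s2


def decode(state, r1, r2):
    """Decode: inner product of the state with the outer product of r1 and r2."""
    total = sum(state[p1][p2] * s1 * s2 for p1, s1 in r1 for p2, s2 in r2)
    return total / (len(r1) * len(r2))  # assumed normalisation


size = 6
state = [[0] * size for _ in range(size)]
# Index vectors with disjoint supports are exactly orthogonal.
row = {0: [(0, 1), (1, -1)], 1: [(2, 1), (3, -1)]}
col = {0: [(0, 1), (1, -1)], 1: [(2, 1), (3, -1)]}
outer_update(state, row[0], col[0], 7)
outer_update(state, row[1], col[1], 3)

print(decode(state, row[0], col[0]))  # 7.0, exact reconstruction
print(decode(state, row[0], col[1]))  # 0.0, nothing was encoded here

# An index vector that overlaps the others picks up crosstalk.
overlapping = [(1, 1), (2, -1)]
print(decode(state, overlapping, col[0]))  # -3.5, although nothing was encoded
```

With random sparse index vectors in a large state space, such overlaps are rare and their contributions behave like noise around the encoded value.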

### 2.4 Generalised vector semantic analysis

The RI method that is used in natural language processing is based on distributional vectors [30, 31, 42, 43]. Each term that appears in a text corpus is associated with a distributional vector, and each context or document is associated with a ternary index vector; see (1). Therefore, a distributional vector corresponds to the states of a one-dimensional RI array, \(\mathcal{N} = 1\), and the conventional index vectors correspond to the ternary index vectors of that array. The definition of the encoding operation (3) reduces to ordinary RI (1) in the case of one-way RI of vectors and constitutes a natural generalisation of RI to higher-order arrays.

The index vectors map each value of *k* to multiple values of \(\alpha _1\) and \(\beta _1\) for which there are nonzero contributions from the sum over states to the inner product. More specifically, the number of such values for \(\alpha _1\) and \(\beta _1\) is exactly \(\chi _1\) for each value of *k*, which implies that there are \(\chi _1\) nonzero “terms” in each of the sums over \(\alpha _1\) and \(\beta _1\). Therefore, the explicit evaluation of the inner product involves pseudorandom sampling of state indices \(\alpha _1\) and \(\beta _1\), which can be approximated with an explicit sum over all possible values of these state indices. As a result, the number of terms in the sums over *k* and \(\alpha _1\) (and \(\beta _1\)) is reduced from \(\chi _1 N_1\) to \(L_1\), which is a significant improvement. This is analogous to ordinary RI, where the distributional vectors, \(\mathbf {s}\) in (1), are compared for similarity directly, without prior decoding (inverse random projection) of word–context co-occurrence weights. Furthermore, the accuracy of the approximation can be improved by averaging the states selected by the constant indices {\(i_2, i_3\ldots i_\mathcal{N}\)} and {\(j_2, j_3\ldots j_\mathcal{N}\)}, resulting in the state-space approximation (12) for the inner product.

The approximation (12) is more efficient than (11) as a result of omitting the numerous projections from state space to decoded weight-vectors in the estimation of the inner product. Furthermore, the simulation experiments presented below show that the variance of the inner-product approximation error increases when replacing the expectation value operations in (12) by an explicit inner product in state space, but otherwise an explicit evaluation of the inner product is possible in principle. This is expected because the averaging operations reduce the influence of state-space noise, \(\epsilon _{\bar{\alpha }}\) in (6)–(8), on the approximate inner product. The computational cost of the expectation values in (12) is low thanks to the sparsity of index vectors, and the constant of proportionality depends on constant parameters only (\(\mathcal N\), \(L_\mathcal{D}\) and \(\chi _\mathcal{D}\)). Note that the expressions resulting from a different choice of constant indices, which represent the relationship to be compared, can be obtained by change of notation in the equations presented above. For example, instead of summing over \(\alpha _1=\beta _1\) in (12), it is possible to sum over \(\alpha _2=\beta _2\) and average over other indices in state space.
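
The idea of comparing vectors directly in state space, without decoding, can be illustrated for the one-way case. The sketch below is not the generalised approximation (12); it only shows the ordinary-RI analogue, with hypothetical function names and a normalisation by the number of nonzero trits that is assumed here because each index vector has squared norm \(\chi _1\).

```python
import random


def project(vector, index_vectors, length):
    """One-way RI: state = sum over k of vector[k] times index vector k."""
    state = [0] * length
    for value, index_vector in zip(vector, index_vectors):
        if value:
            for pos, sign in index_vector:
                state[pos] += value * sign
    return state


def state_inner_product(s_x, s_y, nonzeros):
    """Approximate inner product, evaluated directly in state space."""
    return sum(u * v for u, v in zip(s_x, s_y)) / nonzeros


rng = random.Random(0)
n_items, length, nonzeros = 500, 100, 8
index_vectors = [
    [(p, rng.choice((1, -1))) for p in rng.sample(range(length), nonzeros)]
    for _ in range(n_items)
]
a = [0] * n_items
b = [0] * n_items
for k in (3, 40, 77):
    a[k] = 10
for k in (3, 40, 300):
    b[k] = 10

exact = sum(x * y for x, y in zip(a, b))
approx = state_inner_product(project(a, index_vectors, length),
                             project(b, index_vectors, length), nonzeros)
print(exact, approx)  # the approximation fluctuates around the exact value
```

The sum runs over the 100 states rather than over the 500 array indices, and the deviation from the exact value comes from overlaps between index vectors.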

The generalised inner product approximation (12) and the encoding (3) and decoding (5) methods are available in the software implementation of generalised RI [47]. Next we present numerical results which demonstrate that the generalised methods that are introduced above are reasonable.

## 3 Simulation experiments

We study the generalised RI method presented above with numerical experiments. Ideally, analytical bounds should be derived for the error terms in (9) and the related approximation (12). However, the analysis is complicated because of the sparse ternary index vectors and the dependence on the structure of the data. Partial results are presented in “Appendix”, which may be useful for further development. The simulation experiments are conducted to verify that the proposed generalisation is feasible, and the results also demonstrate some characteristics of the method.

The approximation errors introduced when encoding and decoding array elements depend on some parameters, in particular the dimensionality of the array and the input data; the length of the index vectors, \(L_\mathcal{D}\); the number of nonzero trits in the index vectors, \(\chi _\mathcal{D}\); the dimension reduction, \(\Pi _\mathcal{D} N_\mathcal{D}\) : \(\Pi _\mathcal{D} L_\mathcal{D}\); and the characteristics of the data that is encoded.

### 3.1 Verification and comparison with PCA

The matrix has a diagonal band that is 50 elements wide. Therefore, nearby rows are similar (semantically related) vectors with high inner products compared to the inner products of more distant rows. A band matrix is used to simplify the graphical presentation and interpretation of the structure of the data and the reconstruction. However, because of the random projections involved in RI the particular structure of the data is not important. Similar RI results are expected for other data structures of comparable sparsity.

The middle panel of Fig. 1 illustrates the reconstructed matrix when 101 principal components are used, corresponding to a dimension reduction of about 25:1. This approximate representation of the band matrix is similar to the original, but the band on the main diagonal is wider and additional band-like structures are visible. The PCA is performed in MATLAB using double-precision floating-point numbers. The lower panel of Fig. 1 displays the matrix reconstructed using two-dimensional RI, \(\mathcal{N} = 2\), for \(\chi _\mathcal{D}=8\) and \(L_\mathcal{D}=964\), which also corresponds to a dimension reduction of about 25:1. The RI analysis is based on signed 16-bit states, which in practice means that the RI representation of the matrix is about four times smaller than the PCA representation. The RI approximation of the matrix is similar to the original band matrix in the sense that the structure of the band is preserved, but there are significant approximation errors also in this case, which appear as noise in the figure.

In the case of PCA, the inner products are calculated from the full reconstructed band matrix displayed in the middle panel of Fig. 1, and both the average and standard deviation (shaded area) of the inner product are displayed in Fig. 2. The inner products approximated with RI at a comparable dimension reduction are calculated with (12), which means that the inner products are calculated directly in state space, without reference to the reconstruction displayed in Fig. 1.

For comparison purposes, we calculate the inner products also from the reconstructed band matrix displayed in the lower panel of Fig. 1 and find that the average inner products and standard deviation are consistent with those displayed in Fig. 2. Furthermore, when omitting the state-space averaging operations in (12) we find that the standard deviation increases by a factor of more than two (data not shown), which confirms that the averaging operations reduce the influence of noise. These results motivate the approximation presented in (12), which reduces the computational cost and variance of RI-approximated inner products. These results are also in line with previous results showing that random projection preserves similarities of structures in the data [5, 20].

The error introduced by the RI approximation has a significantly higher standard deviation than the error introduced by PCA. However, PCA introduces a variable bias in the average inner product, which is not observed in the RI results. When increasing the size and sparseness of the band matrix, we find that the standard deviation of RI-approximated inner products decreases and that the bias of the average inner product increases in the case of PCA (data not shown).

### 3.2 Decoding error and comparison with ordinary RI

The approximation errors of generalised RI and ordinary RI of distributional vectors are expected to be different because higher-order approximations involve additional random projections, each contributing to the noise of the representation. Therefore, we compare generalised and ordinary RI with simulation experiments. We consider an experiment where a matrix is approximated using ordinary and generalised RI at different sparsity levels. Matrices can be represented using ordinary, one-way RI if each column or row is treated as a vector, which is the method used in natural language processing.

We select a generic approach where each column of the matrix represents a *class*, and each row represents a possible *feature* of the classes. Therefore, the columns are feature vectors that can be used to calculate the similarity of the classes, and the columns in state space are distributional vectors that encode the similarity of the classes. This interpretation and terminology are introduced to simplify the presentation. An integer sampled from the uniform distribution on [0, 10] is added to each element of the matrix, which simulates noise in the data that makes the matrix non-sparse. The non-sparse noise is introduced to make the experiment more challenging, and the choice of distribution is arbitrary since we have no particular application in mind. In addition to the noise, a relatively sparse set of high-value weights, \(w_{ij}=100\), is added to the matrix. The high-value weights simulate features of the classes, which we want to distinguish from the noise.

The number of features is selected to be proportional to the size of the matrix, \(N_\mathcal{D}\), and we define the constant of proportionality as \(\rho \). We vary the relative number of features, \(\rho \), from 0.1 to 10% of the size of the matrix, \(N_\mathcal{D}\). The array elements are decoded with (5) for each class, and the set of \(\rho N_\mathcal{D}\) array elements with the highest values are identified. If not all elements representing encoded features of that class are identified in that set, some features are not correctly identified after decoding. In the following we present the number of features that are correctly identified. Unless stated otherwise, we use \(\chi _\mathcal{D}=8\) in the simulation experiments.
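
The evaluation step described above, selecting the highest decoded values of a class and counting how many encoded features are among them, can be sketched as follows. The function name and the toy numbers are hypothetical and are not taken from the experiments.

```python
def count_recovered(decoded_column, true_features):
    """Count encoded features found among the top-k decoded elements.

    decoded_column: decoded values for one class, one per possible feature
    true_features: set of row indices where a high-value weight was encoded
    """
    k = len(true_features)
    ranked = sorted(range(len(decoded_column)),
                    key=lambda i: decoded_column[i], reverse=True)
    return len(set(ranked[:k]) & set(true_features))


# Toy decoded column: noise plus crosstalk around three encoded features.
decoded = [103, 4, 99, 12, 55, 8, 101, 3, 10, 60]
print(count_recovered(decoded, {0, 2, 6}))  # 3, all features identified
print(count_recovered(decoded, {0, 2, 4}))  # 2, one feature is outranked
```

Dividing the returned count by the number of encoded features gives the relative number of correctly decoded features.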

The average relative number of correctly decoded features is *independent* of dimensionality, provided that the dimensionality is reasonably high (\({\sim }10^3\) or higher because the variance explodes at low dimensionality). However, the *standard deviation* of the number of correctly decoded features decreases with increasing dimensionality. Therefore, if the dimension reduction, \(\Pi _\mathcal{D} N_\mathcal{D}\) : \(\Pi _\mathcal{D} L_\mathcal{D}\), is kept constant and the number of encoded features is proportional to the size of the matrix, the effect of increasing the size of the matrix, and therefore the dimensionality of index vectors, is a reduction in the uncertainty of the number of correctly decoded features. This scaling behaviour is illustrated numerically in Fig. 3.

The average of the relative number of correctly decoded features is practically independent of the size of the matrix and the dimensionality of the index vectors, but the corresponding standard deviation decreases with increasing dimensionality of the index vectors. Note that the relative number of correctly decoded features first decreases with an increasing number of encoded features, as expected, and that it increases slightly for \({\gtrsim }8\)% features in the case of two-way RI. This effect is caused by the increasing probability of correctly identifying features by chance when the relative number of features increases. In the case of the ordinary one-way RI method, the standard deviation has a maximum at approximately 0.7–0.9% features.

#### 3.2.1 Effect of dimension reduction

#### 3.2.2 Effect of sparseness of the index vectors

### 3.3 Natural language processing example

Next, we apply the generalised RI method to a basic natural language processing task. In statistical models of natural language, it is common to construct a large co-occurrence matrix [52]. For example, this method is used in the Hyperspace Analogue to Language (HAL) [36, 37] and in Latent Semantic Analysis (LSA) [35], which are two pioneering models in the field. In practical applications, the number of words can be hundreds of thousands. The number of documents or contexts is also high, otherwise the statistical basis will be insufficient for the analysis. Therefore, word co-occurrence matrices tend to be large objects. The simple example considered below uses more than 5 billion matrix elements that represent word–context relationships. Fortunately, co-occurrence matrices can be approximated to make the semantic analysis less computationally costly. It was first demonstrated by Kanerva, Kristoferson and Holst [31] that one-way RI can be used to effectively encode co-occurrence matrices for semantic analysis; see Sahlgren [42, 43] and [30] for further details and developments.

The definition of “context” is model specific, but it typically involves a set of neighbouring words or one document. In HAL, the context is defined by a number of words that immediately surround a given word, whereas in LSA, the context is defined as the document where the word exists. Linguistically, the former relation can be described as a paradigmatic (semantic) relation, whereas the latter can be characterised as an associative (topical) relation. In the traditional RI algorithm, each word type that appears in the data is associated with a distributional vector, and each context is associated with a ternary index vector; see (1). If the context is defined in terms of the neighbouring words of a given word, which is the method that we use here, the distributional vectors are created by adding the index vectors (which can be weighted differently) of the nearest preceding and succeeding words every time a word occurs in the data [32]. If the context is defined as the document where the word exists, the distributional vectors are created by adding the index vectors of all of the documents where a word occurs, weighted with the frequency of the word in each document. In either case, a distributional vector is the sum of the weighted index vectors of all contexts where that word occurs.

The RI algorithm has been evaluated using various types of vocabulary tests, such as the synonymy part of the “Test of English as a Foreign Language” (TOEFL) [31, 43]. In the following, we reconsider the synonym identification task presented by Kanerva, Kristoferson and Holst [31] with three changes. First, we want to compare the one-way and two-way RI methods. Therefore, we encode the co-occurrence matrix using both one-way and two-way RI. Second, while Kanerva, Kristoferson and Holst [31] used the LSA definition of context, we use a strategy similar to that in HAL and define the context as a window that spans \({\pm } 2\) words away from the word itself. This method implies that for each occurrence of a word, there will be four additional word–word relationships encoded in the co-occurrence matrix. This strategy avoids the potential difficulty of defining document boundaries in streaming text, and it captures semantic relations between words rather than topical relations. The length of the context window is a parameter that affects the quantitative results presented here, but it is not essential for our qualitative discussion. The third difference compared with the study by Kanerva, Kristoferson and Holst [31] is that we do not introduce cut-offs on term frequencies to further improve the result. Words such as “the”, “at” and “be” have high frequencies that render the occurrences of more interesting combinations less significant. We note that this effect is stronger for the two-way RI method than for the one-way RI method. We include the complete word–context spectrum, including the high-frequency relationships, and we present results for two different transformations of the spectrum. In one case, we directly encode the unaltered frequencies, and in the other case, we encode the square root of the frequencies. 
The square root decreases the relative significance of high frequencies, which improves the result and illustrates the importance of the definition of the weight in the feature extraction method.
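
The context definition and the two weightings can be made concrete with a small sketch that counts word–word co-occurrences in a \({\pm } 2\) window and then takes the square root of the frequencies. The function name and the example sentence are hypothetical; in a streaming setting each co-occurrence would be encoded directly instead of being collected in an explicit table.

```python
from collections import Counter
from math import sqrt


def window_cooccurrences(tokens, window=2):
    """Count (word, context word) pairs within +/- window positions."""
    counts = Counter()
    for i, focus in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                counts[(focus, tokens[j])] += 1
    return counts


tokens = "the cat sat on the mat".split()
counts = window_cooccurrences(tokens)
sqrt_weights = {pair: sqrt(c) for pair, c in counts.items()}

print(counts[("the", "sat")])        # 2: "sat" is near both occurrences of "the"
print(sqrt_weights[("the", "sat")])  # damped weight used in the second variant
```

Away from the ends of the text, every word occurrence contributes four relationships, and the square root compresses the range between frequent and rare pairs.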

We construct the co-occurrence matrix from 37,620 short high-school level articles in the TASA (Touchstone Applied Science Associates, Inc.) corpus. The text has been morphologically normalised so that each word appears in its base form [32]. The text contains 74,183 word types that are encoded in a co-occurrence matrix with one-way and two-way RI. In the case of one-way RI, we use index vectors of length 1000, so that the dimension reduction is \(74{,}183\times 74{,}183~:~1000\times 74{,}183 \rightarrow 74~:~1\). In the case of two-way RI, we use a state array of size \(1000\times 74{,}183\), thereby maintaining the same dimension reduction ratio. We repeat the two-way RI calculations using a square state array of size \(8612 \times 8612\). There are numerous misspellings (low-frequency words) in the corpus, and the most frequent word is “the”, which occurs nearly 740,000 times. The second most frequent word is “be”, with just over 420,000 occurrences. Therefore, we define 32-bit states in the implementation of RI.
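
The arithmetic behind these choices can be checked with a few lines of Python. The variable names are illustrative, and the reading that 32-bit states are needed because the largest frequencies exceed the signed 16-bit range is an interpretation of the text above.

```python
n_words = 74_183

full_matrix = n_words * n_words   # elements of the co-occurrence matrix
one_way_states = 1000 * n_words   # index vectors of length 1000
square_states = 8612 * 8612       # square two-way state array

print(full_matrix)                   # 5503117489, more than 5 billion
print(full_matrix / one_way_states)  # 74.183
print(full_matrix / square_states)   # about 74.2

# The largest frequencies overflow signed 16-bit states but fit in 32 bits.
print(740_000 > 2**15 - 1, 740_000 < 2**31 - 1)  # True True
```

Both state-array shapes therefore give a dimension reduction of about 74:1.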

Example of a TOEFL synonym test

| Word | Number of occurrences |
|---|---|
| Essential (given) | 855 |
| Basic | 1920 |
| Ordinary | 837 |
| Eager | 480 |
| Possible | 3348 |

The task of identifying the correct synonym is addressed using the RI-encoded co-occurrence matrices and the vector semantic comparison method described in Sect. 2.4. We consider 80 synonym tests, each comprising five words. Using ordinary RI, 38 out of the 80 synonym tests are solved correctly, meaning that the cosine of the angle between the given word and the correct synonym is the maximum among the alternatives. Repeating the experiment with the square root of frequencies and ordinary RI, 43 out of the 80 synonym tests are solved correctly. Using two-way RI and a square state array of size \(8612 \times 8612\), only 24 out of 80 synonym tests are solved correctly. Repeating the experiment with the square root of frequencies and two-way RI gives a similar result: 24 out of the 80 synonym tests are solved correctly. However, repeating the two-way RI experiment with the square root of frequencies and a state array of size \(1000\times 74{,}183\), we obtain 34 correct results out of 80.
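The decision rule used in each synonym test, choosing the alternative with the largest cosine to the given word, can be sketched as follows. The function names and the three-dimensional toy vectors are invented for illustration; they are not vectors or results from the experiments, where the vectors are obtained from the RI-encoded co-occurrence matrices.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_synonym_test(given, alternatives, vectors):
    """Return the alternative whose vector has maximum cosine to `given`."""
    return max(alternatives, key=lambda w: cosine(vectors[given], vectors[w]))


# Hypothetical toy vectors, made up for this example only.
toy_vectors = {
    "essential": [4.0, 1.0, 0.0],
    "basic": [3.0, 1.0, 0.5],
    "ordinary": [0.0, 2.0, 3.0],
    "eager": [1.0, 0.0, 4.0],
    "possible": [1.0, 3.0, 1.0],
}
answer = solve_synonym_test(
    "essential", ["basic", "ordinary", "eager", "possible"], toy_vectors
)
```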

These results can be further improved using other preprocessing methods, for example, by introducing weighted context windows and cut-offs on the encoded relationship frequencies [32], or by defining the weights as the logarithm of frequencies divided by the conditional entropy of the context given the word [35]. Furthermore, in order to enable numerical simulations on a standard PC we only consider higher-order RI of one distributed representation, while there is one distributed representation for each class/term in the case of one-way RI. This limitation can be avoided in large-scale applications of RI at data centres, possibly leading to more favourable results. One benefit of the two-way RI method is that words can be defined on the fly with a minimum impact on the storage space needed. This property is interesting for the analysis of streaming data, where many occasional features may exist in the data that are not relevant for the long-term analysis.

## 4 Conclusions

Random indexing is a form of random projection with particularly low computational complexity, thanks to the high sparsity of the index vectors and the straightforward distributed coding of information. RI has numerous applications and has proven useful for solving challenging problems without introducing much complexity.

Here we generalise ordinary RI (1) of distributional vectors [31, 42] to RI of distributional arrays of arbitrary order, and we present results of simulation experiments with one- and two-way RI. The software implementation [47] of the key equations (3), (5) and (12) supports *N*-way RI. Ordinary RI is used in numerous applications in natural language processing, where it is useful to approximate data in a compressed representation that can be updated incrementally and online at low computational cost and complexity. Furthermore, the compressed data structure can be used for semantic analysis by approximating inner products of distributional arrays at low computational cost using (12), without prior decoding of data. These properties make RI interesting for the analysis of streaming data, in particular when explicit storage of data is infeasible. Incremental processing of streaming data is not explicitly investigated in the simulation experiments presented in this work, but the data sets considered are encoded in an incremental fashion, one item after the other. The low computational complexity and incremental character of the encoding algorithm (3) make applications to streaming data straightforward. Furthermore, the length of the random indices can be extended, which dynamically extends the number of properties that can be compressed in the state array; this also makes RI interesting for the analysis of streaming data. The generalisation of RI from distributional vectors to distributional arrays opens up the analysis of higher-order relationships, for example context- or time-dependent associations of terms in streaming text (third order), or the context- and time-dependent associations between terms (fourth order).

Our simulation results confirm the expectation that the approximation error is lower for ordinary one-way RI compared to two-way RI at constant size of the distributed representation. This is expected because each random index of an array is associated with additional random projections, which adds to the state-space noise. The benefit of two-way RI is that multiple classes of features can be encoded in a distributed representation of constant size and that new features can be defined on the fly with low impact on the storage space required. This property is interesting for the analysis of higher-order relationships between features derived from streaming data, where the number of potential features and relationships can be astronomical.
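The following minimal Python sketch illustrates how two-way RI can encode and decode a weight in a state array of fixed size. It assumes sparse ternary index vectors with an equal number of \(+1\) and \(-1\) entries and a decoding step that normalises by the product of the numbers of nonzero entries. The function names and the toy dimensions are hypothetical, and the sketch should not be read as the reference implementation of the key equations.

```python
import random


def make_index_vector(dim, chi, rng):
    """Sparse ternary index vector as (position, sign) pairs; chi must be even."""
    positions = rng.sample(range(dim), chi)
    signs = [1] * (chi // 2) + [-1] * (chi // 2)
    return list(zip(positions, signs))


def encode(state, r_row, r_col, weight):
    """Add `weight` to the state array via the outer product of two index vectors."""
    for a, sign_a in r_row:
        for b, sign_b in r_col:
            state[a][b] += weight * sign_a * sign_b


def decode(state, r_row, r_col):
    """Approximate the encoded weight by projecting back and normalising."""
    total = sum(
        state[a][b] * sign_a * sign_b
        for a, sign_a in r_row
        for b, sign_b in r_col
    )
    return total / (len(r_row) * len(r_col))


rng = random.Random(0)
dim, chi = 50, 4  # toy sizes; the experiments use much larger arrays
state = [[0.0] * dim for _ in range(dim)]
rows = {w: make_index_vector(dim, chi, rng) for w in ("cat", "dog")}
cols = {c: make_index_vector(dim, chi, rng) for c in ("sat", "ran")}
encode(state, rows["cat"], cols["sat"], 3.0)
encode(state, rows["dog"], cols["ran"], 5.0)
print(decode(state, rows["cat"], cols["sat"]))  # close to 3.0, up to crosstalk
```

In this sketch, defining a new row or column feature only requires a new index vector; the state array keeps its size, at the cost of additional crosstalk between the encoded weights.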

Higher-order RI can be applied to multiple distributional arrays, just like ordinary RI (1) typically is applied to a set of distributional vectors, \(\mathbf {s}_i\). For example, in ordinary RI of co-occurrence matrices each column of the co-occurrence matrix is represented in a distributional vector, and each row of the co-occurrence matrix refers to an index vector that defines a random projection on the distributional vectors. This way the effect of state-space noise on the accuracy of term representations is minimised because each term is associated with a unique distributional vector. However, in this case the size of the distributed representation is proportional to the number of terms, which limits the scalability of the approach. Depending on the application requirements this principle can also be applied to higher-order RI when balancing between reasonable approximation errors, data storage demands and computational cost.

From a technical point of view we note that RI requires index vectors of high dimensionality (typically \(n>10^3\)); otherwise the variance related to the approximate random projections explodes and renders the approach practically useless. This tendency is well described by Kanerva [30] in his paper about computation in “hyperdimensional” spaces, and we have observed a similar effect in earlier work on distributional models based on high-dimensional random projections [15, 16]. For high-dimensional representations we find that the variances of approximated inner products and decoded weights decrease with increasing dimensionality and that the expectation values are practically invariant with respect to dimensionality. Furthermore, we find that the number of nonzero trits in the index vectors, \(\chi _\mathcal{D}\), has an effect on the accuracy of RI. The accuracy increases notably when increasing \(\chi _\mathcal{D}\) from two to four, but not much beyond \(\chi _\mathcal{D}=8\). Therefore, our simulation experiments suggest that \(\chi _\mathcal{D}=8\) is the preferred choice for this hyperparameter since the computational cost of the encoding and semantic analysis algorithms increases with \(\chi _\mathcal{D}\).
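The dependence on dimensionality can be illustrated with a small simulation of the crosstalk between two independent random index vectors. This is a simplified proxy under stated assumptions, not a reproduction of the experiments: index vectors are modelled as sparse ternary vectors with balanced signs, the inner product is normalised by the number of nonzero entries, and all function names are hypothetical.

```python
import random
import statistics


def sparse_ternary(dim, chi, rng):
    """Sparse ternary vector as {position: sign}, half +1 and half -1."""
    positions = rng.sample(range(dim), chi)
    return {p: (1 if k < chi // 2 else -1) for k, p in enumerate(positions)}


def dot(u, v):
    """Inner product of two sparse vectors stored as dictionaries."""
    return sum(sign * v[p] for p, sign in u.items() if p in v)


def crosstalk_variance(dim, chi, trials, rng):
    """Variance of the normalised inner product of independent index vectors."""
    samples = [
        dot(sparse_ternary(dim, chi, rng), sparse_ternary(dim, chi, rng)) / chi
        for _ in range(trials)
    ]
    return statistics.pvariance(samples)


rng = random.Random(1)
for dim in (100, 1000, 10000):
    # The variance is expected to shrink as the dimensionality grows.
    print(dim, crosstalk_variance(dim, chi=8, trials=2000, rng=rng))
```

The inner product of an index vector with itself equals the number of nonzero entries, so the normalised self-similarity is one, while the normalised crosstalk between different index vectors concentrates around zero as the dimensionality increases.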

In summary, RI is an incremental dimension reduction method that is computationally lightweight and well suited for online processing of streaming data that are not feasible to analyse with other, more accurate and complex methods [50]. For example, standard co-occurrence matrices in natural language processing applications can be extended with temporal information [26], linguistic relations [3, 53] and structural information in distributed representations [9, 59]. There have been few attempts at extending traditional matrix-based natural language processing methods to higher-order arrays because of the high computational cost involved. Multidimensional RI is likely to facilitate such extensions.

## Acknowledgements

The TASA and TOEFL items have been kindly provided by Professor Thomas Landauer, University of Colorado. We thank Pentti Kanerva for comments on an early draft of the manuscript [48] and anonymous referees for useful comments and suggestions that helped us improve the clarity of the manuscript. The possibility to extend RI to two dimensions is mentioned in [30] at p. 153. This work is partially funded by the Kempe Foundations.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.