Random indexing of multidimensional data
Abstract
Random indexing (RI) is a lightweight dimension reduction method, which is used, for example, to approximate vector semantic relationships in online natural language processing systems. Here we generalise RI to multidimensional arrays and therefore enable approximation of higherorder statistical relationships in data. The generalised method is a sparse implementation of random projections, which is the theoretical basis also for ordinary RI and other randomisation approaches to dimensionality reduction and data representation. We present numerical experiments which demonstrate that a multidimensional generalisation of RI is feasible, including comparisons with ordinary RI and principal component analysis. The RI method is well suited for online processing of data streams because relationship weights can be updated incrementally in a fixedsize distributed representation, and inner products can be approximated on the fly at low computational cost. An open source implementation of generalised RI is provided.
Keywords
Data mining Random embeddings Dimensionality reduction Sparse coding Semantic similarity Streaming algorithm Natural language processing1 Introduction
There is a rapid increase in the annual amount of data that is produced in almost all domains of science, industry, economy, medicine and even everyday life. We have surpassed a critical point where more data are generated than we can physically store. Choosing which data to archive and process and which to discard is necessary in dataintensive applications. That trend motivates the development of new methods for data representation and analysis [2, 23, 49].
One interesting approach to analyse large data sets is to search for outstanding relationships between “features” in the data. Problems of that type naturally appear in the form of context or timedependent relationships. For example, the cooccurrence of words in articles, blogs and so on is one type of relationship that carries information about language use and evolution over time [46]. Similarly, cooccurrence analysis can be used to investigate the general opinion about things, for example public events or politicians, and how the opinion changes over time [51]. The analysis requires averaging over many instances of relationships in order to identify frequent or otherwise significant patterns in a noiselike background. That is a nontrivial problem because the number of possible relationships between elements of the sets \(A_i\) scales like \(\mathcal{O}(\Pi _i A_i)\), where \(A_i\) denotes the cardinality of the set \(A_i\). In the example of online text analysis \(A\sim 10^5\), which implies \(\sim \!\!10^{10}\) cooccurrence weights that evolve over time and typically depend on additional context variables of interest. Therefore, the number of relationship weights that need to be stored and updated in such applications can be astronomical, and the analysis prohibitive given the large size of the data representation.
This is the motivation of random indexing (RI) [31], which is a randomprojection method that solves such problems by incrementally generating distributional representations that approximate similarities in sets of cooccurrence weights. For example, in the context of natural language processing the RI method is used to compress large word–document or word–context cooccurrence matrices [52]. This is done by associating each document or context with a sparse random ternary vector of high dimensionality [29, 30], a socalled index vector. Each word is also represented by a highdimensional vector of integers, a socalled distributional vector. These distributional vectors are initially set to zero, and for each appearance of a particular word in a context, the index vector of that context is added to the distributional vector of the word. The result of this incremental process is that words that appear in similar contexts get similar distributional vectors, indicating that they are semantically related [44]. Therefore, the analysis of semantic similarity can be performed by comparing the compressed distributional vectors in terms of inner products, instead of analysing and storing the full cooccurrence matrix. The distributional vectors can be updated on the fly in streaming applications by adding the appropriate sparse index vectors and cooccurrence weights to the distributional vectors. See Sahlgren [42, 43] for further details.
LSA [14] and HAL [36] are two other prominent examples of vectorspace models [52] used for semantic analysis of text. In these methods, a cooccurrence matrix is explicitly constructed, and then singular value decomposition (SVD) is used to identify the semantic relationships between terms (see [8] for recent examples). This process requires significant storage space for the full cooccurrence matrix, and it is a computationally costly method. The SVD can be calculated using parallel and iterative methods optimised for sparse matrices [4], but the computational cost still prevents the processing of large and streaming data sets [11]. In contrast, RI easily scales to large corpora such as the MEDLINE collection of approximately 9 million abstracts [10]. Another approach known as localitysensitive hashing (LSH) [7] is compared with RI on a distributional similarity task by Gorman and Curran [21], showing that RI outperforms LSH in terms of efficiency and accuracy when the problem size increases. RI requires a fraction of the memory and processing power of LSA and HAL [11], but is comparable with models based on SVD in terms of accuracy. For example, the accuracy of RI is comparable to SVDbased methods in a TOEFL synonym identification task [31], and that result has been further improved in the case of RI [45]. RI of cooccurrence matrices for semantic analysis works surprisingly well [11, 30, 43, 52], and the method has been adopted in other applications, such as indexing of literature databases [54], event detection in blogs [26], web user clustering and page prefetching [57], graph searching for the semantic web [12], diagnosis code assignment to patients [24], predicting speculation in biomedical literature [55] and failure prediction [19]. In general, there is an increasing interest for randomisation in information processing because it enables the use of simple algorithms, which can be organised to exploit parallel computation in an efficient way [6, 22].
The practical usefulness of RI is also demonstrated by several implementations in public software packages such as the SSpace Package [27] and the Semantic Vectors Package [58], and extensions of the basic method to new domains and problems [26, 54]. Therefore, it is natural to ask whether the RI algorithm can be generalised to higherorder relationships and distributional arrays.
In the next section we generalise RI of vectors to RI of multidimensional data in the form of matrices and higherorder arrays. Subsequently, we present results of simulation experiments of ordinary and generalised RI demonstrating some properties of the generalised method, including a comparison with principal component analysis (PCA). PCA and similar approximation methods for higherorder arrays such as Tucker decomposition [34] are expected to result in higher signaltonoise ratio (SNR) than RI when applicable because the dimension reduction is optimised to minimise the residual variance. However, such methods are more complex and target another application domain. We conclude that the possibility to incrementally encode and analyse general cooccurrence relationships at low computational cost using a distributed representation of approximately fixed size makes generalised RI interesting for online processing of data streams.
2 Method
In the following we refer to RI of vectors as oneway RI and generalise oneway RI to nway RI of arrays \(a_{i_1, i_2, i_3, \ldots , i_\mathcal{N}}\) of arbitrary order, meaning that there are n sets of index vectors associated with each dimension of the array. We focus on the core RI mechanism and omit extensions like the wordorderdependent permutation introduced above in order to make the presentation more accessible. Array elements are denoted with \(a_{i_1, i_2, i_3, \ldots , i_\mathcal{N}}\), or \(a_{\bar{i}}\) for short, and the indices \(\{i_1, i_2, i_3,\ldots , i_\mathcal{N}\}\) are used in array element space. The array elements are encoded in a distributed fashion in states that are denoted with \(s_{\alpha _1, \alpha _2, \alpha _3, \ldots , \alpha _\mathcal{N}}\), or \(s_{\bar{\alpha }}\) for short. The indices \(\{\alpha _1, \alpha _2, \alpha _3, \ldots , \alpha _\mathcal{N}\}\) are used in state space. We use the notation \(i_\mathcal{D}\) when referring to indices of the array space and \(\alpha _\mathcal{D}\) when referring to indices of the state space, where \(\mathcal{D}\) is the dimension index. For vectors \(\mathcal{D}=1\), for matrices \(\mathcal{D} \in \{1,2\}\) and in general \(\mathcal{D} \in [1,\mathcal{N}]\). When necessary we use one additional index, \(j_\mathcal{D}\), in array element space. Similarly, one additional statespace index, \(\beta _\mathcal{D}\), is used when necessary.
States have physical representations that are stored in memory, but they are only accessed using particular decoder and encoder functions (introduced below) which generalise (1) for vectors to arrays of arbitrary order. The array elements are related to the states by a random projection [56] mechanism and constitute the input to the encoder function and the output from the decoder function, respectively. The order of the state array, \(\mathcal N\), is equivalent to that of the array. The core idea is that the state array can be of significantly smaller size than the array itself and that approximate vector semantic analysis can be performed in state space at low computational cost. Similarly, the set of distributional vectors, \(\mathbf {s}\), in (1) have few elements compared to the full cooccurrence matrix. This possibility follows from the Johnson–Lindenstrauss lemma [25], which describes the mapping of points in a highdimensional space to a space of lower dimension so that the distances between points are approximately preserved. Further developments of the Johnson–Lindenstrauss lemma and related applications can be found, for example, in [1, 13, 18, 28, 38].
2.1 Random indexing
Summary of parameters
Expression  Description 

\(a_{i_1, i_2, i_3, \ldots , i_\mathcal{N}}\), \(a_{\bar{i}}\)  Array elements 
\(s_{\alpha _1, \alpha _2, \alpha _3, \ldots , \alpha _\mathcal{N}}\), \(s_{\bar{\alpha }}\)  State array, accessed by encoder/decoder functions 
\(\mathcal{N}\)  Dimensionality of array 
\(\mathcal{D}\)  Dimension index, \(1 \le \mathcal{D} \le \mathcal{N}\) 
\(N_\mathcal{D}\)  Number of index vectors in dimension \(\mathcal{D}\), \(i_\mathcal{D} \in [1,N_\mathcal{D}]\) 
\(L_\mathcal{D}\)  Length of index vectors in dimension \(\mathcal{D}\), \(\alpha _\mathcal{D}\in [1,L_\mathcal{D}]\) 
\(\chi _\mathcal{D}\)  Number of nonzero trits in index vectors of dimension \(\mathcal{D}\) 
\(S_e=\prod _\mathcal{D}\chi _\mathcal{D}\)  Number of states that encode one array element 
\(S_s\propto \prod _\mathcal{D} L_\mathcal{D}\)  Disk/memory space required to store the state array 
\(S_r\propto \sum _\mathcal{D} N_\mathcal{D}\chi _\mathcal{D}\)  Disk/memory space required to store index vectors 
The notation and definitions introduced above are a direct generalisation of ordinary RI to arrays of arbitrary order. In particular, ordinary RI is defined by \(\mathcal{N}=1\). Note that the states defined here correspond to the elements of the distributional vectors in ordinary RI, which are the hard storage locations where the distributional statistics are stored. Next we present the corresponding generalised encoding algorithm and generalised method for vector semantic analysis in terms of inner products.
2.2 Encoding algorithm
Subtraction of \(w_{\bar{i}}\) is defined by the replacement \(w_{\bar{i}} \rightarrow w_{\bar{i}}\) in (3). Assignment of array elements is not defined because of the distributional nature of the representation of array elements.
2.3 Decoding algorithm
The encoding procedure (3) is a sequence of outer products of indifferent index vectors, and the decoding procedure is the corresponding sequence of inner products. It follows from (3) and (5) that the decoded value is an exact reconstruction of the accumulated encoded weight if all index vectors are orthogonal. However, that process would be useless in the context considered here because no dimension reduction is achieved in that case. For index vectors of length \(L_\mathcal{D}\), at most \(L_\mathcal{D}\) linearly independent vectors can be constructed (a set of basis vectors). For high values of \(L_\mathcal{D}\) there are many more vectors that are approximately orthogonal; see “Appendix” for details, which makes it possible to encode and decode approximate array elements in a small state space provided that the data are sufficiently sparse (see Sect. 3 for details).
2.4 Generalised vector semantic analysis
The RI method that is used in natural language processing is based on distributional vectors [30, 31, 42, 43]. Each term that appears in a text corpus is associated with a distributional vector, and each context or document is associated with a ternary index vector; see (1). Therefore, a distributional vector corresponds to the states of a onedimensional RI array, \(\mathcal{N} = 1\), and the conventional index vectors correspond to the ternary index vectors of that array. The definition of the encoding operation (3) reduces to ordinary RI (1) in the case of oneway RI of vectors and constitutes a natural generalisation of RI to higherorder arrays.
The approximation (12) is more efficient than (11) as a result of omitting the numerous projections from state space to decoded weightvectors in the estimation of the inner product. Furthermore, the simulation experiments presented below show that the variance of the innerproduct approximation error increases when replacing the expectation value operations in (12) by an explicit inner product in state space, but otherwise an explicit evaluation of the inner product is possible in principle. This is expected because the averaging operations reduce the influence of statespace noise, \(\epsilon _{\bar{\alpha }}\) in (678), on the approximate inner product. The computational cost of the expectation values in (12) is low thanks to the sparsity of index vectors, and the constant of proportionality depends on constant parameters only (\(\mathcal N\), \(L_\mathcal{D}\) and \(\chi _\mathcal{D}\)). Note that the expressions resulting from a different choice of constant indices, which represent the relationship to be compared, can be obtained by change of notation in the equations presented above. For example, instead of summing over \(\alpha _1=\beta _1\) in (12), it is possible to sum over \(\alpha _2=\beta _2\) and average over other indices in state space.
The generalised inner product approximation (12) and the encoding (3) and decoding (5) methods are available in the software implementation of generalised RI [47]. Next we present numerical results which demonstrate that the generalised methods that are introduced above are reasonable.
3 Simulation experiments
We study the generalised RI method presented above with numerical experiments. Ideally, analytical bounds should be derived for the error terms in (9) and the related approximation (12). However, the analysis is complicated because of the sparse ternary index vectors and the dependence on the structure of the data. Partial results are presented in “Appendix”, which may be useful for further development. The simulation experiments are conducted to verify that the proposed generalisation is feasible, and the results also demonstrate some characteristics of the method.
The approximation errors introduced when encoding and decoding array elements depend on some parameters, in particular the dimensionality of the array and the input data; the length of the index vectors, \(L_\mathcal{D}\); the number of nonzero trits in the index vectors, \(\chi _\mathcal{D}\); the dimension reduction, \(\Pi _\mathcal{D} N_\mathcal{D}\) : \(\Pi _\mathcal{D} L_\mathcal{D}\); and the characteristics of the data that is encoded.
3.1 Verification and comparison with PCA
The matrix has a diagonal band that is 50 elements wide. Therefore, nearby rows are similar (semantically related) vectors with high inner products compared to the inner products of more distant rows. A band matrix is used to simplify the graphical presentation and interpretation of the structure of the data and the reconstruction. However, because of the random projections involved in RI the particular structure of the data is not important. Similar RI results are expected for other data structures of comparable sparsity.
The middle panel of Fig. 1 illustrates the reconstructed matrix when 101 principal components are used, corresponding to a dimension reduction of about 25:1. This approximate representation of the band matrix is similar to the original, but the band on the main diagonal is more wide and additional bandlike structures are visible. The PCA is performed with MATLAB with double precision floating point numbers. The lower panel of Fig. 1 displays the matrix reconstructed using twodimensional RI, \(\mathcal{N} = 2\), for \(\chi _\mathcal{D}=8\) and \(L_\mathcal{D}=964\), which also corresponds to a dimension reduction of about 25:1. The RI analysis is based on signed 16bit states, which in practice means that the RI representation of the matrix is about four times smaller than the PCA representation. The RI approximation of the matrix is similar to the original band matrix in the sense that the structure of the band is preserved, but there are significant approximation errors also in this case, which appears like noise in the figure.
In the case of PCA, the inner products are calculated from the full reconstructed band matrix displayed in the middle panel of Fig. 1, and both the average and standard deviation (shaded area) of the inner product are displayed in Fig. 2. The inner products approximated with RI at a comparable dimension reduction are calculated with (12), which means that the inner products are calculated directly in state space, without reference to the reconstruction displayed in Fig. 1.
For comparison purposes, we calculate the inner products also from the reconstructed band matrix displayed in the lower panel of Fig. 1 and find that the average inner products and standard deviation are consistent with those displayed in Fig. 2. Furthermore, when omitting the statespace averaging operations in (12) we find that the standard deviation increases by a factor of more than two (data not shown), which confirms that the averaging operations reduce the influence of noise. These results motivate the approximation presented in (12), which reduces the computational cost and variance of RIapproximated inner products. These results are also in line with previous results showing that random projection preserves similarities of structures in the data [5, 20].
The error introduced by the RI approximation has a significantly higher standard deviation than the error introduced by PCA. However, PCA introduces a variable bias in the average inner product, which is not observed in the RI results. When increasing the size and sparseness of the band matrix, we find that the standard deviation of RIapproximated inner products decreases and that the bias of the average inner product increases in the case of PCA (data not shown).
3.2 Decoding error and comparison with ordinary RI
The approximation errors of generalised RI and ordinary RI of distributional vectors are expected to be different because higherorder approximations involve additional random projections, each contributing to the noise of the representation. Therefore, we compare generalised and ordinary RI with simulation experiments. We consider an experiment where a matrix is approximated using ordinary and generalised RI at different sparsity levels. Matrices can be represented using ordinary, oneway RI if each column or row is treated as a vector, which is the method used in natural language processing.
We select a generic approach where each column of the matrix represents a class, and each row represents a possible feature of the classes. Therefore, the columns are feature vectors that can be used to calculate the similarity of the classes, and the columns in state space are distributional vectors that encode the similarity of the classes. This interpretation and terminology is introduced to simplify the presentation. An integer sampled from the flat distribution [0, 10] is added to each element of the matrix, which simulates noise in the data that makes the matrix nonsparse. The nonsparse noise is introduced to make the experiment more challenging, and the choice of distribution is arbitrary since we have no particular application in mind. In addition to the noise, a relatively sparse set of highvalue weights, \(w_{ij}=100\), are added to the matrix. The highvalue weights simulate features of the classes, which we want to distinguish from the noise.
The number of features is selected to be proportional to the size of the matrix, \(N_\mathcal{D}\), and we define the constant of proportionality as \(\rho \). We vary the relative number of features, \(\rho \), from 0.1 to 10% of the size of the matrix, \(N_\mathcal{D}\). The array elements are decoded with (5) for each class, and the set of \(\rho N_\mathcal{D}\) array elements with the highest values are identified. If not all elements representing encoded features of that class are identified in that set, some features are not correctly identified after decoding. In the following we present the number of features that are correctly identified. Unless stated otherwise, we use \(\chi _\mathcal{D}=8\) in the simulation experiments.
The average of the relative number of correctly decoded features is practically independent of the size of the matrix and the dimensionality of the index vectors, but the corresponding standard deviation decreases with increasing dimensionality of the index vectors. Note that the relative number of correctly decoded features first decreases with an increasing number of encoded features, as expected, and that it increases slightly for \({\gtrsim }8\)% features in the case of twoway RI. This effect is caused by the increasing probability of correctly identifying features by chance when the relative number of features increases. In the case of the ordinary oneway RI method, the standard deviation has a maximum at approximately 0.7–0.9% features.
3.2.1 Effect of dimension reduction
3.2.2 Effect of sparseness of the index vectors
3.3 Natural language processing example
Next, we apply the generalised RI method to a basic natural language processing task. In statistical models of natural language, it is common to construct a large cooccurrence matrix [52]. For example, this method is used in the Hyperspace Analogue to Language (HAL) [36, 37] and in Latent Semantic Analysis (LSA) [35], which are two pioneering models in the field. In practical applications, the number of words can be hundreds of thousands. The number of documents or contexts is also high, otherwise the statistical basis will be insufficient for the analysis. Therefore, word cooccurrence matrices tend to be large objects. The simple example considered below uses more than 5 billion matrix elements that represent word–context relationships. Fortunately, cooccurrence matrices can be approximated to make the semantic analysis less computationally costly. It was first demonstrated by Kanerva, Kristoferson and Holst [31] that oneway RI can be used to effectively encode cooccurrence matrices for semantic analysis; see Sahlgren [42, 43] and [30] for further details and developments.
The definition of “context” is model specific, but it typically involves a set of neighbouring words or one document. In HAL, the context is defined by a number of words that immediately surround a given word, whereas in LSA, the context is defined as the document where the word exists. Linguistically, the former relation can be described as a paradigmatic (semantic) relation, whereas the latter can be characterised as an associative (topical) relation. In the traditional RI algorithm, each word type that appears in the data is associated with a distributional vector, and each context is associated with a ternary index vector; see (1). If the context is defined in terms of the neighbouring words of a given word, which is the method that we use here, the distributional vectors are created by adding the index vectors (which can be weighted differently) of the nearest preceding and succeeding words every time a word occurs in the data [32]. If the context is defined as the document where the word exists, the distributional vectors are created by adding the index vectors of all of the documents where a word occurs, weighted with the frequency of the word in each document. In either case, a distributional vector is the sum of the weighted index vectors of all contexts where that word occurs.
The RI algorithm has been evaluated using various types of vocabulary tests, such as the synonymy part of the “Test of English as a Foreign Language” (TOEFL) [31, 43]. In the following, we reconsider the synonym identification task presented by Kanerva, Kristoferson and Holst [31] with three changes. First, we want to compare the oneway and twoway RI methods. Therefore, we encode the cooccurrence matrix using both oneway and twoway RI. Second, while Kanerva, Kristoferson and Holst [31] used the LSA definition of context, we use a strategy similar to that in HAL and define the context as a window that spans \({\pm } 2\) words away from the word itself. This method implies that for each occurrence of a word, there will be four additional word–word relationships encoded in the cooccurrence matrix. This strategy avoids the potential difficulty of defining document boundaries in streaming text, and it captures semantic relations between words rather than topical relations. The length of the context window is a parameter that affects the quantitative results presented here, but it is not essential for our qualitative discussion. The third difference compared with the study by Kanerva, Kristoferson and Holst [31] is that we do not introduce cutoffs on term frequencies to further improve the result. Words such as “the”, “at” and “be” have high frequencies that render the occurrences of more interesting combinations less significant. We note that this effect is stronger for the twoway RI method than for the oneway RI method. We include the complete word–context spectrum, including the highfrequency relationships, and we present results for two different transformations of the spectrum. In one case, we directly encode the unaltered frequencies, and in the other case, we encode the square root of the frequencies. The square root decreases the relative significance of high frequencies, which improves the result and illustrates the importance of the definition of the weight in the feature extraction method.
We construct the cooccurrence matrix from 37, 620 short highschool level articles in the TASA (Touchstone Applied Science Associates, Inc.) corpus. The text has been morphologically normalised so that each word appears in its base form [32]. The text contains 74, 183 word types that are encoded in a cooccurrence matrix with oneway and twoway RI. In the case of oneway RI, we use index vectors of length 1000, so that the dimension reduction is \(74{,}183\times 74{,}183~:~1000\times 74{,}183 \rightarrow 74~:~1\). In the case of twoway RI, we use a state array of size \(1000\times 74{,}183\), thereby maintaining the same dimension reduction ratio. We repeat the twoway RI calculations using a square state array of size \(8612 \times 8612\). There are numerous misspellings (lowfrequency words) in the corpus, and the most frequent word is “the”, which occurs nearly 740, 000 times. At the second place is “be” with just over 420, 000 occurrences. Therefore, we define 32bit states in the implementation of RI.
Example of a TOEFL synonym test
Word  Number of occurrences 

Essential (given)  855 
Basic  1920 
Ordinary  837 
Eager  480 
Possible  3348 
The task to identify the correct synonym is addressed using the RIencoded cooccurrence matrices, and the vector semantic comparison method described in Sect. 2.4. We consider 80 synonym tests, each comprising five words. Using ordinary RI, 38 out of the 80 synonym tests are solved correctly, meaning that the cosine of angle between the given word and correct synonym is maximum. Repeating the experiment with the square root of frequencies and ordinary RI, 43 out of the 80 synonym tests are solved correctly. Using twoway RI and a square state array of size \(8612 \times 8612\) only 24 out of 80 synonym tests are solved correctly. Repeating the experiment with the square root of frequencies and twoway RI we obtain a similar result, 24 out of the 80 synonym tests are solved correctly. However, repeating the twoway RI experiment with the square root of frequencies and a state array of size \(1000\times 74{,}183\) we obtain 34 correct results out of 80.
These results can be further improved using other preprocessing methods, for example, by introducing weighted context windows and cutoffs on the encoded relationship frequencies [32], or by defining the weights as the logarithm of frequencies divided by the conditional entropy of the context given the word [35]. Furthermore, in order to enable numerical simulations on a standard PC we only consider higherorder RI of one distributed representation, while there is one distributed representation for each class/term in the case of oneway RI. This limitation can be avoided in largescale applications of RI at data centres, possibly leading to more favourable results. One benefit of the twoway RI method is that words can be defined on the fly with a minimum impact on the storage space needed. This property is interesting for the analysis of streaming data, where many occasional features may exist in the data that are not relevant for the longterm analysis.
4 Conclusions
Random indexing is a form of random projection with particularly low computational complexity, thanks to the high sparsity of the index vectors and the straightforward distributed coding of information. RI has numerous applications and has proven useful for solving challenging problems without introducing much complexity.
Here we generalise ordinary RI (1) of distributional vectors [31, 42] to RI of distributional arrays of arbitrary order, and we present results of simulation experiments with one and twoway RI. The software implementation [47] of the key equations (3), (5) and (12) supports Nway RI. Ordinary RI is used in numerous applications in natural language processing, where the possibility to approximate data in a compressed representation that can be updated incrementally at low computational cost and complexity in an online manner is useful. Furthermore, the compressed data structure can be used for semantic analysis by approximating inner products of distributional arrays at low computational cost using (12), without prior decoding of data. These properties make RI interesting for the analysis of streaming data, in particular when explicit storage of data is infeasible. Incremental processing of streaming data is not explicitly investigated in the simulation experiments presented in this work, but the data sets considered are encoded in an incremental fashion, one item after the other. The low computational complexity and incremental character of the encoding algorithm (3) make applications to streaming data straightforward. Furthermore, the possibility to extend the length of the random indices and therefore dynamically extend the number of properties that can be compressed in the state array makes RI interesting for analysis of streaming data. The generalisation of RI from distributional vectors to distributional arrays opens up for analysis of higherorder relationships, for example context or timedependent associations of terms in streaming text (third order), or the context and timedependent associations between terms (fourth order).
Our simulation results confirm the expectation that the approximation error is lower for ordinary oneway RI compared to twoway RI at constant size of the distributed representation. This is expected because each random index of an array is associated with additional random projections, which adds to the statespace noise. The benefit of twoway RI is that multiple classes of features can be encoded in a distributed representation of constant size and that new features can be defined on the fly with low impact on the storage space required. This property is interesting for the analysis of higherorder relationships between features derived from streaming data, where the number of potential features and relationships can be astronomical.
Higherorder RI can be applied to multiple distributional arrays, just like ordinary RI (1) typically is applied to a set of distributional vectors, \(\mathbf {s}_i\). For example, in ordinary RI of cooccurrence matrices each column of the cooccurrence matrix is represented in a distributional vector, and each row of the cooccurrence matrix refers to an index vector that defines a random projection on the distributional vectors. This way the effect of statespace noise on the accuracy of term representations is minimised because each term is associated with a unique distributional vector. However, in this case the size of the distributed representation is proportional to the number of terms, which limits the scalability of the approach. Depending on the application requirements this principle can also be applied to higherorder RI when balancing between reasonable approximation errors, data storage demands and computational cost.
From a technical point of view we note that RI requires index vectors of high dimensionality (typically \(n>10^3\)), otherwise the variance related to the approximate random projections explodes and renders the approach practically useless. This tendency is well described by Kanerva [30] in his paper about computation in “hyperdimensional” spaces, and we have observed a similar effect in earlier work on distributional models based on highdimensional random projections [15, 16]. For highdimensional representations we find that the variances of approximated inner products and decoded weights decrease with increasing dimensionality and that the expectation values are practically invariant with respect to dimensionality. Furthermore, we find that the number of nonzero trits in the index vectors, \(\chi _\mathcal{D}\), has an effect on the accuracy of RI. The accuracy increases notably when increasing \(\chi _\mathcal{D}\) from two to four, but not much beyond \(\chi _\mathcal{D}=8\). Therefore, our simulation experiments suggest that \(\chi _\mathcal{D}=8\) is the preferred choice for this hyperparameter since the computational cost of the encoding and semantic analysis algorithms increase with \(\chi _\mathcal{D}\).
In summary, RI is an incremental dimension reduction method that is computationally lightweight and well suited for online processing of streaming data not feasible to analyse with other, more accurate and complex methods [50]. For example, standard cooccurrence matrices in natural language processing applications can be extended with temporal information [26], linguistic relations [3, 53] and structural information in distributed representations [9, 59]. There have been few attempts at extending traditional matrixbased natural language processing methods to higherorder arrays due to the high computational cost involved. This is something that multidimensional RI is likely to facilitate.
Acknowledgements
The TASA and TOEFL items have been kindly provided by Professor Thomas Landauer, University of Colorado. We thank Pentti Kanerva for comments on an early draft of the manuscript [48] and anonymous referees for useful comments and suggestions that helped us improve the clarity of the manuscript. The possibility to extend RI to two dimensions is mentioned in [30] at p. 153. This work is partially funded by the Kempe Foundations.
Funding information
Funder Name  Grant Number  Funding Note 

The Kempe Foundations 

Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.