Disentangling narrow and coarse semantic networks in the brain: The role of computational models of word meaning
There has been a recent boom in research relating semantic space computational models to fMRI data, in an effort to better understand how the brain represents semantic information. In the first study reported here, we expanded on a previous study to examine how different semantic space models and modeling parameters affect the abilities of these computational models to predict brain activation in a data-driven set of 500 selected voxels. The findings suggest that these computational models may contain distinct types of semantic information that relate to different brain areas in different ways. On the basis of these findings, in a second study we conducted an additional exploratory analysis of theoretically motivated brain regions in the language network. We demonstrated that data-driven computational models can be successfully integrated into theoretical frameworks to inform and test theories of semantic representation and processing. The findings from our work are discussed in light of future directions for neuroimaging and computational research.
Keywords: LSA; HAL; Semantic space models; Coarse semantic coding; fMRI
Latent semantic analysis (LSA; Landauer & Dumais, 1997) and the hyperspace analogue to language (HAL; Lund & Burgess, 1996) are among the most influential computational models of word meaning. LSA and HAL, among other so-called “semantic space models” or “distributional semantic models,” use word co-occurrence frequencies as the basic building blocks for word meaning (see Jones, Willits, & Dennis, 2015, for a recent review). In these models, the frequencies with which a word occurs in each document (as in LSA) or co-occurs with each other word (as in HAL) are used to build the vector representation for that word, typically based on a very large-scale text corpus. The resulting representation of any target word is a high-dimensional vector with each dimension denoting either a word (word-to-word matrix) or a document (word-to-document matrix). The raw vectors may consist of thousands or tens of thousands of dimensions and are usually very sparse. Dimension reduction methods are often used to reduce the number of dimensions in these models. These standard methods used by LSA and HAL have since been further developed or expanded. For example, probabilistic LSA (Hoffman, 2001) and its fully Bayesian extension, the Topic model (Griffiths, Steyvers, & Tenenbaum, 2007), can identify lexemes with multiple senses (Tomar et al., 2013) and generate semantic representations as probability distributions rather than points in a high-dimensional space. Positive pointwise mutual information (PPMI) has been used in place of raw co-occurrence frequencies (Bullinaria & Levy, 2007). Zhao, Li, and Kohonen (2011) integrated these models into a self-organizing map framework, and Fyshe, Talukdar, Murphy, and Mitchell (2013) discussed how different types of constraints on what counts as a co-occurrence qualitatively affect semantic information.
Evaluation of semantic space models
Computational models of word meanings have been evaluated in a variety of contexts and by a number of methods, most notably with word similarity judgment tasks. Such tasks include grouping words into categories or choosing the word whose definition is most similar to that of a target word. For example, LSA was originally evaluated by its ability to correctly identify synonyms on a portion of the Test of English as a Foreign Language, where it performed on par with the average for nonnative English speakers (Landauer & Dumais, 1997). The HAL model was evaluated by demonstrating that for a set of 20 words, each target word’s nearest neighbor (out of 70,000 possible neighbors) was generally a word that was very similar in meaning to the target word (Lund & Burgess, 1996). Additionally, in the HAL model multidimensional scaling was performed on a different set of semantic vectors from three different, superordinate, semantic categories, and the results indicated that the same vectors contain “categorical” semantic information. These and a number of other word similarity judgment tasks have remained common benchmarks for assessing the quality of semantic space models (Bullinaria & Levy, 2007; Fyshe et al., 2013; Rohde, Gonnerman, & Plaut, 2006). A number of similarity metrics have also been used for comparing vectors in semantic spaces; the Euclidean distance and the cosine similarity are among the most popular (see Rohde et al., 2006, for a detailed discussion).
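To illustrate the two similarity metrics just mentioned, the following minimal sketch computes the cosine similarity and the Euclidean distance for a pair of made-up vectors. The function names and the toy values are ours and are not taken from any of the cited studies.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 means same direction)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Straight-line distance between two vectors (0 means identical)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.linalg.norm(u - v))

# Toy, made-up vectors: b points in the same direction as a but is twice as long.
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])
cos_ab = cosine_similarity(a, b)    # insensitive to vector length
dist_ab = euclidean_distance(a, b)  # sensitive to vector length
```

The example shows why the choice of metric matters: the two vectors are maximally similar by cosine yet separated by a nonzero Euclidean distance, because only the latter is affected by vector length.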
Although word similarity judgment tasks have demonstrated that these semantic spaces faithfully capture human knowledge representation, developmental evidence suggests that they may also mimic the way in which children or adults learn these representations. The LSA model was shown to reflect the speed with which children learn words while reading (Landauer & Dumais, 1997). The HAL model was shown to simulate children’s semantic representations when applied to parental speech that the children receive (Li, Burgess, & Lund, 2000). Other models have shown plausible simulations of the acquisition of monolingual and bilingual representations based on the timing of when each language is learned (Li & Zhao, 2013; Li, Zhao, & MacWhinney, 2007; Zhao & Li, 2010) as well as the loss of the first language lexicon depending on age of onset of the second language (Zinszer & Li, 2010).
Many factors are known to affect the quality of semantic space models such as the dimensionality of the semantic space (Bullinaria & Levy, 2007; Landauer & Dumais, 1997; Lund & Burgess, 1996), the method of dimensionality reduction (Bullinaria & Levy, 2012; Rohde, Gonnerman, & Plaut, 2006), the size of the window used to count co-occurrences (Fyshe et al., 2013), and factors such as whether or not to exclude function words (stop lists) and removing grammatical endings (stemming; Bullinaria & Levy, 2012). The proper number of dimensions is, for example, determined empirically in most cases, with many researchers reporting a number around 300 despite using different dimensionality reduction methods (Landauer & Dumais, 1997; Lund & Burgess, 1996). However, these “magic” 300 dimensions do not always work well. Fyshe, Talukdar, Murphy, and Mitchell (2013) report peak performance with around 600 dimensions. Bullinaria and Levy (2007) report maximum performance around 1,000 dimensions. The authors of these two studies used PPMI instead of a raw co-occurrence frequency prior to dimension reduction and argued that PPMI may capture more semantic information in these higher dimensions than raw co-occurrence frequencies.
There are also many ways to determine window size—for example, by document boundaries, sentence boundaries, phrase boundaries, fixed word length boundaries, or by syntactic relationships after parsing the sentences. Fyshe, Talukdar, Murphy, and Mitchell (2013) compared models that used document boundaries (document vectors) with models that used syntactic dependencies (dependency vectors). Both vector types were formed using the dependency parsed ClueWeb09 corpus (Callan & Hoy, 2009; Fyshe et al., 2013; Hall et al., 2007). Each element of the document vectors for each word consisted of a single integer corresponding to the number of times a word appeared in a given document (indicated by the column number of the row vector). The elements of the dependency vectors consisted of single integers corresponding to the number of times a word appeared in a dependency parsed syntactic relationship with a particular word or phrase (again indicated by the column number of the row vector).
Thus, in the dependency vectors, multiple columns may have corresponded to the same word in the corpus, but not to the same word-dependency-parse pair. For example, the nouns couch and ball could conceivably appear in both agent and patient roles relative to one another. Take the following sentences, for example: “The ball hit the couch,” and “The couch supported the ball.” In this case, if the row vector in question corresponded to the target word couch, two columns could correspond to the word ball, a column feature in which ball was a patient and one in which ball was an agent. For another target word, say person, it is likely that the frequency count of the patient-ball column feature would be much higher than the agent-ball column feature, since people are more likely to act on the ball than vice versa—although balls can and certainly do hit people, too. It is also important to note that in Fyshe et al. (2013), some phrases were counted as “single units.” Thus, it was possible for a phrase to be the target “unit” corresponding to a row vector in either the dependency vectors or the document vectors, or even as part of a column feature in the dependency vectors. The example given in the original paper is “a 27-inch television,” where “a” modifies the noun phrase “27-inch television” (Fyshe et al., 2013). In the case of phrases, dependency relationships within a phrase being treated as a single word unit were not counted as separate column features. These co-occurrences, regardless of the way in which they were computed, were followed by transformations using PPMI and by dimension reduction methods such as singular value decomposition (SVD).
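The final step described above, a PPMI transform followed by SVD, can be sketched roughly as follows. This is a minimal illustration on a made-up 3 × 3 count matrix and not the pipeline of Fyshe et al. (2013); the function names `ppmi` and `reduce_svd`, the base-2 logarithm, and the choice to scale the left singular vectors by the singular values are all assumptions made for the example.

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information of a word-by-context count matrix."""
    counts = np.asarray(counts, dtype=float)
    p_joint = counts / counts.sum()
    p_word = p_joint.sum(axis=1, keepdims=True)     # row marginals
    p_context = p_joint.sum(axis=0, keepdims=True)  # column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_joint / (p_word * p_context))
    pmi[~np.isfinite(pmi)] = 0.0                    # zero counts give -inf or nan
    return np.maximum(pmi, 0.0)                     # keep only positive associations

def reduce_svd(matrix, k):
    """Project rows onto the first k singular dimensions (assumed scaling: U * S)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

# Made-up counts: rows are target words, columns are context features.
toy_counts = np.array([[4, 0, 1],
                       [0, 3, 1],
                       [1, 1, 2]])
toy_ppmi = ppmi(toy_counts)
toy_reduced = reduce_svd(toy_ppmi, k=2)
```

Cells with a zero count receive a PPMI of zero, so the transformed matrix stays sparse before the SVD is applied.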
Finally, a third type of vector was formed in Fyshe et al. (2013) by concatenating the previous two after applying the SVD. It was found that the dependency and concatenated vectors generally performed the best on simulating word similarity judgment tasks. The authors also found qualitative differences between the nearest neighbors of words in each model. Nearest neighbors in the dependency vector model were generally synonyms with similar syntactic properties, whereas nearest neighbors in the document vector model tended to be more loosely, or “topically,” related.
Application of semantic space models to neuroimaging data analysis
More recently, researchers have applied semantic space models to the analysis of neuroimaging data. Mitchell et al. (2008) was one of the first studies to use computational models of word meaning to predict functional magnetic resonance imaging (fMRI) data. The fMRI data were collected while participants viewed 60 nouns, each presented on a screen in the scanner, and were instructed to think silently about the meaning of each noun. This study demonstrated that features such as co-occurrence frequencies were able to predict the blood oxygenation level dependent (BOLD) responses observed in fMRI data.
It is important to note that the model developed by Mitchell et al. (2008) differed from traditional models in several important ways. First, semantic dimensions were manually chosen depending on how relevant a given dimension was believed to be for a given word (e.g., “eat” for celery), rather than resulting from any dimensionality reduction method. Second, there were only 25 semantic dimensions, significantly fewer than the 300 dimensions assumed to be optimal, as we discussed earlier. Furthermore, the numerical values of each dimension were the raw co-occurrence frequency counts after each word vector with 25 dimensions had been normalized to unit length. The motivation for such a model was likely the small number of data points and the need for a relatively high prediction accuracy of the model. It could be difficult for a computational model to learn the mappings from 300 semantic dimensions to the BOLD response observed in a given brain area with the limited number of data points typically acquired in a single neuroimaging experiment. This difficulty is due to the time, cost, and physiological constraints inherent in fMRI experimental design, as well as the need to average over several trials to increase the signal-to-noise ratio. Since Mitchell et al.’s seminal article, a number of studies have presented methods that improved the original model’s performance and extended the model to more general contexts. Bullinaria and Levy (2013) achieved the best prediction accuracy to date, and addressed important issues in the assessment of accuracy in Mitchell et al. (2008). The authors used PPMI instead of raw co-occurrence frequency counts. However, they also based their semantic vectors on a different corpus, so it was unclear whether it was the PPMI method or the corpus that contributed to the model improvement.
Another study by Murphy, Talukdar, and Mitchell (2012) compared a large number of models in their ability to predict brain activation for concrete nouns using the original data set from Mitchell et al. (2008). The researchers examined the effect of dimensionality on model accuracy, but only using two different numbers of dimensions, 300 and 1,000. They also investigated the effect of frequency cutoffs for co-occurrence counts. Despite examining a large number of models, the authors confounded several differences between each pair of models, and the differences in model performance could not be linked to specific variables. Additionally, they employed a regularized regression method with 300 and 1,000 dimensions, reporting improved performance for the vectors with 1,000 dimensions when compared to vectors with 300 dimensions. Regularized regression models are capable of solving regression equations that have a greater number of predictor variables than data points by shrinking the values of poor predictors to zero, and then only solving for a subset of the predictors. However, this improved performance came at the cost of the lack of transparency of the semantic features. In the original study by Mitchell et al. (2008), the semantic features were very clearly related to perceptual and motor experiences pertaining to concrete objects (“taste,” “see,” “run,” etc.), which was reflected in the mappings of the features onto different brain regions. When using computationally derived semantic features that are the result of SVD, for example, it is impossible to say exactly what those features represent.
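To make the point about regularization concrete, the sketch below fits a regression with more predictors (300) than data points (58) on synthetic data. It uses ridge (L2) regression purely as an illustrative stand-in: ridge shrinks weights toward zero without setting any of them exactly to zero, and it should not be read as the specific penalty used by Murphy et al. (2012). The function name, the penalty values, and the random data are all assumptions.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Ridge weights via the dual form: w = X^T (X X^T + alpha * I)^(-1) y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples = X.shape[0]
    # The n x n system is well posed for alpha > 0 even when predictors > samples.
    dual = np.linalg.solve(X @ X.T + alpha * np.eye(n_samples), y)
    return X.T @ dual

rng = np.random.default_rng(0)
X = rng.standard_normal((58, 300))  # synthetic: 58 training words, 300 dimensions
y = rng.standard_normal(58)         # synthetic activation values for one voxel

w_weak = ridge_fit(X, y, alpha=0.1)      # light penalty
w_strong = ridge_fit(X, y, alpha=100.0)  # heavy penalty, smaller weights
```

Ordinary least squares has no unique solution in this setting, whereas the penalized problem does; increasing `alpha` trades fit to the training words for smaller weights.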
Although an increasing number of studies have shown that it is possible to decode semantic information from fMRI data, very few studies have used semantic space models to test specific hypotheses of semantic representation in the brain. Previous methods have been primarily concerned with the accuracy of the algorithms, but have confounded potentially interesting differences between the models in terms of how different models might relate to psycholinguistic theories. Still, a small number of studies have, in fact, taken a more theoretical approach in relating computational models to psycholinguistic theories of semantic representation (Anderson, Bruni, Lopopolo, Poesio, & Baroni, 2015; Crutch, Troche, Reilly, & Ridgway, 2013).
Anderson et al. (2015) investigated how different brain regions process mental imagery when reading text. They attempted to categorize macrostructures as primarily semantic or visual processing regions by comparing different brain regions to two different computational models, one based on visual features and another based on word co-occurrences. In the case of the fMRI data, brain regions in the core language network and occipital lobes can be roughly categorized by the basic information type (semantic or visual) that is processed in that region—although there may be some instances of considerable overlap at the macrostructural level. In the computational models, the basic information type was known and completely determined by the researchers, with no overlap between the two models. Thus, the computational models acted as “ideal” systems that processed only semantic or only visual information.
They compared the representational similarities between all pairs of words in different brain regions to those predicted by the two computational models. For example, the distances between concepts like cat and tiger or dog and wolf from a visual information processor might suggest that dog and wolf are extremely close in terms of their representational distances, whereas dog and cat are farther apart. However, a semantic information processor might place dog and cat closer to each other than dog and wolf. The hypothesis in this study is that, if a particular brain region predominantly processes one type of information relative to the other, then the observed pattern of pairwise distances should correlate more highly with those from the corresponding computational model. The results partially support this prediction, in that they do seem to capture a broad distinction between occipital regions and temporo-parietal/frontal regions in the core language network. However, they report null results in language regions such as the superior temporal gyrus and supramarginal gyrus.
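The logic of comparing pairwise distances in a brain region with those predicted by a model can be sketched as follows. This is a generic illustration on random data, not the analysis of Anderson et al. (2015); the use of correlation distance, the Pearson correlation between distance vectors, and the names `pairwise_distances` and `rsa_score` are assumptions.

```python
import numpy as np

def pairwise_distances(patterns):
    """Correlation distances (1 - r) between all pairs of rows, as a flat vector."""
    patterns = np.asarray(patterns, dtype=float)
    corr = np.corrcoef(patterns)                      # rows are words
    upper = np.triu_indices(patterns.shape[0], k=1)   # each pair counted once
    return 1.0 - corr[upper]

def rsa_score(region_patterns, model_vectors):
    """Correlation between a region's pairwise distances and a model's."""
    d_region = pairwise_distances(region_patterns)
    d_model = pairwise_distances(model_vectors)
    return float(np.corrcoef(d_region, d_model)[0, 1])

rng = np.random.default_rng(0)
model_vectors = rng.standard_normal((6, 10))     # synthetic: 6 words x 10 dimensions
region_patterns = rng.standard_normal((6, 40))   # synthetic: 6 words x 40 voxels
score = rsa_score(region_patterns, model_vectors)
```

Under the hypothesis described above, a region that predominantly processes one type of information should yield a higher score with the corresponding model than with the other model.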
Narrow versus coarse semantic processing
One hypothesis that could be informed by exploring the relationship between computational models and regional differences in semantic representation is the debate on coarse semantic processing in the right hemisphere (Beeman, 1998; Burgess & Simpson, 1988; Jung-Beeman, 2005). The coarse coding hypothesis posits that the left hemisphere (LH) is responsible for processing a narrower, or central, sense/feature of word meaning, while the right hemisphere (RH) is involved in processing coarse semantic information, or meaning that is only loosely related to a given word being processed, such as establishing relations between weak or disperse semantic connections, organizing pragmatic information, or reinterpreting language stimuli. This hypothesis characterizes RH semantic information as being more coarse than semantic representations in the LH because the RH is thought to activate more widespread semantic networks than the LH even at the single concept level (Jung-Beeman, 2005). Therefore, individual concepts are less specific; some of the discriminating factors between concepts are forfeited in exchange for access to a larger semantic network. Coarse semantic processing is thought to play an important role in inference making (Beeman, 1993; Beeman, Bowden, & Gernsbacher, 2000), metaphor processing (Yang, 2014), reading (St. George, Kutas, Martinez, & Sereno, 1999; Sandak, Mencl, Frost, & Pugh, 2004), idiom processing (Yang et al., 2016), and insight (Bowden & Beeman, 1998).
Coarsely and narrowly related word pairs have previously been operationalized in terms of homographs that have clear subordinate and dominant meanings. For example, the word pan could be paired with the word pot or with the word money (as in panning for gold; see Burgess & Simpson, 1988): The narrow association is pot and pan, and the coarse association is pan and money. The theory posits that in the RH the representation of pan would be more similar to the representation of money than in the LH. The theory also suggests that in the RH the representation of pan is less similar to the representation of pot than in the LH.
Early behavioral studies linked coarse semantic processing to the RH using hemi-field experiments, in which stimulus information was manipulated to enter the brain through either the right or left hemisphere by presenting stimuli exclusively to the left or right visual field, respectively (Atchley, Keeney, & Burgess, 1999; Burgess & Simpson, 1988; Tompkins, Fassbinder, Scharp, & Meigh, 2008; Faust, Ben-Artzi, & Harel, 2008). Other methods have examined individuals with RH brain damage (Tompkins, Fassbinder, et al., 2008; Tompkins, Scharp, Meigh, & Fassbinder, 2008). With the advent of neuroimaging, researchers have been able to make finer spatial distinctions concerning the different functions that may be performed in different brain regions in both the RH and the LH. In particular, coarse semantic processing has been associated with the right middle temporal gyrus, the right superior temporal gyrus, and the right inferior frontal gyrus (Jung-Beeman, 2005), whereas narrower semantic processing is thought to occur primarily in homologous regions in the LH.
Evidence for the existence of this regional distinction in the brain has, thus far, only been indirect. Neuroimaging evidence has relied on the assumption that certain tasks, such as inference making, require greater coarse semantic processing (Virtue, Haberman, Clancy, Parrish, & Jung-Beeman, 2006), and behavioral priming evidence has been inconsistently associated with the RH. However, it is possible that certain parameters in semantic space models have a theoretical relationship with the coarseness and narrowness of semantic information. For example, Fyshe et al. (2013) demonstrated that the nearest neighbors of a given word in dependency models could be more successfully used as substitutes for the target word than the nearest neighbors of the same word in the document models. This can be seen in the three nearest neighbors of the target word “beautiful” in the dependency model (“wonderful,” “lovely,” and “excellent”), as contrasted with those from the document model (“wonderful,” “fantastic,” and “unspoiled”). Similarly, given the target word “dog,” the dependency model’s three nearest neighbors were “cat,” “dogs,” and “pet,” whereas the document model’s three nearest neighbors were “dogs,” “vet,” and “leash.”
Other computational studies report similar dichotomies describing how the parameterization of semantic space models can affect the nature of the meaning captured by a particular model. Two such dichotomies are substitutability versus associativity (Rubin, Kievit-Kylar, Willits, & Jones, 2014) and paradigmatic versus syntagmatic (Jones, Willits, & Dennis, 2015; Sahlgren, 2006). Both paradigmatic and substitutability refer to semantic similarity relationships in which words are more similar when they can be used in place of one another in a sentence. Such relationships frequently capture syntactic frames and part-of-speech information in addition to the similarity between the referents of two words. These ideas are consistent with our notion of narrow semantic information. By contrast, syntagmatic or associative relationships, much like coarse semantic information, generally capture words that have less similar referents but that are likely to occur together in speech. The syntagmatic versus paradigmatic and the substitutability versus associative relationships have been largely discussed in terms of computational models, whereas coarse and narrow semantic information emerged mainly from behavioral studies and later was applied to neuroimaging data.
Another possible parameter that might bear a theoretical relationship to coarseness and narrowness is dimensionality. One previous study reported improved decoding accuracy of the fMRI data from Mitchell et al. (2008) using vectors with higher numbers of dimensions (Murphy et al., 2012). Although higher numbers of dimensions have previously been thought of as introducing noise into the vector representations, the observed degradation of the word similarity structure may actually be caused by the increased number of dimensions creating coarser semantic representations. If the goal is to capture fine-grained similarity structure in word meaning, then these extra dimensions may be seen as noise. However, if the goal is to predict fMRI data, then they may better account for the variation in activity in some brain regions, specifically those in the RH semantic network.
The present study
Given the above considerations, the present study uses the vectors created in Fyshe et al. (2013; downloadable at https://www.cs.cmu.edu/~afyshe/papers/conll2013/) to systematically vary different semantic space models in terms of dimensionality, word ambiguity, and model type, and to examine their ability to predict brain activation for concrete nouns using the data from Mitchell et al. (2008; downloadable at https://www.cs.cmu.edu/afs/cs/project/theo-73/www/science2008/data.html). The goal of the present study is twofold. The primary goal is to provide a large-scale exploratory analysis of how different types of computational space models relate to different regions in the language network, in an effort to fill in a gap in our knowledge about how model parameters might relate to different types of semantic information in the brain. Second, we aim to specifically examine the effects of dimensionality, word ambiguity, and model type (dependency vs. document models) in the bilateral middle and superior temporal gyri and in the bilateral inferior frontal gyri. These regions are thought to be the most significant contributors to semantic processing (e.g., Binder & Desai, 2011), and the RH regions have been linked to coarse semantic processing (Jung-Beeman, 2005). For this latter analysis, we predict that we will find a complex semantic structure across the language network, such that there will be considerable regional variability in which types of model parameters affect how computational space models account for the variance in activity in a given region. However, we also expect that patterns consistent with current theories of semantic representation will emerge.
Finally and most important, as for the coarseness and narrowness distinction, we predict that document models and vectors with higher numbers of dimensions will better correlate with patterns of voxel activity in the RH semantic network, whereas dependency models and vectors with lower numbers of dimensions will correlate more highly with the patterns of voxel activity in the LH semantic network.
The goals of Study 1 were to systematically explore the effects of dimensionality and model type on the accuracy of predicting fMRI data for concrete nouns and to validate that our algorithm captures semantic information at a level above chance. This study closely follows the original procedures used by Mitchell et al. (2008) for testing a given set of semantic features’ ability to predict fMRI data for concrete nouns using leave-two-out cross-validation to train and test the model.
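As we understand the leave-two-out procedure, a linear model is trained on all but two words, images are predicted for the two held-out words, and the trial counts as correct if the predicted images match their own observed images better than each other's. The sketch below illustrates this on synthetic data. It is a simplified approximation rather than the original code: it uses ordinary least squares with an intercept and cosine similarity for the matching, and it omits voxel selection. All names and the toy data are assumptions.

```python
import numpy as np
from itertools import combinations

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def leave_two_out_accuracy(features, images):
    """Fraction of held-out word pairs whose predicted images are matched correctly."""
    features = np.asarray(features, dtype=float)
    images = np.asarray(images, dtype=float)
    n_words = features.shape[0]
    design = np.hstack([features, np.ones((n_words, 1))])  # add an intercept column
    pairs = list(combinations(range(n_words), 2))           # 60 words give 1,770 pairs
    correct = 0
    for i, j in pairs:
        train = np.setdiff1d(np.arange(n_words), [i, j])
        weights, *_ = np.linalg.lstsq(design[train], images[train], rcond=None)
        pred_i, pred_j = design[i] @ weights, design[j] @ weights
        matched = cosine(pred_i, images[i]) + cosine(pred_j, images[j])
        swapped = cosine(pred_i, images[j]) + cosine(pred_j, images[i])
        correct += int(matched > swapped)
    return correct / len(pairs)

# Synthetic check: images are a noisy linear function of the features.
rng = np.random.default_rng(0)
toy_features = rng.standard_normal((12, 3))
toy_images = toy_features @ rng.standard_normal((3, 20)) + 0.01 * rng.standard_normal((12, 20))
accuracy = leave_two_out_accuracy(toy_features, toy_images)
```

Because each trial is a forced choice between two possible matchings, chance performance under this scheme is 0.5.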
The semantic vectors used in the present study were created on the basis of the semantic vectors from Fyshe et al. (2013). The vectors had already been preprocessed upon download. Each vector consisted of 2,000 dimensions, but the first 1,000 and second 1,000 dimensions were created using different models and then concatenated together. The two models were the dependency model and the document model described earlier. These two models primarily differed in how they counted co-occurrences.
The document model counted the number of times a word appeared in each document, such that the integer value in each column stood for that count. Each column corresponded to a unique document. Every word was represented by a unique row vector with the same number of columns standing for the same documents. Thus, two words co-occurred if they both had values greater than zero in the same column. The dependency vectors, on the other hand, counted co-occurrences on the basis of dependency relationships with other words, as produced by an automatic sentence parser (Hall et al., 2007). The column features for each of the words/row vectors in the dependency model were therefore integers indicating the number of times the target word appeared in a certain dependency relationship with a particular word. If two different row vectors both had an integer value greater than zero in the same column, the two words to which these row vectors corresponded both appeared in the same dependency relationship with the same word. For example, the words door and ball could both appear in the patient relationship relative to the word kicked if they occurred in sentences describing that they were both kicked by some agent.
All raw co-occurrence frequencies were transformed using PPMI. These two distinct data matrices were then reduced to 1,000 dimensions using SVD. At this point, the matrices were concatenated horizontally. In each of the matrices, the same row numbers correspond to the same words, so each word has a 2,000-dimensional representation (2,000 column features). To isolate only the document representation or only the dependency representation, either the first or second 1,000 dimensions were extracted. To look at effects of dimensionality, only the first 150, 300, 600, or all 1,000 dimensions of each of the vector types were used. In the case of concatenated vectors, the dimensionality was doubled.
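The extraction step can be sketched as simple column slicing. Two points in this sketch are assumptions rather than facts from the source: which half of the 2,000 columns holds which model (controlled here by the hypothetical `first_half` argument and to be checked against the downloaded files), and the reading that a concatenated vector joins the first k dimensions of each model.

```python
import numpy as np

HALF = 1000  # each model contributes 1,000 SVD dimensions

def slice_vectors(full, model, k, first_half="dependency"):
    """Return the first k dimensions of the requested model from 2,000-d rows.

    `first_half` names the model ASSUMED to occupy columns 0-999.
    """
    full = np.asarray(full)
    first, second = full[:, :HALF], full[:, HALF:]
    if first_half == "dependency":
        dependency, document = first, second
    else:
        dependency, document = second, first
    if model == "dependency":
        return dependency[:, :k]
    if model == "document":
        return document[:, :k]
    if model == "concatenated":
        # Assumed reading: first k dimensions of each model, side by side.
        return np.hstack([dependency[:, :k], document[:, :k]])
    raise ValueError(f"unknown model: {model}")

# Placeholder matrix standing in for three 2,000-dimensional word vectors.
toy_full = np.arange(3 * 2000, dtype=float).reshape(3, 2000)
doc_300 = slice_vectors(toy_full, "document", 300)
concat_150 = slice_vectors(toy_full, "concatenated", 150)
```

Taking the first k columns is meaningful here because SVD orders dimensions by decreasing singular value, so truncation keeps the dimensions that account for the most variance.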
In the Fyshe et al. (2013) vectors, each row vector corresponds to a word and part-of-speech pair, or a part-of-speech tagged phrase. Thus, some target words we were interested in had multiple vector representations depending on whether or not the word appeared in a phrase or had multiple parts of speech. For example, “bear” had a vector as a verb and another as a noun. So, we included one final manipulation in how we calculated the final vectors for each of our target words, in the following three ways: (1) vectors that were the averages of all single word vectors that corresponded to the target word or phrase vectors whose corresponding phrase began with the target word, regardless of the part of speech; (2) vectors that were the averages of all single word vectors corresponding to the target word, regardless of part of speech but excluding phrase vectors; and (3) vectors that were the average of all single word vectors corresponding to the target word and were the appropriate part of speech (noun or verb). These three types of vectors reflect different ambiguity constraints and represent a narrowing of the specificity of the information represented, from the least specific (the first type, in which phrasal and word vectors with different parts of speech were included) to the most specific (the third type, in which only word vectors with appropriate parts of speech were included). However, phrase vectors could also be seen as more specific than single word vectors in the sense that phrases often have more constrained usage than a single word. As for whether to include all or only specific parts of speech, it is less ambiguous to use only vectors for a single part of speech, as the meanings of words with multiple parts of speech may be very distantly related.
From these data, we extracted vectors for the 60 nouns and 25 verbs used in Mitchell et al. (2008; see the supporting online materials). All of the nouns were concrete objects. There were 12 categories with five nouns in each category. The verbs were manually chosen by the researchers and consist of actions associated with the concrete objects (see Mitchell et al., 2008, for details). For each noun and each verb, we created 36 different vectors (3 Ambiguity Constraints × 3 Vector Types × 4 Dimensionalities). That is, for each vector type (dependency, document, concatenated), we created three types of vectors based on different ambiguity constraints, as we discussed above, and each of these nine vectors was also sliced at four different dimensionalities (150, 300, 600, and 1,000 dimensions, or double these in the concatenated case).
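The 3 × 3 × 4 design can be enumerated directly, which also confirms the count of 36. The condition labels below are hypothetical shorthand introduced for this sketch and are not names used in the source data.

```python
from itertools import product

# Hypothetical labels for the three ambiguity constraints described above.
AMBIGUITY = ("words_and_phrases_any_pos", "words_any_pos", "words_matching_pos")
VECTOR_TYPES = ("dependency", "document", "concatenated")
DIMENSIONS = (150, 300, 600, 1000)

conditions = list(product(AMBIGUITY, VECTOR_TYPES, DIMENSIONS))

def total_dims(vector_type, k):
    """Concatenated vectors hold k dimensions from each of the two models."""
    return 2 * k if vector_type == "concatenated" else k

n_conditions = len(conditions)
largest = max(total_dims(vector_type, k) for _, vector_type, k in conditions)
```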
Mitchell et al. (2008) used vectors with only 25 dimensions for each noun, where each dimension represented the co-occurrence count of the noun in question with one of 25 manually selected verbs. These 25 semantic dimensions were interpretable and provided an intuitive understanding of the words. For example, the word celery would have a high loading on the dimension that was the co-occurrence with the word eat. Such transparency is not possible with the latent dimensions that are produced in traditional vector space models. Furthermore, the dimensions of the dependency and document models were based on very different criteria. To make the two models more directly comparable, we put all of the models on the same “scale,” such that each of the 60 nouns was represented by a row vector with 25 column features, which corresponded to semantic dimensions relating to the 25 verbs. Thus, any differences between the performance of the models could be attributed to the variables of interest, while controlling for the nature of the semantic features and maintaining the original transparency of the semantic features in Mitchell et al. (2008). To create this structure with the extracted vectors, we performed a second stage of dimension reduction for each of the 36 sets of vectors. For any given set of vectors, regardless of the number of dimensions after the first stage of dimensionality reduction (150/300, 300/600, 600/1,200, or 1,000/2,000), the final vectors contained only 25 dimensions for all 36 sets. Instead of using co-occurrence counts in these dimensions as in Mitchell et al. (2008), we used the cosine similarities between a given noun vector and the 25 verb vectors to produce the final set of semantic features for each of the 60 nouns.
To make this more concrete, we provide here an example of the creation of one set of vectors from start to finish. To create the vectors for the document model with 300 dimensions and an ambiguity constraint that allowed for phrases and nouns regardless of part of speech, we first collected all of the vectors corresponding to all of the 60 nouns and 25 verbs, without controlling for part of speech and including phrase vectors that began with a given noun or verb. Since there were multiple vectors for a single noun or verb, we averaged the vectors. At this point, the vectors still had 2,000 dimensions. Then, we removed the 1,000 dimensions corresponding to the dependency model. Then, we selected only the first 300 document dimensions of the remaining 1,000 for each averaged noun and verb vector. Finally, we took the cosine similarities between each noun vector and all of the verb vectors to create a final set of 60 row vectors (corresponding to the 60 concrete nouns) with 25 columns (dimensions or semantic features). Although each model contained the same number of dimensions, the cosine similarities differed across each model because they depended on the values in the vector, which in turn depended on the vector types, ambiguity constraints, and number of dimensions.
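The steps in this example can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the study: the function names (`average_vectors`, `noun_features`) and the toy vectors are hypothetical, and the input vectors are assumed to already contain only the document dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_vectors(vectors):
    """Element-wise mean of several vectors that belong to the same word."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def noun_features(noun_vecs, verb_vecs, n_dims):
    """Keep the first n_dims dimensions of every vector, then describe each
    noun by its cosine similarity to each verb (one column per verb)."""
    verbs = [v[:n_dims] for v in verb_vecs]
    return [[cosine(n[:n_dims], v) for v in verbs] for n in noun_vecs]

# Toy data (made up): 2 nouns, 3 verbs, 4 dimensions, of which 2 are kept.
# The first noun has two entries (e.g., a word and a phrase), so they are averaged.
nouns = [
    average_vectors([[1.0, 0.0, 5.0, 5.0], [3.0, 0.0, 1.0, 1.0]]),
    [0.0, 2.0, 9.0, 9.0],
]
verbs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
features = noun_features(nouns, verbs, n_dims=2)  # 2 rows (nouns) x 3 columns (verbs)
```

With the real data, the same procedure would yield a 60 × 25 matrix, whatever the value of `n_dims`.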
The fMRI data set contains nine participants’ functional data from a 360-trial experiment in which each of 60 nouns was presented six times in a fast event-related design. The online data are already preprocessed and contain estimates of the peak of the hemodynamic response function for each trial for each voxel in the brain in a MATLAB structure array. In our analyses, each participant’s fMRI data were reduced from 360 vectors to 60 by averaging the six trials of each noun. From each of these 60 vectors, we also subtracted the grand mean of all of the vectors. We selected 500 voxels exactly as described in Mitchell et al. (2008). Thus, the fMRI data used in Study 1 consisted of 60 vectors, each with 500 voxels.
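The trial averaging and mean subtraction can be illustrated as follows. This is a minimal sketch on made-up numbers; the function names are hypothetical, and the "grand mean" is assumed here to be the mean vector across nouns (one value per voxel), which is one plausible reading of the description above.

```python
def average_trials(trials, labels):
    """Average the voxel vectors of all trials that share a noun label."""
    grouped = {}
    for label, vec in zip(labels, trials):
        grouped.setdefault(label, []).append(vec)
    nouns = sorted(grouped)
    means = [
        [sum(column) / len(grouped[noun]) for column in zip(*grouped[noun])]
        for noun in nouns
    ]
    return nouns, means

def subtract_grand_mean(vectors):
    """Subtract the mean vector across nouns from every noun vector.
    Assumption: the grand mean is computed separately for each voxel."""
    n = len(vectors)
    grand = [sum(column) / n for column in zip(*vectors)]
    return [[x - g for x, g in zip(vec, grand)] for vec in vectors]

# Toy data (made up): 2 nouns x 2 trials each, 2 voxels per trial.
labels = ["celery", "celery", "hammer", "hammer"]
trials = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 10.0]]
noun_order, noun_means = average_trials(trials, labels)
centered = subtract_grand_mean(noun_means)
```

Applied to one participant's 360 trials, the same two steps would give 60 mean-centered vectors.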
The analyses evaluated the ability of a partial least squares (PLS) regression model to decode brain activity. We used leave-two-out cross-validation in which the regression model was trained on 58 of the 60 nouns, as in Mitchell et al. (2008). The design matrix comprised 58 rows, corresponding to the 58 concrete nouns left in the training set, and 500 columns, which were the average voxel activity at the 500 selected voxels for each of the 58 nouns. The dependent variables were the 25 semantic features (columns) of the 58 nouns (rows), whose calculations were described in the previous section. The 500 selected voxels were reduced to 13 components in the PLS regression model. This number was chosen on the basis of an analysis of the accuracy of Mitchell’s original predictors (the normalized raw co-occurrence frequencies of the nouns and verbs) and the predictor variables we created.
The mappings learned from the 500 voxels were then used to predict the semantic feature vectors for the remaining two left-out words. The model was then tested on two possibilities: (a) The model was said to have correctly discerned between the two left-out words if the predicted vectors for the two left-out words were more similar to the correct actual vectors (i.e., the cosine similarities between the predicted vector for Word 1 and the actual vector for Word 1 and between the predicted vector for Word 2 and the actual vector for Word 2) than to the two incorrect actual vectors (i.e., the cosine similarity between the predicted vector for Word 1 and the actual vector for Word 2, etc.); (b) the model was said to have incorrectly discerned between the two left-out words if the reverse occurred. Accuracy was determined by dividing the number of correct trials by 1,770, the total number of ways to choose two words to be left out. We repeated this analysis for all 36 of the models, and averaged the performance across vector types and ambiguity constraints. A model was considered to perform better than chance at 62% accuracy according to a null distribution calculated in the original study by Mitchell et al. (2008). The null distribution was based on random sets of semantic features, so a model that achieves above 62% accuracy can be thought to capture more semantic information in the brain than would be expected by chance.
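The matching criterion and the count of 1,770 held-out pairs can be made concrete with a short sketch. It is illustrative only: the `predict` callback is a hypothetical stand-in for the trained regression model, and the two cosine similarities within each pairing are assumed to be summed before the pairings are compared.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pair_is_correct(pred1, pred2, true1, true2):
    """True if the correct pairing is more similar than the swapped pairing.
    Assumption: similarities within a pairing are summed."""
    matched = cosine(pred1, true1) + cosine(pred2, true2)
    swapped = cosine(pred1, true2) + cosine(pred2, true1)
    return matched > swapped

def leave_two_out_accuracy(actual, predict):
    """Hold out every pair of items once; `predict(train_idx, held_out)` is a
    hypothetical callback returning predicted vectors for the two held-out items."""
    n = len(actual)
    correct = 0
    total = 0
    for i, j in combinations(range(n), 2):
        train_idx = [k for k in range(n) if k not in (i, j)]
        pred_i, pred_j = predict(train_idx, (i, j))
        correct += pair_is_correct(pred_i, pred_j, actual[i], actual[j])
        total += 1
    return correct / total

n_pairs = math.comb(60, 2)  # number of ways to leave out 2 of 60 nouns

# Toy check (made up): an "oracle" predictor that returns scaled true vectors.
actual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
oracle = lambda train_idx, held_out: tuple([2.0 * x for x in actual[k]] for k in held_out)
toy_accuracy = leave_two_out_accuracy(actual, oracle)
```

In a real analysis, `predict` would fit the regression on the training rows and return its predictions for the two held-out nouns.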
The results show that, despite reducing all of the models to 25 dimensions to match the number of dimensions used by Mitchell et al. (2008), there was considerable variability in how the models performed in terms of prediction accuracy. Additionally, most of the models performed at above-chance accuracy, with the exception of all nondocument high-dimensional models and models using very few components. One important question is whether different models capture different types of semantic information, or whether all of the models are capturing more or less of the same semantic information. Evidence for “different semantic information” could be reflected in improved prediction in different brain areas or accounting for different portions of the variance in the same brain area. In this study, we are particularly interested in the former case. We believe these results are suggestive of the possibility that different brain regions may represent semantic information in slightly different ways.
Perhaps the most surprising result is the different pattern of performance observed for the document model relative to all of the other models. Although the other models do differ in performance, the overall pattern is relatively consistent. Only the document models show an increase in performance at 600 and 1,000 dimensions relative to 150 and 300 dimensions. As we discussed in the introduction, the nearest-neighbor analysis by Fyshe et al. (2013) demonstrated that the document model tended to group together words whose meanings were more distant from one another, as compared with the dependency model. In many ways, the document model seems to produce semantic associations that are at least partially consistent with theories about coarse coding in the RH. However, whether there are spatially meaningful relationships between different brain regions and different semantic space models is a question that has received little attention, especially as it pertains to the coarse-coding hypothesis, which we examine in Study 2.
Although Study 1 is suggestive of the role of different types of semantic information in predicting brain activation in fMRI data, the selection of voxels used by Mitchell et al. (2008) was not theoretically motivated. Instead, voxel selection was entirely data-driven, so it is not possible to make spatially specific inferences from the results of Study 1. In the next study, we wanted to assess the underlying brain networks for different types of semantic information—namely, coarse versus narrow semantic information, where we define coarse semantic information as being loosely related to a word’s core meaning, and narrow semantic information as being closely related to a word’s core or primary meaning. Although indirect behavioral, computational, and fMRI evidence exists in support of the psychological distinction between these types of semantic information (see discussion in the introduction), no evidence has shown that neuroimaging data can be directly related to independently generated representations of each type of semantic information. Establishing a direct link between these distinct forms of semantic information in computational and neuroimaging data would provide the most direct evidence of the neurocognitive reality of theories for the narrow versus coarse semantic-coding processes.
In Study 2, we sought to answer the following research question: Does the RH semantic network show higher similarity to computational models that have coarse semantic properties than the LH language network does? To assess this question, we operationalize coarseness in terms of the semantic models in the following ways. First, we assume that document models produce coarser semantic information than dependency models. Second, we assume that models that have greater word sense ambiguity (i.e., models that include phrases and/or multiple parts of speech) produce coarser semantic information than models that are constrained to specific parts of speech. In fact, this definition is quite similar to the definitions used in early behavioral experiments that primed homographs with dominant and subordinate associations (Burgess & Simpson, 1988). Finally, we also explore the possibility that models with higher numbers of dimensions produce coarser semantic information than models with lower dimensionality. This final assumption is based on the document by dimensionality interaction observed in Study 1 (see Fig. 2) and on recent debates about whether higher dimensions in dimensionality reduction methods capture noise or actual semantic structure (Murphy et al., 2012).
The methods in this study are based on the methods used by Anderson et al. (2015). The main assumption behind these methods is that the semantic structure of a particular brain region or a computational model can be summarized by a correlation matrix of the word representations. For a brain region, this correlation matrix is the correlation of the average pattern of voxel activity for all pairs of words, 60 × 60 in the case of the data from Mitchell et al. (2008). For a computational model, this correlation matrix is the pairwise correlation of all of the values of the semantic features (in our case the cosine similarities between the verbs and nouns). The correlation matrices produced by different semantic models can then be compared with respect to their correlations with the matrix of brain correlations. If one model consistently correlates higher than another model with one brain region across participants, then it is arguable that this particular model represents words more similarly to that particular brain region than another model. The models and the fMRI data were the same in this study as in Study 1.
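The correlation matrix that summarizes a set of word representations can be sketched as below. This is an illustrative sketch with made-up values; `pearson` and `correlation_matrix` are hypothetical helper names, and each row stands for one word's representation (voxel activity for a brain region, or semantic features for a model).

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_matrix(rows):
    """Word-by-word matrix of pairwise correlations between representations."""
    return [[pearson(a, b) for b in rows] for a in rows]

# Toy data (made up): 3 words, each described by 3 features.
words = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
corr = correlation_matrix(words)  # 3 x 3; would be 60 x 60 for the 60 nouns
```

Because the matrix depends only on the relations among words, matrices from a brain region and from a computational model have the same shape and can be compared directly, even though the underlying representations differ in dimensionality.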
Brain regions were selected according to a meta-analysis of the extended language network (Ferstl, Neumann, Bogler, & von Cramon, 2008), which identifies coordinates that represent the point of most likely activation for fMRI contrasts that correspond to different levels of language processing. The study examined four fMRI contrasts: language versus rest, language versus nonlinguistic stimuli (including pseudolinguistic stimuli), coherent versus incoherent, and a more heterogeneous group of contrasts related to integration of semantic information. The voxel coordinates in Ferstl et al.’s study were reported in Talairach coordinates; however, the data from Mitchell et al. (2008) were in Montreal Neurological Institute 152 (MNI152) coordinates. We used a MATLAB function called tal2mni to convert the Talairach coordinates in the Ferstl et al. study to MNI coordinates, and we used the Automated Anatomical Labeling atlas to label the region of each voxel. We only looked at regions that were in the bilateral inferior frontal gyri, the bilateral superior temporal gyri, and the bilateral middle temporal gyri. The Appendix lists the original Talairach coordinates and converted MNI coordinates as well as their labels.
We considered three different ways to select voxels that would correspond to a region with the reported voxels at the center, each including a larger number of voxels than the previous one. First, we considered a 5 × 9 × 9 prism with a group of eight voxels surrounding the central voxel, such that it was 9 mm high, 9 mm wide, and 5 mm thick. The next largest voxel cluster was a 15 × 9 × 9 prism, similar to the previous one but 15 mm thick. The final voxel cluster was a sphere with a 15-mm diameter (five voxels in the x and y directions and three voxels in the z direction). We found a general trend that larger regions produced better results; therefore, in the remainder of the study, we report results only from the largest voxel clusters.
There was one correlation matrix for every brain region and one correlation matrix for each of the 36 models. Then, correlation matrices that were created on the basis of the voxel activity from different points in the same brain region were averaged together. These were then averaged across participants. Finally, these correlation matrices were converted to single vectors with 3,600 (60 × 60) dimensions. Each of the correlation vectors corresponding to each of the 36 models was correlated with each of the averaged correlation vectors for the six different brain regions of interest, producing 216 data points (36 per region).
The 36 correlations for each of the LH brain regions were subtracted from the 36 correlations for the homologous RH brain region. In the end, we had three vectors of 36 values, corresponding to the difference in correlations between the right and left inferior frontal gyri, the right and left superior temporal gyri, and the right and left middle temporal gyri. Values greater than 0 indicated that a particular model correlated more highly with the RH brain region, and negative values indicated that a particular model correlated more highly with the LH brain region.
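The flattening, model-to-region correlation, and right-minus-left subtraction can be sketched as follows. The matrices are made-up 3 × 3 examples rather than the 60 × 60 matrices of the study, and the function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def flatten(matrix):
    """Turn an n x n correlation matrix into a single vector of n * n values."""
    return [value for row in matrix for value in row]

def hemisphere_difference(model, rh_region, lh_region):
    """Model-RH correlation minus model-LH correlation.
    Positive values mean the model resembles the RH region more closely."""
    m = flatten(model)
    return pearson(m, flatten(rh_region)) - pearson(m, flatten(lh_region))

# Toy correlation matrices (made up) for one model and one pair of homologous regions.
model = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
rh = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
lh = [[1.0, 0.1, 0.8], [0.1, 1.0, 0.5], [0.8, 0.5, 1.0]]
diff = hemisphere_difference(model, rh, lh)
```

Repeating this for all 36 models and each of the three pairs of homologous regions would give the three vectors of 36 difference scores described above.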
We conducted three statistical tests on each of these vectors (nine tests total). Thus, the Bonferroni-corrected significance cutoff for an alpha of .05 is approximately .0055. First, for each bilateral set of homologous brain regions, we tested whether or not the average difference in correlation with the 12 document models was greater than the average difference in correlation with the 12 dependency models using a one-sided paired t test. Second, we tested whether there was a linear effect of dimensionality using linear regression. We coded the regression design matrix with 36 rows and two columns using the values 1, 2, 4, and 6.66 for the different values of the dimensions in the second column (150, 300, 600, and 1,000 divided by 150, or equivalently, 300, 600, 1,200, and 2,000 divided by 300 for the concatenated vectors). The first column was a column of all 1s, for the intercept. We regressed the difference in region–model correlation values between right and left homologous brain regions onto this design matrix, and conducted a two-sided t test on the significance of the beta coefficient for the slope. Finally, we compared the difference in region–model correlations between right and left homologous brain regions according to different ambiguity constraints using a one-way analysis of variance (ANOVA), coding the groups categorically as using (a) all parts of speech and phrases, (b) all parts of speech and no phrases, and (c) no phrases with only appropriate parts of speech.
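The Bonferroni cutoff and the coding of the dimensionality predictor can be checked with a small sketch. It covers only the arithmetic and the least-squares fit; the t test on the slope is not implemented here, the response values are made up, and the function names are hypothetical.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance cutoff under a Bonferroni correction."""
    return alpha / n_tests

def ols_slope_intercept(x, y):
    """Least-squares fit of y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

cutoff = bonferroni_cutoff(0.05, 9)  # 0.05 / 9 = 0.00555...

# Dimensionality coded relative to the smallest setting: 1, 2, 4, 6.666...
dims = [150, 300, 600, 1000]
codes = [d / 150 for d in dims]

# 36 rows: each code appears once per combination of
# 3 vector types x 3 ambiguity constraints.
x = [code for code in codes for _ in range(9)]
design = [[1.0, value] for value in x]  # intercept column plus coded dimensionality

# Made-up response with a known slope (0.5) and intercept (1.0), to check the fit.
y = [1.0 + 0.5 * value for value in x]
slope, intercept = ols_slope_intercept(x, y)
```

In the actual analysis, `y` would hold the 36 right-minus-left correlation differences for one pair of homologous regions.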
Table: One-tailed, paired t test of the difference in model type correlations in the bilateral superior temporal gyri (columns: Mean Corr. Diff., Prob > t)
Table: One sample, paired t test of the difference in model type correlations in the bilateral inferior frontal gyri (columns: Mean Corr. Diff., Prob > t)
Table: One-way ANOVA of the correlation differences between the left and right middle temporal gyri (column: Prob > F)
These effects are partially consistent with our predictions regarding the theoretical relationships between ambiguity constraints, documents and dependencies, and coarse semantic coding. Both model type and word ambiguity were significant predictors of how the correlation matrices of word representations in the RH and LH correlated with word representation correlation matrices generated by computational models. However, no significant linear effects of dimensionality were found that could explain these differences. Our results provide convergent evidence for the coarse coding hypothesis and its laterality predictions (Beeman, 1998; Jung-Beeman, 2005). Finally, our results also validate the meta-analysis on the extended language network (Ferstl et al., 2008), suggesting that it provides a useful map for locating semantically relevant brain regions in MNI152 space, despite some inconsistency due to the need to convert the coordinates from Talairach space.
In this article, we have reported two studies using computational semantic space models to identify brain representations of word meanings. In the first study, we investigated the effects of dimensionality, ambiguity constraints, and model type on the accuracy with which semantic space models predict fMRI data. Our findings indicated that at most dimensionalities, models that use document windows outperformed both models based on dependency co-occurrence statistics and hybrid, concatenated models. Furthermore, our analyses showed that models that control for the part of speech outperformed models based on single-word representations that include all parts of speech and phrasal information.
Our data from Study 1 also indicated a model type by dimensionality interaction. Both the dependency models and the concatenated models tended to perform better at lower numbers of dimensions, around 150 and 300. This is consistent with what has been reported in semantic space modeling research suggesting that the “magic” optimal dimension is around 300 (Landauer & Dumais, 1997; Lund & Burgess, 1996). However, the document models performed best at higher numbers of dimensions peaking at 600 but also maintaining good accuracy at 1,000 dimensions. Previous research on predicting fMRI data with semantic space models also found better performance at higher numbers of dimensions using different methods (Murphy, Talukdar, & Mitchell, 2012). This interaction suggested that despite using very similar semantic features across models, there were differences in the semantic information that was captured by each model.
In Study 2, we examined whether the differences in the semantic information captured by these models could be linked to different brain regions in the right and left hemispheres that had been previously linked to different types of semantic processing, especially with regard to the narrow versus coarse coding for LH versus RH, respectively (e.g., Jung-Beeman, 2005). We found consistent differences across nine participants in how different semantic space models correlated with the representation of concrete nouns in the bilateral inferior frontal gyri, the bilateral superior temporal gyri, and the bilateral middle temporal gyri. These regions in the RH are theorized to play an important role in coarse semantic processing, whereas the LH regions are thought to represent semantic information more narrowly (i.e., more fine-grained linguistic information). On the basis of previous work and the results of Study 1, we predicted that coarser information would be captured by (a) models that use document co-occurrence statistics instead of dependency parsed co-occurrence statistics, (b) models that include more ambiguity in the single-word representations, and (c) models with higher numbers of dimensions. We found preliminary evidence for the first two predictions but not the last. To our knowledge, this is the first time coarse semantic information has been linked to specific computational model parameters at a neurological level.
An implication of the findings in our study is that the same piece of semantic information may look subtly different in different brain regions. All of our models used the same semantic features that corresponded to actions that we perform on concrete objects (Mitchell et al., 2008). Despite this, there was systematic variance in how different models correlated with different brain regions. This suggests that different brain regions in the language network may be fine-tuned to certain types of semantic associations and that these associations are simultaneously active even when processing single words outside of any meaningful context.
One limitation of this study is that there is no straightforward relationship between the available computational model parameters and coarse versus narrow semantic information. Previous work validated the existence of coarse versus narrow semantic processing of meaning both computationally (Jones, Willits, & Dennis, 2015; Rubin et al., 2014; Sahlgren, 2006) and neuropsychologically (Atchley et al., 1999; Beeman, 1998; Burgess & Simpson, 1988; Jung-Beeman, 2005; Tompkins, Fassbinder, et al., 2008; Tompkins, Scharp, et al., 2008; Virtue et al., 2006). Our results suggest some degree of similarity between the computational models and the brain in how these distinct types of meaning are represented. Future computational research needs to establish a theory and practice for developing and evaluating semantic space models that mimic behavioral and neurological findings related to coarse semantic information. Although coarse semantic information has received considerable attention in the cognitive-psychology and neuroimaging literatures, it has yet to receive the kind of rigorous research on computational implementation that narrow semantic information has seen over the past decades.
In general, using computational models in conjunction with neuroimaging data can provide independent and convergent evidence for debates that center around semantic or conceptual representations in ways that complement theories based on neuroimaging data alone, or theories that only look at brain-behavior correlations. Theories of representation relying on neuroimaging evidence alone suffer from the problem of multifunctionality of most brain regions. It is, in most cases, extremely difficult to control for the entire range of variables in an experiment that uses language as stimuli. Thus, it is difficult to pinpoint what kind of information a given brain region might be processing on the basis of only the degree or level of activity or activation. This problem can be mitigated through brain-behavior correlations, but behavior itself still has a one-to-many mapping from a particular behavior to how that behavior might be implemented at the neural level. To overcome this problem, computational models such as the ones used in Anderson et al. (2015) and in our present study can be used to make specific predictions at a representational level. If it can be shown that one computational model better captures the variation in fMRI for a particular set of stimuli, then it is arguable that this computational model provides a better account of how a particular brain region actually represents those stimuli.
A significant challenge facing researchers in recent years is how to build bridges between computational modeling results and neuroimaging findings. Models based on computational data and models based on neuroimaging data are rapidly growing. There is an urgent need for the integration of the two, and more specifically for more and improved data sets on which to evaluate how semantic space models relate to fMRI data. The only data set to date has only nine participants, some of whose data show considerable noise due to movement (Mitchell et al., 2008). Additionally, Bullinaria and Levy (2013) have suggested an improved set of words that have better clustering properties than the original set. We believe that an improved data set would be a fruitful endeavor for open research and may allow researchers to ask many new and important questions about how the brain processes semantic information.
We thank Tom Mitchell and Alona Fyshe for making their data available and for promoting open research. The modeling work was conducted with the Advanced CyberInfrastructure computational resources provided by the Institute for CyberScience (http://ics.psu.edu) at Pennsylvania State University. This research was supported by a grant from the National Science Foundation (NCS-FO#1533625) to P.L.
- Beeman, M. (1998). Coarse semantic coding and discourse comprehension. In M. Beeman & C. Chiarello (Eds.), Right hemisphere language comprehension: Perspectives from cognitive neuroscience (pp. 255–284). Hillsdale, NJ: Erlbaum.
- Callan, J., & Hoy, M. (2009). The ClueWeb09 Dataset. Available from lemurproject.org/clueweb09/
- Crutch, S. J., Troche, J., Reilly, J., & Ridgway, G. R. (2013). Abstract conceptual feature ratings: The role of emotion, magnitude, and other cognitive domains in the organization of abstract conceptual knowledge. Frontiers in Human Neuroscience, 7, 186. doi:10.3389/fnhum.2013.00186
- Fyshe, A., Talukdar, P., Murphy, B., & Mitchell, T. (2013, August). Documents and dependencies: An exploration of vector space models for semantic composition. Article presented at the International Conference on Computational Natural Language Learning (CoNLL), Sofia, Bulgaria.
- Hall, J., Nilsson, J., Nivre, J., Eryigit, G., Megyesi, B., Nilsson, M., & Saers, M. (2007, June). Single malt or blended? A study in multilingual parser optimization. Article presented at the CoNLL Shared Task Session of EMNLP-CoNLL, Prague, Czech Republic.
- Jones, M., Willits, J., & Dennis, S. (2015). Models of semantic memory. In J. R. Busemeyer, Z. Wang, J. T. Townsend, & A. Eidels (Eds.), The Oxford handbook of computational and mathematical psychology (pp. 232–254). New York, NY: Oxford University Press.
- Li, P., Burgess, C., & Lund, K. (2000). The acquisition of word meaning through global lexical cooccurrences. In E. V. Clark (Ed.), Proceedings of the Thirtieth Stanford Child Research Forum (pp. 167–178). CSLI: Stanford, CA.
- Murphy, B., Talukdar, P., & Mitchell, T. (2012, June). Selecting corpus-semantic models for neurolinguistic decoding. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM) (pp. 114–123). Stroudsburg, PA: Association for Computational Linguistics.
- Rohde, D., Gonnerman, L., & Plaut, D. (2006). An improved model of semantic similarity based on lexical co-occurrence. Communications of the Association for Computing Machinery, 8, 627–633.
- Rubin, T., Kievit-Kylar, B., Willits, J., & Jones, M. (2014). Organizing the space and behavior of semantic models. In M. Knauff, M. Pauen, N. Sebanz, & I. Wachsmuth (Eds.), Cooperative minds: Social interaction and group dynamics. Proceedings of the 35th Annual Meeting of the Cognitive Science Society (pp. 1329–1334). Cognitive Science Society: Austin, TX.
- Sahlgren, M. (2006). The word-space model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces (Unpublished doctoral dissertation). Department of Linguistics, Stockholm University.
- Tomar, G., Singh, M., Rai, S., Kumar, A., Sanyal, R., & Sanyal, S. (2013). Probabilistic latent semantic analysis for unsupervised word sense disambiguation. International Journal of Computer Science Issues, 10, 1694–0784.
- Zinszer, B., & Li, P. (2010). A SOM model of first language lexical attrition. In S. Ohlsson & R. Catrambone (Eds.), Proceedings of the 32nd Annual Conference of the Cognitive Science Society (pp. 2787–2792). Austin, TX: Cognitive Science Society.