1 Introduction

What does it mean to understand text? The problem of meaning has occupied philosophers from the beginning of systematic speculative thinking (Landauer 2011). Any AI attempting to extract meaning from written texts would at a minimum need to distinguish between different kinds of works. Recognition of the genre, style, and content of texts constitutes a first step. There is already a sizeable literature on “natural language processing” (NLP), the automated extraction of information from text. Others have surveyed this literature (Foltz 1998; Gomaa and Fahmy 2013; Mikolov et al. 2013a, b; Shiffrin and Börner 2004), and no effort will be made here to catalogue every approach. Examples include “probabilistic topic models” such as latent Dirichlet allocation (LDA), which use Bayesian methods to extract both topics and structural relationships from underlying bodies of text (Steyvers and Griffiths 2011; Blei 2012), and neural networks that can be applied to a variety of NLP tasks (Collobert and Weston 2008). Modern “stylometry”, the study of linguistic style, applies statistical methods to questions of authorship attribution.

Perhaps the simplest, and one of the best-established, NLP systems is “latent semantic analysis” (LSA). LSA can be implemented with off-the-shelf software and ordinary computational hardware. It is scalable, and it has a record of success in autonomous learning, essay grading, diagnosing schizophrenia, and information retrieval. LSA has been used for similarity analysis of the titles of scientific papers, to show a decline in international cooperation and research productivity after 1914 (Iaria et al. 2017). Computer systems with these capabilities are far from “understanding” or “comprehending” the texts they analyze, but they mimic many human capabilities. As for the potential of LSA, one of its leading proponents has asked:

Suppose we have available a corpus of data approximating the mass of intrinsic and extrinsic language-relevant experience that a human encounters, a computer with power that could match that of the human brain, and a sufficiently clever learning algorithm and data storage method. Could it learn the meanings of all the words to any language it was given? (Landauer 2011, p. 4).

In the present paper, terms such as “M-comprehension” or “M-understanding” will be used to indicate the capabilities of actually functioning computers. There is no doubt that strong pattern-recognition capacity can be achieved with existing hardware and software, but how finely can an AI using LSA identify differences and similarities between book-length texts? I propose to test whether an LSA-equipped AI can make distinctions among significant works of political philosophy, history, and fiction. A modest number of texts were analyzed: a corpus of 100 major works drawn from the list of frequently downloaded books compiled by Project Gutenberg.

In LSA, words and documents are coded as a matrix (a row for each word, a column for each document) which is then condensed to a “semantic space” or “concept space” of lower dimensionality. The element \(t_{ij}\) of the raw word-document matrix equals the number of times word i appears in document j. Entries of the term-document matrix are then weighted so that relatively high weight goes to words that occur frequently in some but not all of the documents, and relatively low weight to words that appear frequently throughout the corpus. There are different ways this can be done, but experience has shown that “log-entropy” weighting performs well; this particular weighting scheme is discussed in Martin and Berry (2011, pp. 37–39), citing Dumais (1991), Salton and Buckley (1991), Letsche and Berry (1997), and Berry and Browne (2005). The weighted word-document matrix will be denoted by A, an m × n matrix with m rows (one for each word) and n columns (one for each document). Unlike most applications of LSA found in the literature, the “documents” in the present paper consist of entire books, not just paragraphs or short passages.
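To make the weighting concrete, the following is a minimal numpy sketch of one standard form of log-entropy weighting (local weight \(\log(1 + t_{ij})\); global weight one plus the normalized entropy of each word’s distribution across documents). The function name and toy counts are illustrative assumptions, not the paper’s actual code.

```python
import numpy as np

def log_entropy_weight(T):
    """Log-entropy weighting of a raw m x n word-document count matrix T.

    Local weight:  log(1 + t_ij).
    Global weight: 1 + sum_j p_ij * log(p_ij) / log(n), where
    p_ij = t_ij / (total count of word i).  Words spread evenly over
    the corpus get global weight near 0; words concentrated in a few
    documents get weight near 1.
    """
    T = np.asarray(T, dtype=float)
    n = T.shape[1]                                    # number of documents
    gf = T.sum(axis=1, keepdims=True)                 # global frequency of each word
    p = np.divide(T, gf, out=np.zeros_like(T), where=gf > 0)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    g = 1.0 + plogp.sum(axis=1) / np.log(n)           # global weight per word
    return g[:, None] * np.log1p(T)                   # weighted matrix A

# Toy illustration: 3 words x 2 documents.
A_toy = log_entropy_weight([[2, 0], [1, 1], [0, 3]])
```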

It might be possible for an AI to use the unreduced word-document matrix to identify similarities between different documents in the corpus. Similarities can be defined in a variety of ways. The simplest is to calculate the cosine between the column vectors representing any pair of texts in the weighted term-document matrix A. The cosine of the angle between two vectors \(\mathbf{s}_1\) and \(\mathbf{s}_2\) in a vector space is given by \(\frac{\mathbf{s}_1 \cdot \mathbf{s}_2}{\|\mathbf{s}_1\| \, \|\mathbf{s}_2\|}\), where the numerator is the dot product of the two vectors and \(\|\mathbf{s}_i\|\) is the ordinary Euclidean norm (length) of \(\mathbf{s}_i\). A pictorial representation of the 100 × 100 table of these cosines is given in Fig. 1, with each of the 10,000 squares at location \((j_1, j_2)\) in the Figure representing the cosine between document \(j_1\) and document \(j_2\).
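The raw-cosine comparison underlying Fig. 1 amounts to normalizing each column of A and taking all pairwise dot products; a sketch, continuing the numpy setup above:

```python
import numpy as np

def cosine_matrix(A):
    """Pairwise cosines between the columns (documents) of A.

    Entry (j1, j2) is the cosine of the angle between the vectors for
    documents j1 and j2; the diagonal is 1 by construction.
    """
    norms = np.linalg.norm(A, axis=0)
    return (A.T @ A) / np.outer(norms, norms)
```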

Fig. 1 Cosines between document pairs (vectors) in the A matrix; darker shade indicates greater similarity

Figure 1 is an easier way of seeing the patterns in the cosine pairs than a 100 × 100 table of numbers printed in a tiny font. The shading in the table goes from lighter (smaller cosines) to darker (larger cosines). Quite clearly, the squares down the main diagonal are darkest, because the cosine of a vector with itself is 1. The average degree of similarity (average cosine) in the whole matrix is 0.207 with standard deviation 0.106. It is worth noting that AIs have been quite successful in image recognition (Li et al. 2012; Strang 2016), so there is no loss of interpretability in presenting the results of cosine calculations in this way. However, similarities and differences across books as shown by the cosines in Fig. 1 are not particularly strong.

2 Measuring similarities by singular value decomposition

A better approach, the one employed in LSA, is first to factor the A matrix by singular value decomposition (SVD) (Martin and Berry 2011; Berry and Browne 2005). SVD splits up the A matrix in a way that makes it easier to identify the concepts or genres that underlie the corpus. Many introductions to SVD are available in the literature, so only an outline of the mathematics is given in Appendix 2. Standard software packages like Mathematica and MATLAB include built-in routines to carry out the SVD calculations. The key step in identifying the strongest similarities involves reducing the information in A to a “concept space” of markedly lower dimension. Even with as few as 2 or 3 dimensions in the concept space, unsupervised computations clearly distinguish the main types of text in the corpus.

The crucial equation in SVD is \(\mathbf{A}_k = \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^{T}\) (see Appendix 2). Here k is the dimension of the concept space, \(\mathbf{A}_k\) is an m × n matrix, \(\mathbf{U}_k\) is an m × k matrix, \(\boldsymbol{\Sigma}_k\) is a k × k diagonal matrix (all off-diagonal elements are zeros), and \(\mathbf{V}_k^{T}\) is a k × n matrix. The diagonal elements of \(\boldsymbol{\Sigma}_k\) are the “singular values”, ranked from largest to smallest. Essentially, SVD “diagonalizes” the A matrix and finds the “right” bases for its associated fundamental subspaces (Strang 2016). Following Martin and Berry (2011), the column vectors of \(\mathbf{V}_k^{T}\), scaled by the corresponding singular values, are the “document vectors”. Here they will be denoted by \(\mathbf{v}_j\), and they are the vectors that will be analyzed for similarity in the concept space. With the 100 texts considered here, j ranges from 1 to 100.
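In code, the truncated SVD and the scaled document vectors take only a few lines; this sketch assumes the weighted matrix A from the earlier sketches, and the helper name is illustrative:

```python
import numpy as np

def document_vectors(A, k):
    """Rank-k truncated SVD of the weighted word-document matrix A.

    Returns the k x n matrix Sigma_k @ V_k^T, whose j-th column is the
    scaled document vector v_j for document j.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s sorted largest-first
    return s[:k, None] * Vt[:k, :]                    # scale rows of V_k^T by singular values
```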

Similarity will be measured as the cosine between document vectors in the reduced space. However, it should first be noted that the lengths of the \(\mathbf{v}_j\) vectors, as well as their first elements, are highly dependent on the sheer length of the texts indexed by j. The correlation between the string lengths of the texts and the lengths of the \(\mathbf{v}_j\) is 0.808, and the correlation between the absolute values of the first elements of the \(\mathbf{v}_j\) and the string lengths is 0.885. The cosine similarity between two dissimilar vectors can be dominated by a single component that is much larger than the others. For example, the cosine between {30, −2, 1} and {20, 1, 1} is 0.9931, and the cosine between {30, 2, 1} and {20, 1, 1} is 0.9997. If the large first components are ignored, the cosine between {−2, 1} and {1, 1} is −0.316, while the cosine between {2, 1} and {1, 1} is 0.949. The large first component makes it seem that the two 3-component vectors are close regardless of whether the second component is 2 or −2, whereas the vectors made up of only the second and third components can be highly dissimilar. SVD is related to principal component analysis (Shirota and Chakraborty 2015; Shlens 2014), and quite obviously the largest variation among the document vectors will be in the direction of the length component. The first weighted row vector in \(\mathbf{V}_k^{T}\) represents primarily the length of the texts. Therefore, if the weighted \(\mathbf{v}_j\)’s, excluding their first components, are projected onto a lower-dimensional subspace, it is possible to visualize similarities or differences of the most important concepts other than length. Figure 2 shows the cosines between pairs of vectors made up of the second and third components of the columns of \(\mathbf{V}_k^{T}\) for k = 3.
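Dropping the length-dominated first component and re-computing cosines is then a one-line change; for k = 3 this yields the comparison plotted in Fig. 2 (again a sketch under the assumptions above):

```python
import numpy as np

# V = document_vectors(A, k); each column is a scaled document vector.
def reduced_cosines(V):
    """Cosines between document vectors after discarding the first
    (length-dominated) component of each vector."""
    W = V[1:, :]                                   # keep components 2..k
    norms = np.linalg.norm(W, axis=0)
    return (W.T @ W) / np.outer(norms, norms)

# With k = 3, reduced_cosines compares the (second, third)-component
# vectors of each pair of books, as in Fig. 2.
```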

Fig. 2 Graphic representation of cosines between column vectors in \(\mathbf{V}_k^{T}\), dropping the first component of each vector, k = 3

Although the complexity of the concepts contained in the full corpus is not captured by this reduced space, a sharp discrimination among the 100 books is possible: the non-fiction works are distinct from the novels. In Fig. 2, the color scheme goes from “hot” (red) to “cold” (blue) as the cosine decreases from +1 towards −1. The books have been numbered from 1 to 100, with the first fifty being the works of political philosophy, economics, and history and the second fifty being works of fiction. This ordering was done to facilitate exposition and to make the patterns clear to human readers, but it would not be necessary for an AI’s M-comprehension of the texts; the AI could easily do its own ordering based on the cosine similarity measures. Book numbers are shown in Appendix 1, and points on the axes in Fig. 2 correspond to the book numbers. The numbered tick marks are in intervals of 20 from 1 to 100.

In Fig. 2 the books split cleanly into the fiction and non-fiction groups, with each block having high within-group similarity and low out-of-group similarity. The political/philosophical/historical books are the red–orange rectangle in the upper left, while the novels are in the red–orange block in the lower right. The blue off-diagonal blocks indicate that the cosines between books in the two different groups are negative. The contrast between within-group similarity and out-of-group similarity can be summarized by the averages of the cosines in the different blocks of Fig. 2. Table 1 also contrasts the strength of the fiction/non-fiction distinction found with SVD (Fig. 2) with the relatively mild differentiation in the blocks shown in the depiction of raw cosines (Fig. 1).

Table 1 Cosine averages and standard deviations for blocks of books

The difference between the fiction and non-fiction books could hardly be clearer. The within-group cosines are almost always close to 1, while the across-group cosines are almost all negative. Even the exceptions are informative. Among the non-fiction works, Carlyle’s The French Revolution (book #43) has a negative cosine when compared to the other non-fiction books. Indeed, Carlyle’s style is novelistic (Hindley 2009). At the same time, among the works of fiction, Swift’s A Modest Proposal (book #88) is an outlier. But of course, the “modest proposal” was a vicious satire, suggesting that the problem of poverty in Ireland could be solved by cannibalizing the island’s 1-year-old children: the horror of butchering and eating babies is presented as though it were a serious policy proposal. In other words, A Modest Proposal was meant to read as if it were non-fiction.

More instances of the ability of the three-dimensional reduced space to pick out unusual books can be seen by expanding the non-fiction and fiction blocks, as is done in Figs. 3 and 4. Consider Fig. 3, the graphic representation of the cosines between the non-fiction books (numbered tick marks are in intervals of 10). In addition to the Carlyle history (#43) already pointed out, it seems that the more modern historical works show somewhat weaker similarity to the ancient historians and the political philosophers. The vectors associated with Grant’s Memoirs, Churchill’s River War, and March and Beamish’s History of the World War have more than a few negative cosines, but without any dominant pattern.

Fig. 3 Graphical representation of cosines for pairs of non-fiction works

Fig. 4 Graphical representation of cosines for pairs of novels (add 50 to axis numbering to match the numbering of works in Appendix 1)

Figure 4 shows the cosines between novels. (Note that the automatically generated tick marks of Fig. 4, numbered in intervals of 10 from 1 to 50, need to have 50 added to match the numbering of the books in Appendix 1.) In addition to the anomalous Modest Proposal (#88), it is also clear that Joyce’s Ulysses (#84) is an outlier. The average cosine between Ulysses and the other novels is 0.158, while almost all the other cosines are greater than 0.8. (The third-lowest fiction average cosine is Swift’s Gulliver’s Travels at 0.627.) Of course, Ulysses is quite different from typical works of fiction because of its “stream of consciousness” structure (or lack thereof).

Not much additional discrimination among the works shows up as the number of dimensions of the concept space is increased to 4 or 5. However, if the document vectors are projected into higher-dimensional spaces, finer distinctions among the different works are possible. Instead of k = 3, consider an SVD with k = 50. Again, the first component of each of the \(\mathbf{v}_j\) vectors is ignored to reduce the influence of length on the closeness measure. The 100 × 100 plot of this cosine matrix is shown in Fig. 5.

Fig. 5 Cosine plot for all 100 works, k = 50

Once again, the similarity measures fall into blocks. Books within the fiction block show positive similarity (orange-colored squares). As before, the cosines between non-fiction and fiction books are negative (bluish-colored squares), with a few exceptions: both of Swift’s satires are similar to many of the non-fiction books.

More interestingly, within the non-fiction block the histories show less similarity to the books that are pure philosophy or political theory. This is illustrated in Fig. 6.

Fig. 6 Cosine plot for non-fiction works, k = 50

The histories begin with Titus Livius I (#33) and continue through March and Beamish’s History of the World War (#50). The frequency of negative cosines with the pure political theory and philosophy works is clearly greater for the histories. It also seems that the ancient histories (Titus Livius (#33) through Gibbon II (#42)) show a degree of similarity to each other, as do the “modern” histories after Carlyle (Grant (#44) through March and Beamish (#50)), although these similarities are not particularly strong. The strongest similarities in Fig. 6, however, are the ones between the works of political theory and philosophy (considered as a group). The only strong negative cosines in the upper left corner are between Machiavelli (#1 and #2) and the “modern” economists—Ricardo (#29), Jevons (#30), Veblen (#31) and Keynes (#32)!

The fiction works also show finer distinctions than in the 3-dimensional case. The books were arranged roughly from the least recent to the most recent, except that the fantasy/science fiction novels are grouped together at the bottom of the list. Figure 7 shows that, in general, the more recent the novel, the greater its similarity with the other novels in the fiction group.

Fig. 7 Cosine plot of novels, k = 50

In Fig. 7, the coloration becomes darker (greater similarity) moving from the upper left corner (the earliest works) down to the lower right corner (most recent works). But it is also clear that the fantasy and science fiction books can be picked out—these are the works after Ulysses (#84; axis point 34 after subtracting 50).

What is seen in Figs. 5, 6 and 7 is that the SVD model can identify sub-categories within the two larger non-fiction and fiction groups. This is because the vectors projected into the 50-dimensional subspace contain more information about the books in the corpus than when only three dimensions are retained. However, the patterns of the 50-dimensional SVD are essentially unchanged if the document vectors are projected into the 100-dimensional space that is the maximum obtainable with this corpus of 100 works. Increasing the number of dimensions past a certain point provides no improvement in the resolution of underlying concepts. This is consistent with the observation that in LSA analyses “[t]he number of dimensions retained in LSA is an empirical issue” (Landauer et al. 1998, p. 269).
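One simple empirical diagnostic for the choice of k (offered here as an illustrative check, not a calculation reported in the paper) is to inspect how quickly the singular values decay, i.e., how much of the total squared singular value “mass” the first k dimensions capture:

```python
import numpy as np

# A is the weighted word-document matrix from the sketches above.
def cumulative_variance(A, ks=(3, 50, 100)):
    """Fraction of total squared singular-value mass captured by the
    first k dimensions, for several candidate values of k."""
    s = np.linalg.svd(A, compute_uv=False)        # singular values, largest first
    frac = np.cumsum(s**2) / np.sum(s**2)
    return {k: float(frac[min(k, len(s)) - 1]) for k in ks}
```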

3 Discussion

To check the robustness of the results, a comparable sample of 100 texts was created with random words selected from the same “dictionary” that encompassed all the words in the books of Appendix 1. Four groups of random “books” were created: 25 with length 20,000 words, 25 with length 50,000, 25 with length 100,000, and 25 with length 500,000. The mean number of words of the books in the main corpus is 153,666, while the mean number of words for the books in the random corpus is 167,500, so the two sets of texts are roughly comparable in size. The first singular value for the weighted word-document matrix of the random texts is 17 times larger than the second singular value. The remaining 99 singular values drop off very slowly, decreasing by only a factor of about 2 from the second to the 100th. If the first component of the document vectors (the one that is correlated with the length of the documents) is dropped, the resulting projection of vectors composed of the second and third components onto the reduced concept subspace shows a random pattern of cosines. The pictorial representation of this lack of pattern is shown in Fig. 8. The average cosine between vector pairs in Fig. 8 is 0.005, with a standard deviation of 0.711.
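A sketch of how such a random control corpus might be generated (the vocabulary size and the uniform sampling over the dictionary are assumptions; the paper does not specify the sampling distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 50_000   # illustrative; the paper draws from all words in the real corpus
lengths = [20_000] * 25 + [50_000] * 25 + [100_000] * 25 + [500_000] * 25

# Build the raw word-document count matrix for 100 random "books";
# it then feeds the same log-entropy weighting and SVD pipeline.
T_random = np.zeros((vocab_size, len(lengths)))
for j, length in enumerate(lengths):
    draws = rng.integers(0, vocab_size, size=length)           # uniform word draws
    T_random[:, j] = np.bincount(draws, minlength=vocab_size)
```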

Fig. 8 Cosine plot of 100 random-word documents, length component dropped, k = 3

Returning to the main results, why is it that the SVD encodes information about the documents more efficiently than the simple comparison of raw cosines? There is no doubt that the reduction in dimensions reduces “noise” present in the sample of documents. The SVD picks out vectors in the reduced space that correspond to the directions of greatest variance, so elimination of all but the first few components will eliminate some of the noisy elements of the texts. It must be admitted, however, that exactly what the SVD is finding is somewhat mysterious. As one group of researchers put it:

At this point, there is a great deal of uncertainty about what is being represented in the K-dimensional spaces of LSA. One optimistic possibility is that the K dimensions reflect ontological categories, semantic features, and structural compositions of mental models that would be directly adopted in structural theories of world knowledge representation….[Nevertheless,] [v]ery few researchers would go out on the limb and propose an elegant mapping between the K dimensions of LSA and sophisticated theories of world knowledge. However, most researchers would seriously entertain the possibility of weaker correspondences (Graesser et al. 2000, p. 2, references omitted).

The difficulty would seem to stem from the problem of how the meanings of words are determined, a philosophical question that goes back to Plato (Landauer 2011).

From a broader perspective, one long-term goal of the AI project is teaching an AI to M-comprehend a wide variety of texts. One way of doing this would be to let the AI read voraciously. It is possible that, using the kind of procedure outlined in this paper, an AI would be able to distinguish works of moral and political philosophy from the welter of information that has been digitized, and thereby would be exposed to the full range of literature on human moral systems. Of course, this wide-ranging input would not solve the dilemma posed by the fact that humans do not agree about morality, but it would provide plenty of input for M-thinking about moral issues. This could be a first step in the development of M-ethics in AIs (DeCanio 2017). Other approaches to imparting moral values to AIs are possible, too. Guarini (2006) trained a neural network with case-based moral reasoning, and has argued that “aspects of duty can be preserved for machine ethics” (Guarini 2012, p. 434). Wallach and Allen (2009) and the edited collection by Lin et al. (2014) have explored these issues. Unsupervised computational methods have even been brought to bear on matters of Biblical scholarship (Hu 2012).

Any of the distinctions found among the 100 works analyzed in this paper could have been discovered by an unaided AI. The pictorial grouping into fiction and non-fiction works in Figs. 2 and 5 was done to make it easier for a human reader to see the patterns in the concept space. An AI could simply have used the cosine similarity measure to come up with rankings that would reveal the differences. The AI would have found that the non-fiction books are similar, as are the fiction books, but that each of these groups is dissimilar to the other (with the exceptions noted above). With a higher-dimensional concept space, additional distinctions could be drawn. This kind of classification is an initial step towards M-comprehension. It seems plausible that with larger numbers of works included in the database, it would be possible to make finer distinctions. This is a topic for further research.

Is the classification of works into similarity groups equivalent to genuine understanding? With the small amount of data examined here, certainly not. Regardless of how well an AI can classify, highlight, or extract information, the philosophical dilemma posed by the Turing Test remains unsolved. Possession of capabilities, no matter how sophisticated, is not the same as “thinking,” but as Turing pointed out, the distinction may be less important than it seems. What an unsupervised AI is capable of doing, even with a limited corpus of works, should not be underestimated. It can tell that there is something “off” about Swift’s satires, and that histories do not have the same “feel” as works of pure political philosophy. It can discern the evolution of the novel from its earliest examples (Le Morte d’Arthur and Don Quixote) to the twentieth century, and it can “see” that fantasy/science fiction novels form a similarity group. This is no small achievement for an AI living in an ordinary PC, whose reading list is (so far) only 100 books drawn from Project Gutenberg’s archive. There is every reason to believe that the comprehension capabilities of text-interpreting AIs will grow as their literary horizons expand.