In addition to our efforts to build ontologies for representing the heterogeneity of historical data and to analyze the connectivity of the historical network, a further possible layer of analysis concerns the content itself, represented here by the numerical tables of the Sphaera corpus.
In the following, we present our work on building a large, robust, and transparent similarity model between the approximately 10,000 numerical tables of the Sphaera corpus. The model we have built is a key step towards our overall goal of reconstructing the process of mathematization at the beginning of modern science. So far, comparisons between tables could only be carried out by a skilled historian, and due to the difficulty of carefully and consistently inspecting the hundreds of digits composing a typical table, this approach has been strongly limited in terms of dataset size.
Machine Learning of Table Similarity
Machine learning brings the promise of scaling up the analysis of historical content to much larger corpora, in our case, the whole corpus of 10,000 numerical tables. The vast majority of ML approaches work in an end-to-end fashion [6, 23], where the prediction function is learned from input to output based on output labels provided by domain experts. Due to the prohibitive cost of producing these annotations, we have proposed a bottom-up approach which dramatically reduces the labeling cost by requiring only a few annotations of a random yet representative selection of digits occurring in the table corpus [7]. These annotations serve to learn a simple single-digit ‘neural OCR’. This digit recognition layer is followed by a collection of hard-coded functions that compose the detected individual digits and build in the desired invariances. Specifically, our proposed neural network architecture, which we call ‘bigram network’, consists of three blocks: (1) a convolutional neural network trained to recognize single digits, which is slid over the whole input table to produce a digit activation map; (2) a block that multiplies adjacent locations of the activation map to recognize pairs of adjacent digits (00-99), or ‘bigrams’; (3) a block that pools bigrams spatially to compute a 100-dimensional histogram representing the number of occurrences of each bigram in the table. From this histogram representation, similarity scores between tables can be obtained by computing dot products or distances in histogram space.
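To make these three blocks concrete, the following is a minimal sketch of the pipeline in PyTorch, not the exact architecture of [7]; the single-digit recognizer `digit_cnn` and the 10-channel layout of its activation map are assumptions made purely for illustration.

```python
# Minimal sketch of the bigram network idea, assuming a hypothetical `digit_cnn`
# that maps a grayscale page scan to a 10-channel digit activation map.
import torch

def bigram_histogram(table_img, digit_cnn):
    """table_img: (1, 1, H, W) page scan; returns a (1, 100) bigram histogram."""
    # Block 1: slide the single-digit CNN over the page -> (1, 10, h, w) activation map
    A = digit_cnn(table_img)
    # Block 2: multiply activations at horizontally adjacent locations to score
    # all 100 digit pairs ('bigrams' 00-99) at each position.
    left, right = A[..., :-1], A[..., 1:]
    pairs = torch.einsum('bchw,bdhw->bcdhw', left, right)   # (1, 10, 10, h, w-1)
    # Block 3: pool spatially -> approximate occurrence count of each bigram
    return pairs.sum(dim=(-2, -1)).reshape(1, 100)

def table_similarity(h1, h2):
    # Similarity between two tables as a dot product in histogram space.
    return (h1 * h2).sum()
```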
While the bigram representation discards a lot of (potentially relevant) information, we find that this loss buys robust invariance to the page layout. At the same time, the large number of digits composing each table compensates for the loss and still enables a reliable similarity assessment of the different tables. The similarity model can also be used to support a t-SNE [18] low-dimensional embedding of the dataset, which we show in Fig. 3, and where we can observe that pairs of numerically identical tables are embedded at nearby map locations.
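Such a map can be produced, for example, with an off-the-shelf t-SNE implementation applied to the bigram histograms; in the sketch below, the array `histograms` (one 100-dimensional row per table) and the chosen t-SNE parameters are assumptions for illustration, not the settings used for Fig. 3.

```python
# Illustrative 2D embedding of the 100-dimensional bigram histograms with
# scikit-learn's t-SNE; the input file and parameters are hypothetical.
import numpy as np
from sklearn.manifold import TSNE

histograms = np.load("bigram_histograms.npy")     # (n_tables, 100), hypothetical file
embedding = TSNE(n_components=2, metric="cosine",
                 init="random", perplexity=30).fit_transform(histograms)
# `embedding` has shape (n_tables, 2); tables with similar bigram content
# end up at nearby map locations, as in Fig. 3.
```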
Validation using Explainable AI
Explainable AI has provided machine learning with tools to go beyond common validation procedures, in particular, by revealing to the user which input features contribute most to the model’s prediction. This is especially important when appropriate annotation data for evaluating correct model behavior is not available. For a historical analysis, we are not only interested in a model that predicts well; we also wish to verify that the conclusions drawn from a machine learning model are supported by meaningful data features and not, for example, by confounding variables. Hence, it is desirable to make the model transparent; in particular, the features that support the similarity predictions should be clearly identified.
Building transparency into the machine learning model has been a major focus of recent ML research (e.g. [2, 22]), and well-founded approaches have been proposed to attribute the model’s predictions to the input features. Let \(f:\mathbb{R}^{d}\to\mathbb{R}\) be a prediction model, \(\boldsymbol{x}\in\mathbb{R}^{d}\) the point of interest, and consider a Taylor expansion of the prediction at some well-chosen reference point \(\widetilde{\boldsymbol{x}}\). We can then identify the contribution of each feature \(i=1,\dots,d\) to the prediction from the first-order terms of the expansion:
$$R_{i}=[\nabla f(\widetilde{\boldsymbol{x}})]_{i}\cdot(x_{i}-\widetilde{x}_{i}).\tag{1}$$
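As a minimal illustration of Eq. (1), such first-order contributions can be computed with automatic differentiation; the sketch below assumes a scalar-valued PyTorch model `f` and a chosen reference point `x_ref`, and is not tied to any specific model from this work.

```python
# Hedged sketch of Eq. (1): per-feature relevances from the gradient of a
# scalar prediction model f evaluated at a reference point.
import torch

def first_order_relevance(f, x, x_ref):
    x_ref = x_ref.clone().requires_grad_(True)
    grad = torch.autograd.grad(f(x_ref), x_ref)[0]   # gradient of f at the reference point
    return grad * (x - x_ref.detach())               # R_i = [grad f]_i * (x_i - x_ref_i)
```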
While first-order terms are often suitable to explain the prediction of typical ML classifiers, similarity models are better characterized by the interaction between the variables of the two examples being compared. Relevant information is therefore principally contained in the second-order terms of the Taylor expansion. Denoting by \(s:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}\) some similarity model and by \(\boldsymbol{x}\) and \(\boldsymbol{x}^{\prime}\) the two points being compared, the joint contribution of features \(i\) and \(i^{\prime}\) of these two points to the predicted similarity is given by:
$$R_{ii^{\prime}}=[\nabla^{2}s(\widetilde{\boldsymbol{x}},\widetilde{\boldsymbol{x}}^{\prime})]_{ii^{\prime}}\cdot(x_{i}-\widetilde{x}_{i})\cdot(x^{\prime}_{i^{\prime}}-\widetilde{x}^{\prime}_{i^{\prime}}).\tag{2}$$
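For small inputs, Eq. (2) can likewise be evaluated directly with automatic differentiation; the sketch below computes the mixed second derivatives of a similarity model `s` and is meant only as an illustration of this mathematical starting point, not of the BiLRP procedure introduced next.

```python
# Hedged sketch of Eq. (2): joint relevances of feature pairs (i, i') from the
# mixed second derivative of a scalar similarity model s(x, x').
import torch

def second_order_relevance(s, x, xp, x_ref, xp_ref):
    x_ref = x_ref.clone().requires_grad_(True)
    xp_ref = xp_ref.clone().requires_grad_(True)
    # Differentiate s w.r.t. the first argument, keeping the graph so that a
    # second differentiation w.r.t. the second argument is possible.
    grad_x = torch.autograd.grad(s(x_ref, xp_ref), x_ref, create_graph=True)[0]
    # Mixed second derivatives of s w.r.t. x_i and x'_i' at the reference points
    H = torch.stack([torch.autograd.grad(g, xp_ref, retain_graph=True)[0]
                     for g in grad_x])
    delta = (x - x_ref).detach()
    delta_p = (xp - xp_ref).detach()
    return H * torch.outer(delta, delta_p)   # elementwise product gives R_{ii'}
```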
From this mathematical starting point, we proposed a method called BiLRP [7] that more robustly extracts the contributions of interacting features, and that operates by propagating the similarity score backwards, layer after layer, using purpose-designed propagation rules, until the input pixels are reached. The BiLRP method is itself an extension of the LRP method [3, 20] from first-order to second-order explanations.
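As a hedged sketch of how such a second-order explanation can be organized in practice, the snippet below assumes the common case in which the similarity is a dot product of learned embeddings, \(s(\boldsymbol{x},\boldsymbol{x}^{\prime})=\langle\phi(\boldsymbol{x}),\phi(\boldsymbol{x}^{\prime})\rangle\), and combines per-dimension first-order relevance maps of the two branches; the helper `lrp_explain` stands in for an actual LRP implementation and is hypothetical.

```python
# Hedged sketch: assembling pairwise relevances for a dot-product similarity of
# embeddings by combining per-dimension LRP maps of the two branches.
# `lrp_explain(phi, x, m)` is a hypothetical helper returning the first-order
# relevance map of embedding dimension m for input x.
import torch

def pairwise_relevance(phi, x, xp, lrp_explain, n_dims):
    R = 0
    for m in range(n_dims):
        r = lrp_explain(phi, x, m).flatten()    # relevance of x's pixels for phi_m(x)
        rp = lrp_explain(phi, xp, m).flatten()  # relevance of x''s pixels for phi_m(x')
        R = R + torch.outer(r, rp)              # accumulate over embedding dimensions
    return R   # R[i, i'] ~ joint contribution of pixel pairs to the similarity
```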
Examples of BiLRP explanations are shown in Fig. 4, where pairs of interacting features that most strongly contribute to the similarity score are drawn with a red connection. We compare explanations of our proposed bigram network with those of a simple similarity model built on a vanilla pretrained deep CNN for image recognition (VGG-16 [23]). For the bigram network, we observe that the most relevant interacting features are indeed the bigrams shared between the pages. Since the network applies spatial pooling over the map, red connections can also go from one bigram to the same bigram at a different location, as visible for the bigram ‘12’ in our example. In contrast, for VGG-16 we observe that the predicted similarity is grounded in task-irrelevant features such as borders or other geometric features, e.g. arcs and circles. The bigram network thus bases its predicted similarity score on more meaningful features and should therefore be preferred.