Introduction

Databases of patent applications and academic publications can be used to investigate the process of research and innovation. For example, patent data can be used to identify prolific inventors (Gay et al., 2008) or to investigate whether mobility increases inventor productivity (Hoisl, 2009). However, the names of individuals in large bibliographic databases are rarely distinct, and hence individuals in such databases are not uniquely identifiable. For example, an individual named “Chris Jean Smith” may have patents under slightly different names such as “Chris Jean Smith”, “Chris J. Smith”, and “C J Smith”. There may also be one or more other inventors with patents under the same or similar names, such as “Chris J. Smith” or “Chris Smith”. Thus it is ambiguous which names (and hence patents) should be assigned to which individuals. Resolving this ambiguity and assigning unique identifiers to individuals—a process often referred to as named entity disambiguation—is important for research that relies on such databases.

Machine learning algorithms have been used increasingly in recent years to perform automated disambiguation of inventor names in large bibliographic databases (e.g. (Li et al., 2014; Ventura et al., 2015; Kim et al., 2016)). See Ventura et al. (2015) for a review of supervised, semi-supervised, and unsupervised machine learning approaches to disambiguation. These more recent machine learning approaches have often out-performed more traditional rule- and threshold-based methods, but they have generally used feature vectors containing several pre-selected measures of string similarity as input for their machine learning algorithms. That is, the researcher generally pre-selects a number of string similarity measures which they believe may be useful as input for the machine learning algorithm to make discrimination decisions.

Here we introduce a novel approach of representing text-based data, which enables image classifiers to also simultaneously perform text classification. This new representation enables a supervised machine learning algorithm to learn its own features from the data, rather than selecting from a number of pre-defined string similarity measures chosen by the researcher. To do this, we treat the name disambiguation problem primarily as a classification problem—i.e. we assess pairwise comparisons between records as either matched (same inventor) or non-matched (different inventors) (Trajtenberg et al., 2006; Miguélez & Gómez-Miguélez, 2011; Li et al., 2014; Ventura et al., 2015; Kim et al., 2016). Then, for a given pairwise comparison between two inventor records, our text-to-image representation method converts the associated text strings into a stacked 2D colour image (or, equivalently, a 3D tensor) which represents the underlying text data.

We describe our text-to-image representation method in the “Comparison-map images” section (see Fig. 1 for an example of text-to-image conversion). We also test a number of alternative representations in the “Testing alternative string-maps” section. Our novel method of representing text-based records as abstract images enables image processing algorithms (e.g. image classification networks) to be applied to text-based natural language processing (NLP) problems involving pairwise comparisons (e.g. named entity disambiguation). We demonstrate this by combining our text-to-image conversion method with a commonly used convolutional neural network (CNN) (Krizhevsky et al., 2012), obtaining highly accurate results (F1 99.09%, precision 99.41%, recall 98.76%).

Related work

Inventor name disambiguation studies have often used measures of string similarity in order to make automated discrimination decisions. For example, counts of n-grams (sequences of n words or characters) can be used to vectorise text, with the cosine distance between vectors providing a measure of string similarity (Raffo & Lhuillery, 2009; Pezzoni et al., 2014). Measures of edit distance consider the number of changes required to transform one string into another, e.g. the number of insertions, deletions, or substitutions used in the calculation of Levenshtein distance (Levenshtein, 1966), or of other operations such as transpositions (the switching of two letters) used to calculate Jaro–Winkler distance (Jaro, 1989; Winkler, 1990). Phonetic algorithms, such as Soundex, recode strings according to pronunciation, providing a phonetic measure of string similarity (Raffo & Lhuillery, 2009).
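To make these measures concrete, the following minimal Python sketch computes two of them: a plain Levenshtein edit distance and a character bi-gram cosine similarity. These are textbook formulations written for illustration only, not code from any of the cited studies.

```python
from collections import Counter
from math import sqrt

def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def bigram_cosine(a: str, b: str) -> float:
    """Cosine similarity between character bi-gram count vectors."""
    va = Counter(a[i:i + 2] for i in range(len(a) - 1))
    vb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

print(levenshtein("Chris J. Smith", "Chris Jean Smith"))            # 3
print(round(bigram_cosine("Chris J. Smith", "Chris Jean Smith"), 3))
```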

Measures of string similarity such as these have been used to guide rule- and threshold-based name disambiguation algorithms (e.g. (Miguélez & Gómez-Miguélez, 2011) and (Morrison et al., 2017)). They can also be used within feature vectors inputted into machine learning algorithms. For example, Kim et al. (2016) use such string similarity feature vectors to train a random forest to perform pairwise classification. Ventura et al. (2015) reviewed several supervised, semi-supervised, and unsupervised machine learning approaches to inventor name disambiguation, as well as implementing their own supervised approach utilising selected string similarity features as input to a random forest model.

Two-dimensional CNNs have been used extensively in recent image processing applications (e.g. (Krizhevsky et al., 2012)), and one-dimensional (temporal) CNNs have been used recently as character-level CNNs for text classification (e.g. (Zhang et al., 2015)). Also, neural networks (usually CNNs) have been used previously to assess pairwise comparison decisions—e.g. in the case of pairs of: images (Koch et al., 2015), image patches (Zbontar & LeCun, 2016; Zagoruyko & Komodakis, 2015), sentences (Yin et al., 2016), images of signatures (Bromley et al., 1993), and images of faces (Hu et al., 2014). These networks are generally constructed for multiple images to be provided simultaneously as input, such as in the case of Siamese neural networks where two identical sub-networks are connected at their output (Bromley et al., 1993; Koch et al., 2015).

In this work we generate a single 2-dimensional RGB (stacked) image for a given pairwise record comparison. Thus any image classification network that processes single images can be used (with minimal modification) to process our pairwise comparison images, therefore enabling such neural networks to also simultaneously classify associated text records. We demonstrate this using the seminal “AlexNet” image classification network (Krizhevsky et al., 2012).

Data

We use a combination of two labelled datasets in this work to train the neural network and assess its performance. Each dataset was derived by separate authors from the US National Bureau of Economic Research (NBER) Patent Citation Data File (Hall et al., 2001): a labelled dataset of Israeli inventors (Trajtenberg et al., 2006) (the “IS” dataset), and a dataset of patents filed by engineers and scientists (Ge et al., 2016) (the “E&S” dataset). These datasets were combined with US Patent and Trademark Office (USPTO) patent data as part of the PatentsView Inventor Disambiguation Workshop hosted by the American Institutes for Research (AIR) in September 2015.

Each labelled dataset contains unique IDs (UIDs) that identify all inventor-name records from different patents belonging to each unique inventor. We also extracted several other variables from inventor-name records in the bulk USPTO patent data to use in our disambiguation algorithm: first name, middle name, last name, city listed in address, international patent classification (IPC) codes (i.e. subjects/fields covered by the patent), assignees (i.e. associated companies/institutes), and co-inventor names on the same patent.

Disambiguation algorithm

Our novel inventor disambiguation algorithm involves the following main steps:

  1. Duplicate removal: remove duplicate inventor records.

  2. Blocking: block (or “bin”) all names by last name, and also by first name in some cases.

  3. Generate pairwise comparison-map images: convert text from each within-block pairwise record comparison into a 2D RGB (stacked) image representation.

  4. Train neural network: use 2D comparison-map images generated from manually labelled data to train a neural network to classify whether a given pairwise record comparison is a match (same inventor) or non-match (different inventors).

  5. Classify pairwise comparison-map images: deploy the trained neural network to classify pairwise comparison images generated from the bulk patent data, producing a match probability for each record pair.

  6. Convert pairwise match probabilities into clusters: convert the pairwise match/non-match probabilities generated by the neural net into inventor clusters—i.e. groups of inventor-name records that each belong to a distinct individual inventor. Assigning a UID to each of these groups then leads to a single set of disambiguated inventor names.

Note that the main purpose of the first two steps is to improve computational efficiency. That is, rather than process all possible pairs of patent–inventor records (which has time complexity \({\mathcal {O}}(n^{2})\) for n records), the records are first grouped into similar clusters, or “blocks”, and pairwise comparisons are only made within those blocks. For further detail regarding steps 1 and 2, see “Appendices 1 and 2”. Steps 3–6 are described in detail below.
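As a rough illustration of the blocking idea, the Python sketch below groups records by last name and then generates only within-block pairwise comparisons. The record structure and the simple last-name key are placeholders; the actual duplicate-removal and blocking rules are those described in Appendices 1 and 2.

```python
from collections import defaultdict
from itertools import combinations

def block_records(records):
    """Group records by a simple blocking key (placeholder: lower-cased last name)
    so that pairwise comparisons are only generated within each block."""
    blocks = defaultdict(list)
    for rec in records:
        key = rec["last_name"].strip().lower()  # hypothetical field name
        blocks[key].append(rec)
    return blocks

def within_block_pairs(blocks):
    """Yield every pairwise comparison inside each block."""
    for recs in blocks.values():
        for rec_a, rec_b in combinations(recs, 2):
            yield rec_a, rec_b

records = [
    {"last_name": "Smith", "first_name": "Chris"},
    {"last_name": "Smith", "first_name": "C."},
    {"last_name": "Jones", "first_name": "Linda"},
]
pairs = list(within_block_pairs(block_records(records)))
print(len(pairs))  # only the two "Smith" records are compared -> 1 pair
```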

Comparison-map images

Fig. 1 Constructing a string-map image. The first four images show each sub-map for the example word “JEN”, which are summed to construct the final string-map image (right-most image)

Our intent is to assess all possible within-block pairwise comparisons between patent–inventor records, classifying each comparison as either a match or non-match. To do this, we introduce a new method of converting any string of text into an abstract image representation of that text, which we refer to as a “comparison-map” image. Any image classification neural network can then be used to process these images and hence effectively perform text classification.

To generate a comparison-map image, we first define a specific 2D character layout—i.e. a grid of pixels specifying the position of each letter. The layout of this “string-map” is shown in Fig. 1 (identical in each of the five images). For a given word (e.g. “JEN”), we then add a particular colour (e.g. red) to the pixels of each letter in the word, as well as to any pixels in straight lines connecting those letters. In particular, we add colour to the pixels of the first and last letters (Fig. 1, left-most image), and to all pixels in a line connecting each two-letter bi-gram (Fig. 1, second and third images, which correspond to the two bi-grams in “JEN”, i.e. “JE” and “EN”). For repeated letters, the bi-gram contains two of the same letter, so we add colour only to the pixel corresponding to that letter (e.g. for the name “JENNY”, we would add colour for four different bi-grams: “JE”, “EN”, “NN”, and “NY”).

To highlight the beginning of each string-map, we also repeat the process for the first bi-gram only (“JE”) in blue, rather than red (Fig. 1, fourth image). The final string-map for the word “JEN” is shown in Fig. 1 (right-most image). If we then add the string-map of any other word to the green channel of the same RGB image (with the first bi-gram again highlighted in blue), the resulting image represents the pairwise comparison of the two words (e.g. Fig. 2, right-most image).
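The Python sketch below illustrates this string-map construction. The \(5\times 5\) character grid, the colour increment, and the line-drawing details are illustrative stand-ins only, not the heuristic layout of Fig. 1.

```python
import numpy as np

# Hypothetical 5x5 character grid, for illustration only; the actual layout is
# the heuristic one shown in Fig. 1 ("Z" is simply omitted here).
GRID = ["ABCDE",
        "FGHIJ",
        "KLMNO",
        "PQRST",
        "UVWXY"]
COORDS = {ch: (r, c) for r, row in enumerate(GRID) for c, ch in enumerate(row)}

def add_line(img, p0, p1, channel, amount):
    """Add `amount` of colour to every pixel on an approximate straight line from
    p0 to p1, touching each pixel at most once per call and capping values at 1.0."""
    (r0, c0), (r1, c1) = p0, p1
    steps = max(abs(r1 - r0), abs(c1 - c0), 1)
    pixels = {(round(r0 + (r1 - r0) * t / steps),
               round(c0 + (c1 - c0) * t / steps)) for t in range(steps + 1)}
    for r, c in pixels:
        img[r, c, channel] = min(1.0, img[r, c, channel] + amount)

def string_map(word, channel, size=5, amount=0.25):
    """Build one string-map: colour the first and last letters, the line joining
    each bi-gram, and re-draw the first bi-gram in blue (channel 2)."""
    img = np.zeros((size, size, 3))
    pts = [COORDS[ch] for ch in word.upper() if ch in COORDS]
    if not pts:
        return img
    add_line(img, pts[0], pts[0], channel, amount)    # first letter
    add_line(img, pts[-1], pts[-1], channel, amount)  # last letter
    for a, b in zip(pts, pts[1:]):                    # every two-letter bi-gram
        add_line(img, a, b, channel, amount)
    if len(pts) > 1:
        add_line(img, pts[0], pts[1], 2, amount)      # first bi-gram in blue
    return img

jen_map = string_map("JEN", channel=0)  # red channel for the first record
print(jen_map[:, :, 0])                 # red-channel pixel intensities
```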

Fig. 2 Comparison of two strings. To compare the names “JEN” and “LINDA”, we add the string-map for “JEN” (left image) to the string-map for “LINDA” (middle image) to generate the final comparison image (right image)

For a given inventor name record, we generate string-maps for each variable in the record—i.e. first name, middle name, last name, city, IPC codes, co-inventors, and assignees. These string-maps are combined into a single image, arranged as shown in Fig. 3, which we refer to as a “record-map”.

Since a given patent–inventor record can have multiple assignees and/or co-inventors, we use a larger string-map for those variables (see Fig. 4, left image). This reduces the possibility that pixels will become saturated in cases where many assignees (or co-inventors) are overlaid onto the same string-map. We also add less colour to each pixel in these larger string-maps, again to reduce the possibility of saturation. For co-inventors, we include only the last name of each co-inventor on the patent (rather than including first, middle, and last names, which would increase the saturation of the co-inventor string-maps). Co-inventor and assignee text often includes more than one word, so, as was the case with inventor names, we add the pixel colours for each word to the same string-map, colouring pixels corresponding to all within-word bi-grams. Blue is used to colour the pixels of the first bi-gram of each word.

For IPC codes, which contain numbers as well as letters, we use a different string-map shown in Fig. 4 (right image).

Fig. 3 Record-map layout. Shows the positioning of each string-map within a given record-map

Fig. 4 Larger string-map for assignees and co-inventors, and IPC-map. The larger string-map used to convert a given list of assignees or co-inventors into an abstract image representation (left), and the IPC-map used to convert a given list of IPC classes into an abstract image representation (right)

Fig. 5 Comparison-map examples. Two examples of comparison-map images. The left comparison-map image was generated using two matched records (Table 1, rows 1 and 2), and the right image from two non-matched records (Table 1, rows 1 and 3)

We compare any two inventor name records by stacking the two associated 2D record-maps into the same RGB image, one as the red channel and the other as green (with the beginning two-letter bi-gram of each record sharing the blue channel). We refer to the resulting RGB image (or 3D tensor) representation as a “comparison-map” (Fig. 5).
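In code, this stacking step amounts to overlaying a red-channel record-map and a green-channel record-map into one RGB array. The sketch below is a minimal illustration under the same assumptions as the earlier string-map sketch; the stand-in arrays simply take the place of real record-maps (with the hypothetical string_map helper one would pass its outputs instead, as noted in the comment).

```python
import numpy as np

def comparison_map(record_map_a, record_map_b):
    """Overlay two record-maps (one drawn in red, one in green) into a single RGB
    comparison-map; overlapping structure shows up as yellow, and the leading
    bi-grams of both records share the blue channel."""
    return np.clip(record_map_a + record_map_b, 0.0, 1.0)

# With the string_map helper from the earlier sketch one would write, e.g.:
#   pair_image = comparison_map(string_map("JEN", channel=0),
#                               string_map("LINDA", channel=1))
red_map = np.zeros((31, 31, 3));   red_map[..., 0] = 0.5    # stand-in record-map (red)
green_map = np.zeros((31, 31, 3)); green_map[..., 1] = 0.5  # stand-in record-map (green)
print(comparison_map(red_map, green_map).shape)             # (31, 31, 3) RGB comparison-map
```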

Table 1 Mock records of three patent–inventor name instances

Since red and green combined produce yellow in the RGB colour model, a comparison-map image generated from two similar records should contain more yellow (e.g. Fig. 5, left image), whereas a comparison-map image from two dissimilar records should contain more red and green (e.g. Fig. 5, right image) due to less overlap between the two record-maps. When training on labelled comparison-maps, we expect that the neural network will learn to identify features such as these, which are useful for discriminating between matched/non-matched records. That is, the neural network’s learned pattern recognition on comparison-map images will essentially recognise underlying text patterns which are present in the associated patent–inventor name records.

Note that we chose the particular layout of the letters in the string-map shown in Fig. 1 heuristically, such that vowels (which are less important than consonants when assessing string similarity) are positioned towards the centre of the grid, where pixels are more likely to saturate. We also grouped letters with similar phonetic interpretations, such as “S” and “Z”, close to each other. We anticipated that this heuristic layout might make it more straightforward for the network to learn which features are associated with matches/non-matches. However, in the “Testing alternative string-maps” section we test how the heuristic layouts shown in Figs. 1, 2, 3, and 4 perform compared with alternative random layouts, and find similar performance regardless of the chosen layout.

Benefits of the comparison-map image representation

Our method of converting text into a stacked 2D RGB bitmap for neural net-based image classification has several benefits:

  • The powerful classification capabilities of previous image classification networks can be utilised for text-based record matching, with minimal modification.

  • The neural network learns its own features from the data, rather than learning from a feature vector of pre-defined string similarity measures chosen by the researcher.

  • Minor spelling variations and errors do not alter the resulting string-map very much, and the neural network can potentially learn that such minor features are unimportant for discriminating between matches and non-matches.

  • Matched records with differing word ordering (e.g. re-ordered co-inventor names on different patents) are likely to be identified as matched, due to overlapping pixels.

  • The neural net can potentially learn to ignore certain shapes of common words (e.g. “Ltd”, “LLC”, “Inc”, etc.) which are not useful for discrimination decisions.

  • Our novel disambiguation algorithm performs well under multiple different choices of alternative string-maps other than those shown in Figs. 1, 2, 3, and 4 (see “Testing alternative string-maps” section), suggesting that many variants of our comparison-map representation allow for robust pattern recognition and feature extraction.

Note that the above benefits of our text-to-image conversion method would also apply to other text-based comparison problems (e.g. data linkage, or disambiguation of academic papers), or to problems that require simultaneous classification of both text and image datasets.

Modifications to neural network architecture

To demonstrate that our text-to-image conversion method can be combined with an image classifier to perform text-based classification, we apply the method to a commonly used image classification neural network, the seminal “AlexNet” CNN (Krizhevsky et al., 2012). AlexNet was originally designed to classify colour images (\(224\times 224\times 3\)-pixel bitmaps) amongst 1000 classes. We slightly modify the network architecture to enable classification of pairwise comparison-map images (\(31\times 31\times 3\)-pixel bitmaps) into two classes (match/non-match), by altering four hyperparameters as shown in Table 2. We use the NVIDIA Deep Learning GPU Training System (DIGITS) v2.0.0 implementation of AlexNet with the Caffe backend (Jia et al., 2014). We use the default settings for the DIGITS solver (stochastic gradient descent), batch size (100), and number of training epochs (30). Rather than use the default learning rate (0.01), we use a sigmoid decay function to progressively decrease the learning rate from 0.01 to 0.001 over the course of the 30 training epochs, as testing indicated that this produced slightly higher accuracies. Instead of the 1000-neuron softmax output layer in AlexNet, we use a 2-neuron softmax output layer, which outputs a probability distribution across our two possible classes (match/non-match).

Table 2 Hyperparameters that differ between the two neural network architectures

Note that the default settings of the DIGITS v2.0.0 implementation of AlexNet transform the input data by: (1) altering input images to show the deviation from the mean of all input images (by subtracting the mean image from each input image); (2) randomly mirroring input images; and (3) taking a random square crop from the input image. The main purpose of these transformations is to introduce variability into the training images that is expected to be present in the unlabelled data; however, we do not use any of these transformations in this work because our images are much more self-consistent than those in the ImageNet database.
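For readers who prefer a code-level view of this classification step, the PyTorch sketch below shows the general shape of a small CNN over \(31\times 31\times 3\) comparison-map images with a 2-way softmax output. It is not the modified AlexNet/DIGITS/Caffe configuration used in this work; the class name and layer sizes are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class ComparisonMapClassifier(nn.Module):
    """Small AlexNet-style CNN for 31x31x3 comparison-map images; layer sizes are
    illustrative and do not correspond to the modified hyperparameters in Table 2."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                 # -> 32 x 15 x 15
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                 # -> 64 x 7 x 7
        )
        self.classifier = nn.Linear(64 * 7 * 7, 2)           # 2-way match / non-match output

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)  # train with nn.CrossEntropyLoss (softmax over 2 classes)

model = ComparisonMapClassifier()
dummy = torch.zeros(4, 3, 31, 31)  # a batch of four comparison-map images
print(model(dummy).shape)          # torch.Size([4, 2])
```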

Converting pairwise probabilities into inventor groups, and assigning UIDs

After running the trained neural network on the bulk patent data, each within-block pairwise comparison has an associated match probability. To assign UIDs to the bulk data, we convert these pairwise probabilities into linked (matched) “inventor groups” using a clustering algorithm. Each inventor group is a linked cluster of inventor name records which all refer to the same individual. Briefly, the clustering algorithm converts each pairwise probability value to a binary value (match/non-match) using a pre-selected probability threshold (\({\bar{p}}\)) as a cut-off. Each matched record is then clustered into a larger inventor group if the number of links (l) it has to that group is \(\geqslant\) the number of nodes in the group (n) times some threshold proportion value (\({\bar{l}}\)); i.e. if \(l \geqslant n {\bar{l}}\). This removes weakly-linked records from each group. For further detail on the clustering algorithm, see “Appendix 3”. Note that choosing different \({\bar{p}}\) and \({\bar{l}}\) values generates different trade-offs between precision and recall.
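A simplified sketch of this clustering step (thresholding, connected components over the match graph, and the \(l \geqslant n{\bar{l}}\) pruning rule) is shown below. It is an illustration only; the full procedure, including details omitted here, is described in Appendix 3.

```python
from collections import defaultdict

def cluster_inventors(pair_probs, p_bar=0.5, l_bar=0.3):
    """Simplified sketch of the clustering step (full rules in Appendix 3).

    pair_probs: dict mapping (record_id_a, record_id_b) -> match probability.
    p_bar:      probability threshold for turning a probability into a match link.
    l_bar:      a record is kept in a group only if its number of links l to the
                group satisfies l >= n * l_bar, where n is the group size.
    """
    # 1. Threshold pairwise probabilities into binary match links.
    links = defaultdict(set)
    for (a, b), p in pair_probs.items():
        if p >= p_bar:
            links[a].add(b)
            links[b].add(a)

    # 2. Connected components of the match graph form candidate inventor groups.
    groups, seen = [], set()
    for start in list(links):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(links[node] - group)
        seen |= group

        # 3. Remove weakly-linked records: keep only records with l >= n * l_bar.
        n = len(group)
        groups.append({r for r in group if len(links[r] & group) >= n * l_bar})
    return groups

probs = {("r1", "r2"): 0.97, ("r2", "r3"): 0.88, ("r1", "r3"): 0.10, ("r3", "r4"): 0.02}
print(cluster_inventors(probs, p_bar=0.5, l_bar=0.3))  # one group containing r1, r2, r3
```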

Once the clustering algorithm has been applied to each block, every patent–inventor name instance has an associated unique inventor ID, and the disambiguation process is complete.

Results

Here we firstly describe our procedure for dividing our labelled datasets into training and test data. We then evaluate our inventor disambiguation algorithm, compare those results to previous studies, and test alternative string-map layouts.

Labelled and bulk datasets

We use the IS and E&S labelled datasets to train the neural network to discriminate between matched and non-matched pairwise comparisons. Each of the labelled datasets is randomly separated into 80% training data (used to train the neural network) and 20% test data (used to assess algorithm performance). We use 75% of the training data to train the network, and the remaining 25% to perform validation assessments during training in order to monitor potential overfitting.

Duplicate removal and blocking are then performed on the labelled data, and comparison-map images are generated for all possible pairwise record comparisons within each block (723,178 comparison-maps for training and 144,552 comparison-maps for testing).

We also perform duplicate removal and blocking on the bulk data, generating comparison-maps for all possible pairwise within-block comparisons (stored as 3D numerical arrays). The trained neural network is then deployed on the bulk patent data, generating match/non-match probabilities for all pairwise within-block comparisons (112,068,838 comparison-maps). Prior to processing the bulk data, we experimented with multiple different values for the pairwise comparison probability threshold (\({\bar{p}}\)) and linking proportion threshold (\({\bar{l}}\)), based on evaluating the trained neural network on the labelled test data. Different \({\bar{p}}\) and \({\bar{l}}\) values produce different trade-offs between precision and recall, and we use values that produce an optimal trade-off (highest F1 score). We state each \({\bar{p}}\) and \({\bar{l}}\) value whenever quoting results from a given run of our disambiguation algorithm.

Evaluation

To evaluate the performance of the disambiguation algorithm, we use the manually labelled IS and E&S test data to estimate pairwise precision, recall, splitting, and lumping based on numbers of true positive (tp), false positive (fp), true negative (tn), and false negative (fn) pairwise links within the labelled test data, as follows (e.g. (Ventura et al., 2015; Kim et al., 2016)):

$$\begin{aligned} {\text {Precision}}= & {} {{\text {true pos.\ matches}} \over {\text {all pos.\ matches}}} = {{\text {tp}} \over {\text {tp}} + {\text {fp}}}, \end{aligned}$$
(1)
$$\begin{aligned} {\text {Recall}}= & {} {{\text {true pos.\ matches}} \over {\text {total true matches}}} = {{\text {tp}} \over {\text {tp}} + {\text {fn}}}, \end{aligned}$$
(2)
$$\begin{aligned} {\text {Splitting}}= & {} {{\text {false neg.\ non-matches}} \over {\text {total true matches}}} = {{\text {fn}} \over {\text {tp}} + {\text {fn}}}, \end{aligned}$$
(3)
$$\begin{aligned} {\text {Lumping}}= & {} {{\text {false pos.\ matches}} \over {\text {total true matches}}} = {{\text {fp}} \over {\text {tp}} + {\text {fn}}}. \end{aligned}$$
(4)

Higher values are better for precision and recall, while lower values are better for lumping and splitting errors. We also use the pairwise F1 score:

$$F1 = \frac{2 \times {\text {Precision}} \cdot {\text {Recall}}}{{\text {Precision}} + {\text {Recall}}}.$$
(5)

Since the F1 score accounts for the trade-off between precision and recall, it is the primary measure we use to compare the performance of different disambiguation algorithms.
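Equations (1)–(5) translate directly into code; the counts in the example below are illustrative only, not the counts from our labelled test data.

```python
def pairwise_metrics(tp, fp, tn, fn):
    """Pairwise evaluation metrics from Eqs. (1)-(5); tn is listed for
    completeness but does not enter any of these measures."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    splitting = fn / (tp + fn)
    lumping = fp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "splitting": splitting, "lumping": lumping, "F1": f1}

# Illustrative counts only (not the counts from our labelled test data).
print(pairwise_metrics(tp=9_800, fp=60, tn=120_000, fn=120))
```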

Disambiguation algorithm performance

Table 3 Performance of two example runs of our disambiguation algorithm (bottom rows), compared with other studies evaluated on the IS or E&S labelled datasets

The precision, recall, and F1 estimates for two example runs of our disambiguation algorithm are shown in the bottom two rows of Table 3—first is the highest F1 result obtained using the heuristic string-map character order (Figs. 1, 2, 3, 4), and second is the highest F1 result obtained using a randomly-generated string-map character order (see “Testing alternative string-maps” section for details). Table 3 also shows the best results (highest F1) obtained by previous studies which (1) disambiguate bulk USPTO patent data, and (2) evaluate their results using the same labelled datasets we use in this work (i.e. the IS and E&S datasets). Our inventor disambiguation algorithm performs well compared with these other disambiguation studies (Table 3, bottom two rows), marginally out-performing the previous state-of-the-art study of Kim et al. (2016) and obtaining a much higher F1 score than Yang et al. (2017) when measured via the IS and E&S datasets.

Table 4 Performance of our disambiguation algorithm relative to other studies, regardless of evaluation dataset

For completeness, we also compare our results to those of other studies which use labelled datasets other than the IS and E&S datasets used in this work—i.e. Table 4 shows the best results obtained by each study, regardless of the evaluation dataset. Note that Table 4 provides a less equitable comparison than Table 3, as there is generally a small amount of variation in an algorithm’s F1 score when evaluated on different labelled datasets. Nonetheless, we include Table 4 here for consistency with previous inventor name disambiguation studies, which often include comparisons to other studies with different evaluation datasets. Our disambiguation algorithm is again competitive with the other state-of-the-art inventor name disambiguation algorithms in Table 4, obtaining the highest F1 score compared with the other three studies which quote F1 results (top four rows, highest F1 score in bold), and obtaining the lowest splitting and lumping errors compared with the two studies which do not quote F1 results (bottom three rows, lowest splitting and lumping errors in bold).

Testing alternative string-maps

Here we compare the performance of our heuristic string-map layouts (Figs. 1, 2, 3, 4) to several alternative string-maps. The first alternative string-map we test has random character order; i.e. we keep the pixel co-ordinates identical to the co-ordinates of the associated heuristic layout, but randomise the order of each character (these randomised string-maps are shown in “Appendix 4”, Fig. 7). We also test two alternative string-maps in which we randomise both the pixel co-ordinate layout and character order (“Appendix 4”, Fig. 8). One alternative uses the large string-map for co-inventors and assignees (Fig. 8, right image). The other alternative uses the smaller \(5 \times 5\) pixel string-map for co-inventors and assignees (Fig. 8, left image), leading to a smaller comparison-map (see “Appendix 4”, Fig. 9). We also investigate a string-map with random character order in which we exclude the blue channel for leading bi-grams (Fig. 1, fourth image).

Table 5 Comparison of alternate string-map layouts

Estimates of precision, recall, and F1 for each of these alternative string-maps are shown in Table 5. For each alternative string-map, we ran the algorithm multiple times using different settings of the comparison probability threshold (\({\bar{p}}\)) and linking proportion threshold (\({\bar{l}}\)), and only show results from the run which produced the highest F1 score. Results obtained from each of the alternative string-maps are quite similar to those obtained using the heuristically-determined layout (F1 scores range from 98.99 to 99.09%). This suggests that our method of converting text into abstract image representations facilitates robust feature learning for several alternative choices of string-map structure, such as randomised string-map character order and/or layout, heuristic order and/or layout, different string-map sizes, and the inclusion/exclusion of a blue channel for leading bi-grams.

Examining lumping errors in large inventor groups

Labelled datasets such as the IS and E&S datasets contain far fewer records than the bulk data. While such subsets of labelled data are useful for measuring several facets of algorithm accuracy, they are not very useful for measuring lumping errors from very common names that become relevant only when processing much larger amounts of data (such as the full bulk dataset). This is because, although very common names are likely to be present in relatively small subsets of labelled data, there are far fewer of them compared with the full bulk dataset. When processing the bulk data, large numbers of common first names can lead to a high degree of connectivity in large blocks of common last names, introducing lumping errors in large inventor name groups.

We can investigate the presence of these types of lumping errors by examining the largest inventor groups. In “Appendix 5”, Tables 7 and 8, we show the name variation for the 10 largest inventor groups obtained using string-maps with heuristic character order and layout (i.e. the version of the disambiguation algorithm shown in the top row of Table 5). In many of the groups, there are several variations of first name within the same inventor group. Many of these look to be lumping errors, rather than variations of a single first name used by the same inventor. The lumping error issue also seems to be more prevalent for very common Japanese last names such as Takahashi, Nakamura, and Kobayashi.

We also represent the first name variation information in heatmap form in Fig. 6a, which shows, for each of the top 50 largest inventor groups, the proportion that each nth variation of the first name contributes to the group. Inventor groups with only one variation of the first name are plotted as a dark red (\({\text {proportion}}=1\)) square at the 0th position. Note that we see considerable variation in first names across the top 50 largest inventor groups (Fig. 6a). We should also note that name variations are only indicative of potential lumping errors—i.e. while variations in first names may represent inventor name records that belong to different individuals, in some cases they may represent inventor name records that belong to a single individual who has used different variations of their name on different patents.

Fig. 6 Variation of first names within largest inventor groups. Shows the degree of variation of first names within each of the top 50 largest inventor groups. Results from three different versions of the disambiguation algorithm (with heuristic string-map character order and layout) are shown: standard (a), augmented without sub-string splitting applied to large groups (b), and augmented with sub-string splitting applied to large groups (c). Colours show the proportion that each nth variation of the first name contributes to the inventor group. Inventor groups with only one variation of the first name will be plotted as a dark red (\({\text {proportion}}=1\)) square at the 0th position. Note that a has the most first name variations, while c has the least. (Color figure online)

To reduce the amount of lumping errors in large inventor groups when processing the full bulk dataset, we can apply extra disambiguation steps of:

  • separating pairwise matches with mismatched first names (as measured via a Damerau–Levenshtein distance of \(\geqslant 2\)) or mismatched middle initials, if those records do not share any assignees,

  • for large inventor groups (\(> 100\) records), which are more likely to contain large-group lumping errors due to very common names, using a more stringent criterion to identify mismatched first names (i.e. a Damerau–Levenshtein distance of \(\geqslant 1\), rather than \(\geqslant 2\)), unless one first name is a sub-string of the other (i.e. to avoid splitting nickname variations such as Chris and Christopher); a sketch of this rule is shown after this list.
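The Python sketch below illustrates this first-name mismatch rule. It uses the restricted Damerau–Levenshtein (optimal string alignment) distance, omits the middle-initial check, and the function names are illustrative assumptions; only the thresholds (\(\geqslant 2\), \(\geqslant 1\), and the 100-record cut-off) come from the description above.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment (restricted Damerau-Levenshtein) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def should_split(first_a, first_b, shared_assignee, large_group):
    """Decide whether a matched pair should be separated on first-name grounds."""
    if shared_assignee:
        return False
    a, b = first_a.lower(), first_b.lower()
    if large_group:
        # Stricter rule for groups of > 100 records, unless one name is a
        # sub-string of the other (e.g. nicknames such as Chris / Christopher).
        return osa_distance(a, b) >= 1 and a not in b and b not in a
    return osa_distance(a, b) >= 2

print(should_split("Chris", "Christopher", shared_assignee=False, large_group=True))  # False
print(should_split("Hiroshi", "Takeshi", shared_assignee=False, large_group=True))    # True
```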

Incorporating these extra steps into the disambiguation algorithm leads to a substantial reduction in the number of lumping errors in large inventor groups (see Fig. 6b, c and “Appendix” Tables 9 and 10) compared with the results obtained without these changes (see Fig. 6a and “Appendix” Tables 7 and 8). For the two different augmented versions of the disambiguation algorithm—one without sub-string splitting applied to large inventor groups (Fig. 6b), and one with sub-string splitting applied to large inventor groups (Fig. 6c)—we see the greatest reduction in lumping errors when sub-string splitting is applied to large inventor groups (Fig. 6c).

We also examine in Table 6 the degree to which the augmentation of the disambiguation algorithm, with and without sub-string splitting applied to large groups, affects the precision, recall, and F1 scores compared with the standard version of the algorithm. We do this for the two string-map methods that produced the highest F1 scores in Table 5; i.e. heuristic character order with heuristic layout, and random character order with heuristic layout. Table 6 shows that the augmented versions of the disambiguation algorithm have higher precision but lower recall and F1 scores, with sub-string splitting of large groups enhancing the differences from baseline.

Given all of the above considerations, we suggest that if large inventor groups are to be studied, then utilising the version of the disambiguation algorithm with sub-string splitting applied to large inventor groups would produce the most useful disambiguated inventor groups.

Table 6 Comparison of results with and without removal of lumping errors from large inventor groups

Conclusion

We introduced a new entity disambiguation algorithm and applied it to inventor names in USPTO patent applications. The text-to-image representations in our entity disambiguation algorithm provide a novel way of combining image processing with NLP, allowing image classifiers to perform text classification. We demonstrated this with the seminal AlexNet CNN, obtaining highly accurate results. We also analysed several variants of alternative string-maps, and found that the accuracy of the disambiguation algorithm was highly robust to such variation.

Since the core of our disambiguation algorithm is a classification method that determines how similar two text records are, it should be adaptable to other NLP problems which involve text matching of multiple strings, such as academic author name disambiguation, assignee disambiguation, or record linkage problems. For example, for assignee disambiguation, comparison-maps could be generated for pairs of assignee mentions in different patents, which would include string-maps for assignees and associated inventors. The challenges of adapting the algorithm for assignee disambiguation would include identifying a suitable labelled dataset of disambiguated assignees, identifying which fields to include as string-maps in each pairwise comparison-map, and adapting the blocking procedure for assignee data; however, we believe these challenges would be solvable. The algorithm could also be modified for less common applications, such as processing records that contain both text and image data. This could be done by combining each record’s associated image with the abstract image representation of the record’s text, in a single combined comparison-map.