In this section, we first introduce the datasets used in the experiments. We then define the evaluation measures, describe the baseline models and the experimental setup, and finally present the experimental results and discussion.
Datasets
To allow comparison with previous work, two datasets are used in this study: USPTO-2M and M-patent, which we describe in detail below.
USPTO-2M: USPTO-2M is a large-scale patent classification benchmark made publicly available by Li et al. (2018). It is extracted from the bulk data available online on the website of the United States Patent and Trademark Office (USPTO). The preprocessed data include the title and abstract sections, the patent number, and the IPC subclass labels of each patent document. The training set consists of 1,950,247 patents (2006-2014) and the test set contains 49,900 patents (2015), with 637 and 606 subclasses, respectively. However, after downloading the data and performing some analysis, we found that 1,739 of the training documents had no labels. Using the patent numbers of the documents with missing labels, we scraped the USPTO website and recovered the labels of 1,719 of them. As in Li et al. (2018), documents with fewer than ten words are excluded. In addition, we removed labels with fewer than 100 documents and kept only 544 labels at the IPC subclass level. The training and test data are described in Table 1. It should be noted that there is a slight difference between the dataset used in this paper and the data described in Li et al. (2018); for this reason, we re-implement their proposed method on our version of the dataset.
Table 1 Description of USPTO-2M

M-patent: We also used a smaller dataset, a subset of the CLEF-IP 2011 dataset (Piroi et al., 2011), which we refer to as the M-patent dataset. The CLEF-IP dataset includes 1.35 million patents gathered from the European Patent Office (EPO) and the World Intellectual Property Organization (WIPO) and contains patents in English, French, and German. Similar to Hu et al. (2018a) and their M-CLEF dataset, the title, abstract, description, and claim sections of English patents belonging to the F category of the IPC taxonomy are extracted. Each extracted document has at least one IPC label, and the labels are chosen at the subclass level of the IPC hierarchy. After removing documents missing any of the desired parts and cleaning the data, the dataset contains 69,522 documents. Table 2 describes the training and test data.
Table 2 Description of M-patent

Evaluation measures
One of the problems of previous research on patent classification, as mentioned before, is that different evaluation measures are used when reporting classification results. Unlike single-label classification, in multi-label classification one or more labels can be assigned to each example concurrently. Evaluating the performance of multi-label classification is therefore more complicated (Wu & Zhou, 2017), especially when the number of labels is high, as in patent classification. Moreover, the distribution of patent documents across the IPC categories is highly imbalanced (Gomez & Moens, 2014; Lupu et al., 2017). We believe not all previously used evaluation measures suit our problem; nevertheless, we report the measures used in previous patent research for the sake of comparability.
Multi-label evaluation measures can be divided into two main groups, namely, label-based metrics and example-based metrics (Tsoumakas et al., 2009). Example-based metrics are calculated independently for each example and then averaged over the total number of examples. In contrast, label-based metrics are calculated for each label independently and then averaged. Two averaging strategies can be utilized for label-based metrics, namely, macro-averaging and micro-averaging (Wu & Zhou, 2017). Macro-averaging is similar to example-based averaging, but over labels instead of examples. In the micro-averaging strategy, on the other hand, the counts of misses and hits are aggregated first, and then the desired metric is calculated only once (Charte et al., 2016). Consequently, the weights assigned to each label in calculating the final measure are not the same, and the uneven distribution of data across labels is taken into consideration. Therefore, with imbalanced data, it is common to evaluate performance with the micro-F1 measure (Yang et al., 2009).
Given n patents and m labels, let the numbers of true positives, false positives, true negatives, and false negatives be TP, FP, TN, and FN, respectively. The micro precision, recall, and F1 measure (Gibaja & Ventura, 2014) are calculated as follows.
$$\mathrm{Micro\ Precision} = \frac{\sum_{i=1}^{m} TP_{i}}{\sum_{i=1}^{m} TP_{i} + \sum_{i=1}^{m} FP_{i}}$$
(7)
$$\mathrm{Micro\ Recall} = \frac{\sum_{i=1}^{m} TP_{i}}{\sum_{i=1}^{m} TP_{i} + \sum_{i=1}^{m} FN_{i}}$$
(8)
$$\mathrm{Micro\ F1} = \frac{2 \times \mathrm{Micro\ Precision} \times \mathrm{Micro\ Recall}}{\mathrm{Micro\ Precision} + \mathrm{Micro\ Recall}}$$
(9)
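For reference, these micro-averaged scores can be computed with scikit-learn; the following is a minimal sketch using toy label matrices (not our experimental data):

```python
# A minimal sketch of the micro-averaged metrics in Eqs. (7)-(9), assuming
# y_true and y_pred are binary indicator matrices of shape (n_documents, n_labels).
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0]])   # toy ground-truth labels
y_pred = np.array([[1, 0, 0], [0, 1, 1]])   # toy binary predictions

micro_p = precision_score(y_true, y_pred, average="micro")
micro_r = recall_score(y_true, y_pred, average="micro")
micro_f1 = f1_score(y_true, y_pred, average="micro")
print(micro_p, micro_r, micro_f1)
```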
Moreover, we also report two ranking-based measures: Coverage Error (Wu & Zhou, 2017) and Label Ranking Average Precision (LRAP) (Tsoumakas et al., 2009). Given the ground-truth label matrix \(L \in \{0,1\}^{n \times m}\), let \({\hat{L}} \in {\mathbb{R}}^{n \times m}\) denote the probabilities estimated by the model before they are converted to a binary bipartition using a threshold value. Coverage error indicates, on average, how many labels in the ranked list of estimated probabilities are required to cover all the true labels, and is computed as follows:
$$\mathrm{Coverage\ Error} = \frac{1}{n} \sum_{i=1}^{n} \max_{j: l_{ij}=1} rank_{ij}$$
(10)
where \(rank_{ij} = \left| \left\{ g : \hat{l}_{ig} \ge \hat{l}_{ij} \right\} \right|\).
LRAP is related to the average precision score; however, it uses the concept of label ranking instead of precision and recall. It evaluates the capability of the classifier to assign higher ranks to the correct labels associated with each sample and is calculated as follows:
$$\mathrm{LRAP} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\left\| l_{i} \right\|_{0}} \sum_{j: l_{ij}=1} \frac{\left| \mathcal{L}_{ij} \right|}{rank_{ij}}$$
(11)
where \(\mathcal{L}_{ij} = \left\{ g : l_{ig}=1, \hat{l}_{ig} \ge \hat{l}_{ij} \right\}\), \(|.|\) is the cardinality of a set, and \({\left\| . \right\|}_{0}\) is the \({l}_{0}\)-norm.
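Both ranking measures are also implemented in scikit-learn; a minimal sketch with the same kind of toy matrices follows:

```python
# A minimal sketch of the two ranking-based measures, assuming y_true is the
# binary label matrix L and y_score holds the estimated probabilities L-hat.
import numpy as np
from sklearn.metrics import coverage_error, label_ranking_average_precision_score

y_true = np.array([[1, 0, 1], [0, 1, 0]])                # toy ground-truth labels
y_score = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.3]])   # toy estimated probabilities

print(coverage_error(y_true, y_score))                          # Eq. (10)
print(label_ranking_average_precision_score(y_true, y_score))   # Eq. (11)
```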
Some previous research (Li et al., 2018; Hu et al., 2018a) utilized the evaluation measures from the CLEF-IP competition (Piroi et al., 2011): first predict k (e.g., 1, 5) labels for each document, and then calculate precision, recall, and F1 at top-k for each prediction, as shown below. For the sake of comparison, we report precision, recall, and F1 at the top 1 label.
$$\mathrm{Precision} = \frac{\mathrm{correct\ predictions}}{\mathrm{all\ predictions}}$$
(12)
$$\mathrm{Recall} = \frac{\mathrm{correct\ predictions}}{\mathrm{all\ relevant\ documents}}$$
(13)
$$\mathrm{F1\ measure} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
(14)
For all the metrics, a higher value is better, except for coverage error, for which a smaller value is preferable.
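For illustration, the following is a minimal sketch of the top-k measures under our reading of the definitions above; the helper function and its interpretation of the denominator in Eq. (13) as the total number of true labels are our own, not the official CLEF-IP evaluation script:

```python
# A minimal sketch of precision, recall, and F1 at the top-k predicted labels
# (here k=1), assuming y_true is a binary label matrix and y_score holds the
# predicted probabilities. The helper top_k_prf is illustrative only.
import numpy as np

def top_k_prf(y_true, y_score, k=1):
    top_k = np.argsort(-y_score, axis=1)[:, :k]                # indices of the k highest scores
    correct = sum(y_true[i, top_k[i]].sum() for i in range(len(y_true)))
    precision = correct / (k * len(y_true))                    # correct / all predictions
    recall = correct / y_true.sum()                            # correct / all true labels (our assumption)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.3]])
print(top_k_prf(y_true, y_score, k=1))
```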
Baselines and experiment setup
Li et al. (2018) evaluated the effect of using different patent sections and numbers of words on the classification performance on the USPTO-2M dataset. They concluded that using the first 100 words of the title and abstract results in the best classification performance. Therefore, to compare our experiments with the DeepPatent model proposed by Li et al. (2018), we use the first 100 words of the title and abstract in our experiments on the USPTO-2M dataset. Figure 3a shows the number of words in the title and abstract sections of the USPTO-2M dataset.
For the M-patent dataset, we follow the same guideline and select the title and abstract sections. As shown in Fig. 3b, about 95% of the combined title and abstract texts in the M-patent dataset contain fewer than 128 words (the red line is drawn at 128). Consequently, we set the maximum number of words to 128 in the corresponding experiments. Moreover, a standard 80/20 split was applied for the training and validation sets. Similar to Hu et al. (2018a), we further employ other deep learning baseline models used for classification, namely LSTM, BiLSTM, CNN, and CNN-BiLSTM. We keep the input text the same for all models to make a fair comparison.
We describe the baseline models first and then explain the implementation details of fine-tuning the pre-trained language models for patent classification.
DeepPatent: We used a CNN architecture similar to DeepPatent (Li et al., 2018) and trained 200-dimensional word embeddings based on the Skip-gram model. The CNN architecture is as follows: a convolutional layer with three kernel sizes (3, 4, and 5), a max-pooling layer applied to the output of each convolutional layer, and a fully connected layer with units equal to the number of labels and a sigmoid activation function (a rough sketch is given after the baseline descriptions below). Furthermore, we also examined 200-dimensional word embeddings trained with the CBOW and fastText models.
LSTM: A single-layer LSTM with 128 units, followed by a fully connected layer with units equal to the number of labels and a sigmoid activation function.
BiLSTM: The same architecture as the LSTM model, but the LSTM layer is replaced with a BiLSTM layer.
CNN-BiLSTM: A convolutional layer with kernel size 3, rectified linear unit (ReLU) activation, and 128 kernels, followed by max-pooling. The resulting feature maps are fed into a BiLSTM layer with 128 units. Dropout is applied to the output of the BiLSTM layer, which is then fed into a fully connected layer with units equal to the number of labels and a sigmoid activation function.
For all the models, the Adam optimizer (Kingma & Ba, 2014) and a binary cross-entropy loss function were used. Moreover, we use dropout (Srivastava et al., 2014) after the embedding layer and before the fully connected layer, with drop rates of 0.2 and 0.25, respectively. Similar to Li et al. (2018), we train DeepPatent for 50 epochs on the USPTO-2M dataset. For the experiments on the M-patent dataset, the number of epochs is set to 40, the same as in Hu et al. (2018a).
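For concreteness, the following is a rough Keras sketch of the DeepPatent-style CNN baseline; the vocabulary size, number of filters, and label count are placeholders, not the original implementation:

```python
# A rough Keras sketch of the DeepPatent-style CNN baseline described above.
# Vocabulary size, number of filters, and label count are placeholders; the
# embedding layer would be initialized with the pre-trained 200-d vectors.
from tensorflow.keras import layers, models

MAX_LEN, VOCAB_SIZE, EMB_DIM, NUM_LABELS = 100, 50000, 200, 544  # placeholders

inputs = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)
x = layers.Dropout(0.2)(x)                          # dropout after the embedding layer
branches = []
for k in (3, 4, 5):                                 # three kernel sizes
    conv = layers.Conv1D(filters=128, kernel_size=k, activation="relu")(x)
    branches.append(layers.GlobalMaxPooling1D()(conv))   # max-pooling per convolution
x = layers.Concatenate()(branches)
x = layers.Dropout(0.25)(x)                         # dropout before the fully connected layer
outputs = layers.Dense(NUM_LABELS, activation="sigmoid")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
```

The recurrent baselines can be sketched analogously by replacing the convolutional block with `layers.LSTM(128)`, `layers.Bidirectional(layers.LSTM(128))`, or a `Conv1D` plus `Bidirectional(LSTM)` stack.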
The Keras library (Chollet et al., 2015) was utilized for converting the input text to a sequence of tokens, padding or truncating it to the desired length, and implementing the baseline models. Each word of the input sequence is represented by a 200-dimensional word vector. We tested three different word vectors, generated with CBOW, Skip-gram, and fastText, for all the models.
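A minimal sketch of this tokenization and padding step is shown below; the toy texts and the maximum length are illustrative (100 words for USPTO-2M, 128 for M-patent):

```python
# A minimal sketch of the Keras tokenization and padding step; the toy
# title+abstract strings stand in for the pre-processed patent texts.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

train_texts = ["rotary combustion engine with improved cooling",
               "hydraulic brake assembly for rail vehicles"]   # toy examples
MAX_LEN = 100

tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_texts)                         # build the vocabulary
sequences = tokenizer.texts_to_sequences(train_texts)       # words -> integer ids
padded = pad_sequences(sequences, maxlen=MAX_LEN,
                       padding="post", truncating="post")   # pad or truncate to MAX_LEN
```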
Word embedding pre-training: Two common word embedding models, word2vec and fastText, are used to generate 200-dimensional word embeddings from the patent texts in our datasets. The Python Gensim library (Řehůřek & Sojka, 2010) is utilized to generate word2vec embeddings based on both the CBOW and Skip-gram models, and fastText embeddings based on the Skip-gram model, with a context window size of 5. We applied standard preprocessing steps such as removing punctuation, non-alphabetic characters, and stop words, converting the text to lowercase, lemmatizing, and reducing multiple spaces to a single space.
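A minimal sketch of this step with Gensim (4.x argument names) follows; the toy corpus stands in for the tokenized, pre-processed patent texts:

```python
# A minimal sketch of the embedding pre-training with Gensim (4.x argument names).
from gensim.models import Word2Vec, FastText

corpus = [["rotary", "combustion", "engine"],
          ["hydraulic", "brake", "assembly"]]   # toy tokenized sentences

w2v_cbow = Word2Vec(sentences=corpus, vector_size=200, window=5, sg=0, min_count=1)  # word2vec CBOW
w2v_sg = Word2Vec(sentences=corpus, vector_size=200, window=5, sg=1, min_count=1)    # word2vec Skip-gram
ft_sg = FastText(sentences=corpus, vector_size=200, window=5, sg=1, min_count=1)     # fastText Skip-gram
```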
Fine-tuning pre-trained language models: BERT, XLNet, RoBERTa, and ELECTRA are fine-tuned on the downstream task of multi-label patent classification. The publicly available Simple Transformers library (Rajapakse, 2019), built on top of the Hugging Face Transformers library (Wolf et al., 2019), was used for conducting the experiments. The pre-trained model weights are all provided in the Hugging Face Transformers library. The base versions of the models were used in our experiments. Table 3 shows the model types and their details.
Table 3 Pre-trained model names and details

An advantage of using such libraries is that we can use the pre-trained models in a unified way and fine-tune them with limited computational resources. Only a few hyperparameters, such as batch size, number of epochs, maximum sequence length, and learning rate, were changed for the experiments. These hyperparameters were chosen based on the recommendations of the original papers and kept the same for all the models. Table 4 shows the hyperparameters used in the experimental setting. However, for the experiments on the M-patent dataset, we set the number of epochs to 15 with early stopping (Caruana et al., 2001) with a patience of 3 to prevent over-fitting.
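A rough sketch of this fine-tuning step with Simple Transformers is shown below; the toy data frame and the hyperparameter values are placeholders only, the actual values are those listed in Table 4:

```python
# A rough sketch of fine-tuning one of the pre-trained models with the
# Simple Transformers library. The toy data frame and hyperparameters are
# placeholders; in practice num_labels equals the number of IPC subclasses.
import pandas as pd
from simpletransformers.classification import MultiLabelClassificationModel

train_df = pd.DataFrame({
    "text": ["toy patent abstract one", "toy patent abstract two"],
    "labels": [[1, 0, 0], [0, 1, 1]],            # multi-hot IPC subclass labels
})

model = MultiLabelClassificationModel(
    "xlnet", "xlnet-base-cased", num_labels=3,   # placeholder label count
    args={
        "max_seq_length": 128,                   # placeholder hyperparameters
        "train_batch_size": 16,
        "num_train_epochs": 3,
        "learning_rate": 2e-5,
        "output_dir": "outputs/",
        "overwrite_output_dir": True,
    },
)

model.train_model(train_df)
predictions, raw_outputs = model.predict(["another toy patent abstract"])
# raw_outputs holds per-label probabilities, thresholded at 0.5 by default
```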
Table 4 Hyperparameter setting for fine-tuning various pre-trained models

All the experiments are carried out on a machine with 180 GB of RAM, an Intel(R) Xeon(R) Gold 5120 CPU @ 2.20 GHz with 16 cores, and two Tesla V100 GPUs with 16 GB of memory each.
Experimental results and discussion
This section presents the experimental results on the two datasets, M-patent and USPTO-2M, in Tables 5 and 6, respectively. We fine-tuned various pre-trained language models for the task of multi-label patent classification and compared them with several neural network architectures proposed in the literature.
Table 5 presents the results of the experiments on the M-patent dataset. As can be seen from the table, all the pre-trained language models outperformed the baselines, obtaining higher micro-F1 and F1 at the top 1 label prediction. This demonstrates the effectiveness of fine-tuning these models for multi-label patent classification and their ability to capture the features of patent documents better than the baseline models. XLNet outperformed all the other models, achieving the highest micro-F1 (0.736) and LRAP (0.850). Moreover, its precision, recall, and F1 at the top 1 prediction are 82.29%, 67.70%, and 72.08%, respectively. Among the baseline models, except when using the CBOW word embeddings, DeepPatent obtained the best performance, followed by the CNN-BiLSTM model. DeepPatent with the fastText embeddings achieved 79.77% precision, 65.52% recall, and 69.79% F1 at the top 1 label prediction. On the other hand, the LSTM models yielded the lowest micro-F1 and other evaluation measures. The BiLSTM models, which can capture bidirectional dependencies in the text, show better performance than the LSTM models. In general, the models that include LSTM or BiLSTM achieve higher micro-precision but lower recall. In the case of CBOW embeddings, adding CNN to the BiLSTM model does not improve performance. Figure 4 presents the performance of the pre-trained models on the M-patent validation set. The difference in the number of training steps is due to early stopping, applied to prevent the models from over-fitting; none of the models trained for more than 11 epochs. The global training steps versus the relative time are shown in Fig. 5c, which indicates that XLNet took the longest to train among the pre-trained language models.
Furthermore, we also conducted experiments on the USPTO-2M dataset contributed by Li et al. (2018) and used their DeepPatent as the baseline model. Table 6 presents the results of the experiments on the USPTO-2M dataset. Re-implementing DeepPatent on this dataset with our trained Skip-gram word vectors, we obtained precision, recall, and F1 of 77.10%, 52.00%, and 58.95% at the top 1 label prediction, respectively. Using our trained fastText embeddings instead of the word2vec embeddings further improved on the original DeepPatent results. However, once again, all the pre-trained language models outperformed DeepPatent on all metrics. XLNet achieves a new state-of-the-art performance of 82.72%, 55.89%, and 63.33% for precision, recall, and F1 at the top 1 label prediction, respectively. Moreover, the micro-F1 of 0.523 obtained with the original DeepPatent (Skip-gram) increased to 0.572 with the XLNet model. XLNet also obtained the best LRAP (0.808) and coverage error (8.986) on the USPTO-2M dataset.
Table 5 Experiments on the M-patent dataset

Table 6 Experiments on the USPTO-2M dataset

The precision, recall, and F1 measures are reported by converting the models' output probabilities to binary bipartition vectors. The common approach is to apply a threshold that maps all probabilities above it to 1 and the rest to 0. The results reported in Tables 5 and 6 are all based on the conventional threshold of 0.5. However, in tasks where recall or precision is preferred over the other, a different threshold value may be chosen; recall is considered more important in many patent-related tasks. Figures 5 and 6 illustrate how various threshold values change the micro-precision, recall, and F1 measure for the XLNet model (as an example) in the USPTO-2M and M-patent experiments.
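As an illustration, the following is a minimal sketch of this thresholding step; the probability matrix is a toy placeholder, not actual model output:

```python
# A minimal sketch of converting output probabilities to binary predictions
# with a threshold and tracing the micro-averaged scores as the threshold varies.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0]])                 # toy ground-truth labels
probs = np.array([[0.8, 0.3, 0.45], [0.2, 0.9, 0.55]])    # toy output probabilities

for t in (0.3, 0.5, 0.7):
    y_pred = (probs >= t).astype(int)                     # binary bipartition at threshold t
    print(t,
          precision_score(y_true, y_pred, average="micro", zero_division=0),
          recall_score(y_true, y_pred, average="micro", zero_division=0),
          f1_score(y_true, y_pred, average="micro", zero_division=0))
```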
Moreover, we investigated the effect on classification performance of using inputs of the same length (the first 128 words) taken from the beginning of the description and from the first claim of the M-patent dataset. Even though the micro-precision increases when using the beginning of the description, the overall micro-F1 does not improve. The title and abstract sections resulted in better classification performance, as shown in Fig. 7, because they are more informative for patent classification than the description or the claim. However, longer inputs might need to be considered when using the description or claim sections of a patent. Nevertheless, determining which part of the patent yields the best classification performance is not the focus of this study.
Overall, among the baseline deep learning models, the ones using the fastText embeddings trained on the patent corpus perform better than those using word2vec embeddings trained with the CBOW or Skip-gram algorithm. This is because patent text differs from other scientific or academic text: the jargon and the complex, novel technical terms abundant in patents cause out-of-vocabulary problems for conventional word embeddings. fastText can overcome the out-of-vocabulary problem by breaking an unseen word into character n-grams and summing the n-gram embeddings to provide a vector representation for it. Skip-gram shows better performance than CBOW because it learns better from smaller data and represents rare words better. However, all of these embedding methods share the limitation of not accounting for polysemy: regardless of the context, each word is assigned a single embedding vector. Accounting for polysemy is important for understanding the patent context and extracting features from it, since in patent documents the same word may have different meanings in different technology areas. The pre-trained language models use embedding methods that can deal with out-of-vocabulary words and overcome the polysemy problem by pre-training contextual representations of the input text.
For encoding higher-level features from the word embeddings, even though LSTM is known to encode long-term dependencies in text and to capture the sequential nature of language, it is not powerful enough to capture the context of patent documents for the multi-label classification task. Recurrent neural network-based models may encode information that is not entirely relevant for classification; this problem mainly occurs when the input text is long and information-rich, as in patents, where assigning the relevant labels requires a more selective encoding. The attention mechanism of the Transformer-based pre-trained language models is more powerful in capturing dependencies in a sequence than LSTM or even BiLSTM architectures, which are not able to understand the patent text semantically well enough to assign all the associated labels to a document. DeepPatent is a CNN model; CNNs are hierarchical, and different kernel sizes can extract various important n-gram features or semantic clues from the input. A CNN can therefore capture phrasal expressions in the patent data for the classification task but fails to model long-distance dependencies or contextual information. Multi-word units and phrasal expressions contribute to identifying the related categories for each document; however, the general terms within such multi-word combinations and the complexity of patent documents make relying only on these n-gram features insufficient for multi-label patent classification. Nevertheless, the deep Transformer layers of the pre-trained BERT, XLNet, RoBERTa, and ELECTRA models, when fine-tuned on patent text, can encode much richer patent document features by considering the bidirectional context, leading to better classification performance.
The experimental results on both M-patent and USPTO-2M indicate that XLNet obtained the best performance on the multi-label patent classification task. One plausible explanation for this outcome is the permutation language modeling objective used for pre-training XLNet. Except for XLNet, which is a generalized autoregressive pre-training model, the other models are autoencoding, i.e., they rely on corrupting the input text in some way and then trying to restore it. The permutation language modeling in XLNet combines the benefits of both autoencoding and autoregressive methods while avoiding their shortcomings. For example, BERT uses the MLM pre-training objective to enable bidirectional pre-training but fails to consider the dependency between the masked positions, and the data corruption caused by masking tokens also leads to a potential pretrain-finetune discrepancy. The permutation language modeling objective of XLNet avoids these limitations and captures more of the dependencies and phrasal structures present in patent text than the other pre-trained models. Furthermore, unlike the other models, which are based on the vanilla Transformer architecture, XLNet adopts the relative positional encoding scheme and the segment recurrence mechanism of Transformer-XL; it can therefore learn dependencies better for tasks with longer text sequences. Even though RoBERTa and ELECTRA also improved on the MLM objective of BERT, they did not show better performance when fine-tuned on patent data. RoBERTa is technically similar to BERT but improves the MLM by dynamically altering the token masking pattern and pre-training for longer on much larger data; nevertheless, it shows performance similar to BERT on the patent data, and the classification results were even slightly better for BERT. This suggests that adding more data for pre-training does not necessarily lead to better performance when fine-tuning on patent text. Moreover, the RTD pre-training objective of ELECTRA also does not show any improvement over the other models on the downstream task of patent classification. Consequently, XLNet, with its additional autoregressive features, is more suitable for understanding and encoding the phrasal structures and the complex language used in patent documents, and it provides the best patent classification result among the pre-trained language models.
However, the pre-trained language models take longer to fine-tune; even though XLNet obtained the best performance with the fixed hyperparameters, it took roughly twice as long to fine-tune as RoBERTa and ELECTRA. On the USPTO-2M dataset, XLNet took around 22.5 hours to fine-tune, while RoBERTa and ELECTRA took only around 10 hours and 20 minutes and 10 hours, respectively. CNN models are faster to train than the other models and can achieve acceptable performance. Nonetheless, little data pre-processing is needed when using the pre-trained models. Even though we kept the hyperparameters the same for all of the pre-trained models for the sake of comparison, they still achieved strong results; conducting systematic hyperparameter tuning separately for each model may lead to better performance, which we leave for future research. Therefore, using pre-trained language models is promising for enhancing patent classification performance and other patent analysis tasks that require powerful language understanding models. Moreover, due to the unique characteristics and importance of patent text, Srebrovic and Yonamine (2020) from Google recently trained a slightly modified BERT model exclusively on more than 100 million patent documents. They used an input sequence size of 512 and limited the maximum number of masked words to 45 per sequence. They also used a custom tokenizer, optimized specifically for patent text, that extends the standard BERT vocabulary with frequently occurring patent words; this prevents the long words common in patents from being broken into smaller word pieces and should improve performance on downstream patent-related tasks. The main focus of their work was to show how to utilize this model for contextual synonym generation in patents and its effectiveness, but they also highlighted additional applications such as general classification and patent autocomplete. The model and checkpoints, along with the configuration and vocabulary files, are publicly available in the Google patents-public-data GitHub repository. However, this model follows the large BERT architecture, with 24 Transformer layers, a hidden dimension of 1024, 16 attention heads, and approximately 340 million parameters, which is much larger than the base version of BERT. This number of parameters requires substantial computational resources and memory even for a single inference pass, let alone for adapting the model to the classification task. Therefore, we could not include BERT for Patents in our experiments due to limitations in computational resources. Additional scaling and speed-up techniques may be required to deploy BERT for Patents with less computational power, which we leave for future study.