PatCID: an open-access dataset of chemical structures in patent documents

Morin, Lucas; Weber, Valéry; Meijer, Gerhard Ingmar; Yu, Fisher; Staar, Peter W. J.

doi:10.1038/s41467-024-50779-y

Article
Open access
Published: 02 August 2024

Volume 15, article number 6532, (2024)

Lucas Morin ORCID: orcid.org/0000-0002-5829-5118^1,2,
Valéry Weber¹,
Gerhard Ingmar Meijer¹,
Fisher Yu² &
…
Peter W. J. Staar¹

Abstract

The automatic analysis of patent publications has the potential to accelerate research across various domains, including drug discovery and material science. Within patent documents, crucial information often resides in visual depictions of molecule structures. PatCID (Patent-extracted Chemical-structure Images database for Discovery) provides access to such information at scale. It enables users to search which molecules are displayed in which documents. PatCID contains 81M chemical-structure images and 14M unique chemical structures. Here, we compare PatCID with state-of-the-art chemical patent-databases. On a random set, PatCID retrieves 56.0% of molecules, which is higher than automatically-created databases, Google Patents (41.5%) and SureChEMBL (23.5%), as well as manually-created databases, Reaxys (53.5%) and SciFinder (49.5%). Leveraging state-of-the-art methods of document understanding, PatCID's high-quality data outperforms currently available automatically-generated patent-databases. PatCID even competes with proprietary manually-created patent-databases. This enables promising applications for automatic literature review and learning-based molecular generation methods. The dataset is freely accessible for download.

Introduction

Recent advances in document understanding enable the acceleration of discoveries in chemistry. Patent documents and scientific publications provide a wealth of knowledge that can only be effectively exploited by automated large-scale processing. Searching for information in patent documents carries high stakes for industrial applications, especially with respect to freedom-to-operate, prior-art search, or landscape analysis¹. Additionally, in the chemistry domain, a substantial proportion of scientific findings is disseminated only in patent documents, or only later published in scientific journals^2,3,4. With the continuous growth of patent applications per year, access to chemical information in patent documents poses key challenges. Chemical knowledge, including compounds, reactions, and molecular properties, is presented in documents in a non-standardized way, using multiple modalities such as text descriptions, tables, and depictions. Proprietary databases, such as Elsevier Reaxys⁵ and CAS SciFinderⁿ⁶, aim to provide a solution to search this chemical information in documents. Being manually curated, they are considered the gold standard for literature search. However, their development requires massive and continuous effort, and given the manual process, they cannot cover all patent documents and collections. To address these challenges, several projects leveraging automatic document processing have been developed, including SureChEMBL⁷, Google Patents, SciWalker, Patentscope⁸, or IBM SIIP⁹. In recent years, new pipelines have also been developed to convert documents into chemical structures¹⁰. The currently available evaluations suggest that these databases fall short in comparison to manually-processed databases, both in terms of document coverage and processing quality^1,11,12. In particular, documents published before 2000 or by Asian Pacific patent offices are either not covered or covered with poor quality, even though they provide unique and disruptive innovation¹³.
Furthermore, manually- and automatically-created databases are designed to retrieve a set of document identifiers referring to a specified molecule. Users then need to open the documents and manually search for pages containing references to the molecule they are interested in. This approach does not allow for effective navigation through large collections and poses a substantial limitation for often lengthy patent documents.

In this work, we present PatCID, the Patent-extracted Chemical-structure Images database for Discovery. PatCID allows users to find patents mentioning a given molecule and, conversely, all molecules covered by specific patents. Leveraging state-of-the-art document understanding models to automatically process patent documents, PatCID bridges the gap between manually- and automatically-created patent chemical-databases. Containing documents from major offices (United States, Europe, Japan, Korea, and China) since 1978, PatCID outperforms other chemical patent databases in terms of coverage and quality for both molecular and document retrieval. PatCID also offers a unique interactive document exploration experience. PatCID accelerates discoveries in chemistry by assisting literature review and by providing a basis for training foundational models in chemistry. The processing pipeline used to build PatCID is published open-source and the dataset, including molecules, document identifiers, and locations is openly accessible¹⁴.

Results

PatCID is a chemical-structure dataset automatically created from images in patent documents. Figure 1 illustrates its principal usage for document and molecule retrieval. A molecule can be searched with as-drawn, similarity, or substructure search, and a list of patents referencing the molecule is retrieved. On the other hand, molecules selected from a specific document can be extracted and then leveraged to browse and explore the document. PatCID allows persons in the intellectual property domain to carry out prior-art search or landscape analysis¹, and persons in the organic chemistry domain to review patent literature in various fields such as drug discovery, pharmaceutical chemistry, or material science¹⁵.

**Fig. 1: PatCID usage for document and molecule retrieval.**

To perform a comprehensive evaluation of PatCID, we compare it with state-of-the-art patent databases. We evaluate high-level statistics on molecule and document coverage, molecular-structure search performance, and the ability to extract molecules from different sections of documents. Additionally, we evaluate each component of the processing pipeline used to build PatCID.

Data statistics

PatCID covers documents from five major patent offices: the United States (USPTO¹⁶), Europe (EPO¹⁷), Japan (JPO¹⁸), Korea (KIPO¹⁹), and China (CNIPA²⁰). The selected documents are associated with the field of organic chemistry by mentioning the term ‘alkyl’. For an exemplary time window of the years 2010–2019, these five patent offices cover 1.06M patent families in the field of organic chemistry, while all 107 patent offices worldwide²¹ cover 1.16M, i.e., the offices covered in PatCID represent 90% of published patent documents in the field of organic chemistry. (Here, a patent family refers to the set of patent documents that disclose the same invention, possibly published in different countries.) In total, PatCID indexes 80.7M molecule images, resulting in 13.8M unique chemical structures. This extensive coverage allows the use of PatCID for applications related to various domains of organic chemistry. Additional details related to the collection selection and statistics are provided in Supplementary Note 1.

Table 1 compares key characteristics of state-of-the-art chemical patent databases. It shows the number of patent documents, molecules, and unique molecules covered by patent databases, as well as which offices are covered and since when. PatCID contains documents that are not manually annotated in Reaxys: the documents published between 1978 and 2001 by the offices in the U.S. and Europe, between 2004 and 2015 in Japan, and between 1998 and 2015 in Korea. PatCID contains 80.7M molecules, which is substantially more than Google Patents (39.8M) and SureChEMBL (48.8M). PatCID also contains 13.8M unique molecules, which is more than Google Patents (13.2M) and SureChEMBL (11.6M). Here, the number of molecules (respectively, unique molecules) is the number of non-distinct (respectively, distinct) canonical Simplified Molecular-Input Line-Entry System (SMILES)²² strings indexed. Additionally, covering Asian Pacific offices is a great advantage over SureChEMBL, as about 70% of patent documents from Asian Pacific offices are not extended to the United States (see Supplementary Note 1). Further information on obtaining the database characteristics can be found in the Methods section. For PatCID, detailed statistics by office are also available in Table 2.
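The distinction between molecules and unique molecules can be illustrated with a minimal sketch. It assumes the SMILES strings have already been canonicalized upstream by a cheminformatics toolkit; the records, document identifiers, and the function name `count_molecules` are hypothetical and are not part of the PatCID code.

```python
from collections import Counter

# Hypothetical (document_id, canonical_smiles) records; canonicalization is
# assumed to have been done upstream by a cheminformatics toolkit.
records = [
    ("DOC-A", "CCO"),
    ("DOC-A", "c1ccccc1"),
    ("DOC-B", "CCO"),
    ("DOC-C", "CC(=O)O"),
]


def count_molecules(records):
    """Return (non-distinct count, distinct count) of canonical SMILES."""
    counts = Counter(smiles for _, smiles in records)
    return sum(counts.values()), len(counts)


total, unique = count_molecules(records)
print(total, unique)
```

Here the same structure indexed in two documents contributes twice to the first count and once to the second.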

Table 1 Patent databases statistics

Table 2 PatCID detailed characteristics

Document ingestion pipeline

PatCID leverages state-of-the-art document understanding models to ingest documents. As illustrated in Fig. 2, the ingestion pipeline uses three components: the document segmentation (DECIMER-Segmentation²³), the image classification (MolClassifier), and the chemical structure recognition (MolGrapher²⁴). The document segmentation module locates the position of chemical images in documents. Chemical images comprise molecular-structure images and Markush-structure²⁵ images. (Markush structures are sets of molecules defined using positional and frequency variation indications.) To distinguish molecular-structure images from Markush-structure images, we use an image classification module with three output classes: ‘Molecular Structure’, ‘Markush Structure’, and ‘Background’. This also makes it possible to filter out some outliers from the segmentation step, as segmentation errors fall into the ‘Background’ class. Finally, molecular-structure images are converted to molecular graphs using MolGrapher, without stereo-chemistry, and stored as SMILES.
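The control flow of the three stages described above can be sketched as follows. This is an illustration under stated assumptions, not the published pipeline: the `ingest_page` function, the `Extraction` record, and the stub models are hypothetical stand-ins for the segmentation, classification, and recognition networks.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


@dataclass(frozen=True)
class Extraction:
    page: int
    box: Box
    smiles: str


def ingest_page(
    page_number: int,
    page_image: object,
    segment: Callable[[object], List[Box]],
    classify: Callable[[object, Box], str],
    recognize: Callable[[object, Box], Optional[str]],
) -> List[Extraction]:
    """Segment a page, keep molecular structures, and recognize them."""
    extractions = []
    for box in segment(page_image):
        # Markush structures and segmentation errors ('Background') are dropped.
        if classify(page_image, box) != "Molecular Structure":
            continue
        smiles = recognize(page_image, box)
        if smiles is not None:
            extractions.append(Extraction(page_number, box, smiles))
    return extractions


# Stub models with hypothetical outputs, for illustration only.
_LABELS = {
    (0, 0, 10, 10): "Molecular Structure",
    (20, 0, 30, 10): "Markush Structure",
    (40, 0, 50, 10): "Background",
}


def stub_segment(image):
    return list(_LABELS)


def stub_classify(image, box):
    return _LABELS[box]


def stub_recognize(image, box):
    return "CCO"


demo = ingest_page(7, None, stub_segment, stub_classify, stub_recognize)
print(demo)
```

Passing the models as callables keeps the sketch independent of any particular model API; only the box classified as ‘Molecular Structure’ reaches the recognition step.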

As PatCID is one of the first document-to-molecular-structures pipelines, there is no benchmark for simultaneously evaluating the document segmentation, image classification, and molecule recognition steps. There are not even benchmarks for independently evaluating the document segmentation (with annotated bounding boxes) and the image classification. For this reason, we introduce two benchmark datasets: D2C-RND (Document to Chemical Structures, Random) and D2C-UNI (Document to Chemical Structures, Uniform). Each of these datasets contains three subsets: a first set for evaluating the document segmentation, a second set for image classification, and a third set for the molecule recognition module. Molecules sampled from the recognition subset are taken from images in the classification dataset, which are taken from the pages in the segmentation dataset. This strategy allows us to precisely assess the impact of each module on the overall data quality of the database. D2C-RND is sampled using a random distribution on chemical images, resulting in a higher abundance of recent patents and patents from the U.S. office. This test set can evaluate the average quality of databases. On the other hand, D2C-UNI covers a uniform distribution with respect to the year of publication and publishing office in order to assess databases in challenging scenarios. Specifically, molecule images from older patents and from non-U.S. offices can be of lower quality and use a less standard display style. An example illustrating the diversity of display styles for the same patented molecule in different countries is shown in Supplementary Fig. 6. As the first benchmarks for end-to-end document-to-chemical structures conversion, these benchmarks will benefit future research in this area¹⁴. In total, they contain 700 manually-annotated pages, 753 manually-annotated chemical images, and 364 precisely annotated molecular graphs (MOL files²⁶).
More details can be found in the “Methods” section below.

Table 3 presents the performances of these three key ingestion steps. It shows the precision and recall of the page segmentation, the image classification, and the chemical-structure recognition, and for DECIMER-Segmentation and MolGrapher, a comparison with state-of-the-art models. For the recognition module, the precision is computed using InChIKey²⁷ equality, ignoring stereo-chemistry. The evaluation is performed for the random benchmark D2C-RND and the uniform benchmark D2C-UNI. Further details are available in the “Methods” section.

Table 3 Pipeline comparison

The segmentation and classification modules achieve high precision and recall of more than 80% on both datasets. The segmentation module outperforms YoDe-Segmentation²⁸ in terms of recall and precision by more than 40% on both benchmarks. The recognition module correctly recognizes 63.0% of randomly selected molecule images in PatCID. This is substantially higher than OSRA (45.6%), which is currently used in the pipelines of automatically-created databases. On this dataset, DECIMER achieves 67.2% and MolScribe achieves 75.9%. MolScribe was not available at the time PatCID was created. It can also be noted that some images from our benchmarks are part of MolScribe’s training data. MolGrapher was preferred over DECIMER for its performance on standard benchmarks (see ref. ²⁴) and its runtime performance advantage, allowing it to be run using CPU only (see Supplementary Table 1). More details on the computational considerations can be found in the Methods section below. For all components, models perform better on the random set D2C-RND than on the uniform set D2C-UNI, confirming that documents published recently and in the United States are easier to automatically process. The PatCID ingestion pipeline includes basic filtering steps, such as verifying that the predicted molecular structures contain only one fragment. Based on MolGrapher filtered precision, the precision of the complete PatCID processing pipeline is 54.5% on D2C-RND and 41.3% on D2C-UNI. The recall of the complete pipeline is 46.0% on D2C-RND and 44.5% on D2C-UNI. Qualitative examples of the ingestion pipeline predictions are shown in Supplementary Figs. 1 and 2.
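The single-fragment check mentioned above can be approximated at the string level, since SMILES uses a dot to separate disconnected fragments. The sketch below is an assumption about how such a filter could look, not the filter used in PatCID; a toolkit-based fragment count would be more robust, because a dot can in rare cases appear between parts that are joined by a ring-closure bond.

```python
def is_single_fragment(smiles: str) -> bool:
    """True if the SMILES string appears to describe one connected fragment.

    In SMILES, '.' separates disconnected fragments, so its absence
    indicates a single fragment. Empty predictions are rejected.
    """
    return bool(smiles) and "." not in smiles


# Hypothetical recognition outputs, for illustration only.
predictions = ["CCO", "[Na+].[Cl-]", "c1ccccc1", ""]
kept = [s for s in predictions if is_single_fragment(s)]
print(kept)
```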

Search evaluation

In this section, we compare the molecule and document retrieval performance of PatCID with state-of-the-art databases.

Each benchmark dataset contains pairs of molecules and patent documents, from which the molecules have been extracted. By searching for documents in various databases, we compute the document retrieval performance, defined as the percentage of documents retrieved with chemical annotation attached, and we compute the molecule retrieval performance, defined as the percentage of molecules retrieved from the correct reference documents. A query molecule is retrieved if the annotation and ground-truth have identical InChIKeys, ignoring stereo-chemistry. The complete querying process for each database is explained in the Methods section. We first compare automatically-created databases, and then compare manually-created databases.
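One common way to compare InChIKeys while ignoring stereo-chemistry is to compare only the first 14-character block, which encodes the molecular connectivity. The text does not state how stereo-chemistry is ignored in practice, so the sketch below is an assumption; note that the second block also carries isotopic information, which this comparison therefore ignores as well. The function names are hypothetical.

```python
def connectivity_block(inchikey: str) -> str:
    """Return the first 14-character block of a standard InChIKey."""
    block = inchikey.strip().upper().split("-")[0]
    if len(block) != 14:
        raise ValueError(f"not a valid InChIKey: {inchikey!r}")
    return block


def same_molecule_ignoring_stereo(key_a: str, key_b: str) -> bool:
    """Compare two InChIKeys on their connectivity block only."""
    return connectivity_block(key_a) == connectivity_block(key_b)


L_ALANINE = "QNAYBMKLOCPYGJ-REOHZZQFSA-N"          # with stereo-chemistry
ALANINE_NO_STEREO = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"  # without stereo-chemistry
ETHANOL = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"

print(same_molecule_ignoring_stereo(L_ALANINE, ALANINE_NO_STEREO))
print(same_molecule_ignoring_stereo(L_ALANINE, ETHANOL))
```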

Table 4 compares the recall of molecules and annotated documents of state-of-the-art automatically-created databases on benchmarks D2C-RND and D2C-UNI. For the random set D2C-RND, PatCID achieves a molecule recall of 56.0%, which is higher than Google Patents with visual annotations (36.5%) and higher than Google Patents and Reaxys with visual plus textual annotations (41.5%). For the challenging set D2C-UNI, PatCID achieves a molecule recall of 47.6% and substantially outperforms SureChEMBL with visual annotations (4.9%) and Google Patents with visual annotations (9.8%). It also surpasses Reaxys with textual and visual annotations by more than 10%. PatCID data quality outperforms all automatic databases by a substantial margin. For D2C-RND, the annotated document recall is 100%, compared to 68.2% in Google Patents, and for D2C-UNI, 98.2%, compared to 67.0% in Google Patents. PatCID has substantially better document coverage. It can be noted that the low document coverage of SureChEMBL is due to the missing coverage of Asian Pacific patent offices. While the PatCID ingestion pipeline only covers visual representation of molecules, its quality and robustness still enable it to surpass state-of-the-art automatically-created databases. SureChEMBL and Google Patents also rely on textual data, and molecular structures information (MOL files) directly provided by the USPTO. Supplementary Table 2 reports the overlap between textual and visual annotations for molecules in Google Patents and SureChEMBL.

Table 4 Search comparison for automatically-created databases

Table 5 compares the recall of molecules and annotated documents of state-of-the-art manually- and automatically-created patent databases. PatCID molecule recall outperforms manual annotations of SciFinder for both D2C-RND (56.0% against 49.5%) and D2C-UNI (47.6% against 47.0%). Also, the PatCID annotated document recall is higher than Reaxys with manual and automatic annotations for D2C-RND (100% against 68.8%) and D2C-UNI (98.2% against 67.0%). This advantage of document coverage allows PatCID to compete with Reaxys. Indeed, for D2C-RND, PatCID achieves better molecule retrieval performance than Reaxys, even though Reaxys combines manual and automatic annotations retrieved from images as well as text. Additionally, Reaxys and SciFinder benefit from exploiting the patent families grouping. For example, when searching for a molecule in a Korean patent, SciFinder and Reaxys are allowed to retrieve the query molecule from any patent in its family, for instance a patent from the U.S. patent office. This is an advantage because the Korean patent depiction style is typically more challenging to automatically process than U.S. patent documents, in which the style is more standardized (see Supplementary Fig. 6). For D2C-UNI, Reaxys has a molecule recall of 51.2%, which is better than the 47.6% molecule recall in PatCID.

Table 5 Search comparison for manually- and automatically-created databases

Figure 3 illustrates the proportions of molecules covered in the PatCID and Reaxys databases for the random (D2C-RND) and uniform (D2C-UNI) benchmarks, and their subsets restricted to documents annotated in Reaxys. For the D2C-UNI benchmark, Reaxys, with its manual and automatic annotations, covers 51.2% of molecules, while PatCID covers 47.6%, but together they cover a total of 67.1%. Even though Reaxys performs better on average, some of the molecules correctly found in PatCID are not found in Reaxys. Even restricting the evaluation to documents annotated in Reaxys, PatCID covers 8.7% of molecules from D2C-RND and 5.5% of molecules from D2C-UNI, which are not covered in Reaxys. To complement this analysis, a comparison of the number of molecules annotated per patent in PatCID and Reaxys for the random (D2C-RND) benchmark is shown in Supplementary Fig. 5. PatCID bridges the gap between automatically- and manually-created databases, and stands out as a complementary tool to manually-curated databases.
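The complementarity argument above comes down to set operations on the benchmark molecules found in each database. A minimal sketch with toy identifiers (all hypothetical, not benchmark data) shows how individual, combined, and exclusive coverage relate:

```python
def coverage_report(benchmark, found_a, found_b):
    """Percentages of benchmark molecules covered by A, by B, by either,
    and exclusively by each database."""
    if not benchmark:
        raise ValueError("benchmark must not be empty")
    a = found_a & benchmark
    b = found_b & benchmark

    def pct(subset):
        return round(100 * len(subset) / len(benchmark), 1)

    return {
        "a": pct(a),
        "b": pct(b),
        "union": pct(a | b),
        "only_a": pct(a - b),
        "only_b": pct(b - a),
    }


# Toy benchmark of ten molecules with hypothetical identifiers.
benchmark = {f"m{i}" for i in range(1, 11)}
found_in_a = {"m1", "m2", "m3", "m4", "m5"}
found_in_b = {"m4", "m5", "m6", "m7", "m8", "m9"}

report = coverage_report(benchmark, found_in_a, found_in_b)
print(report)
```

The combined coverage exceeds either individual coverage whenever each database finds molecules the other misses, which is the situation described for PatCID and Reaxys.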

**Fig. 3: Search comparison between PatCID and Reaxys.**

Document coverage evaluation

Patent documents in the field of organic chemistry are typically written following two different styles. In the first case, a patent begins by enumerating a large number of molecular structures, and thereafter, for selected key molecules a detailed description and synthetic routes are presented. In the second case, a patent is structured such that from the start, a limited number of molecules is described in detail. Molecules in the description (before examples) section refer to molecules that are displayed and for which no synthesis or properties are provided. Molecules in the description (examples) section refer to molecules that are displayed and for which a synthesis or properties are provided.

This section presents an evaluation of the coverage of different document sections using two documents that are typical examples of different writing styles. US20220127225 has overall a very large number of molecules and for only a few molecules, the synthesis is described in the examples. US9096558 has overall a few molecules, and for all molecules the synthesis is described in the examples. In these two documents, the positions of all chemical structures were manually annotated. In each section, 50 images (if available) were randomly selected and their molecular structures were precisely annotated. In total, this test set was created by manually annotating the position of 1822 molecule images in 235 pages, as well as 141 molecular graphs (MOL files).

Table 6 shows the percentage of correctly retrieved molecules from different patent sections in different chemical-structures databases. PatCID's fully automated process allows it to cover entire documents, including the abstract, the drawings, the description (including examples), and the claims sections. On the other hand, because of their limited workforce, manually-curated databases restrict themselves to the molecules in the examples and in the claims sections. As a result, some key patented compounds can be missed. For instance, as illustrated in Table 6, the patent US20220127225 contains mainly molecules before the examples subsection, with virtually none found in Reaxys or SciFinder, whereas PatCID retrieves 78% of them. An example of a page containing only molecules missed in SciFinder and Reaxys, and almost all found in PatCID, is shown in Supplementary Fig. 3. These molecules illustrated before the examples section can be all the more valuable as some of them are not found in any entries of the entire Reaxys and SciFinder databases. A qualitative example of such molecules is shown in Supplementary Fig. 4. For these reasons, PatCID has a clear advantage over SciFinder and Reaxys with respect to the coverage of sections within documents.

Table 6 Document coverage comparison

Interactive document exploration

Figure 4 illustrates an example of document exploration with PatCID. Unlike SureChEMBL, Google Patents, and Reaxys, PatCID, given a query molecule, not only finds the documents referencing this molecule but also keeps the provenance of its explicit location within documents. For patent documents that can span hundreds of pages and contain thousands of similar molecules, this feature is very useful. It allows users to explore documents interactively and easily refer to content neighbouring the query molecule. This content may show related molecular structures or, as depicted in Fig. 4, the synthesis of the molecule.
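Keeping provenance amounts to storing, for each indexed molecule, the document, page, and bounding box where it was found. The following sketch shows one possible in-memory layout; the class names, field names, and document identifiers are hypothetical and do not describe the actual PatCID schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Location:
    document_id: str
    page: int
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) on the page


class ProvenanceIndex:
    """Maps molecules to their locations and documents to their molecules."""

    def __init__(self) -> None:
        self._locations: Dict[str, List[Location]] = defaultdict(list)
        self._molecules: Dict[str, Set[str]] = defaultdict(set)

    def add(self, molecule: str, location: Location) -> None:
        self._locations[molecule].append(location)
        self._molecules[location.document_id].add(molecule)

    def documents_for(self, molecule: str) -> List[str]:
        return sorted({loc.document_id for loc in self._locations.get(molecule, [])})

    def locations_for(self, molecule: str, document_id: str) -> List[Location]:
        return [
            loc
            for loc in self._locations.get(molecule, [])
            if loc.document_id == document_id
        ]

    def molecules_in(self, document_id: str) -> List[str]:
        return sorted(self._molecules.get(document_id, set()))


index = ProvenanceIndex()
index.add("CCO", Location("DOC-A", 12, (40, 60, 180, 200)))
index.add("CCO", Location("DOC-A", 57, (10, 10, 90, 90)))
index.add("CCO", Location("DOC-B", 3, (0, 0, 50, 50)))
index.add("c1ccccc1", Location("DOC-A", 12, (200, 60, 340, 200)))

print(index.documents_for("CCO"))
print(index.locations_for("CCO", "DOC-A"))
```

With such a layout, a query first returns the documents and then the pages and boxes within a chosen document, so a viewer can jump directly to the neighbouring content.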

Providing a dataset of annotated chemical structures, embedded in documents, PatCID can also serve as a foundation for building multi-modal document understanding methods²⁹.

Discussion

Our extensive comparison of PatCID with state-of-the-art chemical patent-databases shows that recent advances in document mining make it possible (1) to substantially increase data quality in automatically-created chemical patent-databases and (2), owing to better document coverage, to compete with manually-curated databases.

The PatCID ingestion pipeline is based on state-of-the-art document understanding models. Other works introduced workflows for converting PDF documents to chemical structures, including closed-source projects such as MolMiner³⁰, CLiDE³¹, or α-Extractor³², and the open-source project DECIMER-AI¹⁰. Similarly to these works, the PatCID ingestion pipeline can process any type of document containing chemical images, such as research articles. However, it is specially optimized for processing patent documents with high precision. Our method also differs from others in its runtime, especially since MolGrapher runs on CPU about 2 times faster than DECIMER-AI (see Supplementary Table 1). The end-to-end document to chemical-structure benchmarks we introduce can also serve as the basis for evaluating future development in this research direction. Reaxys’ and SciFinder’s manual annotations have an advantage over automatic ingestion pipelines with respect to data quality, as all extracted molecules should be correct. However, for applications such as freedom-to-operate and prior-art search, recall is arguably the most critical metric¹. This is where PatCID has an advantage. Automatically-generated databases are also said to face the limitation that key compounds may be hard to find among all annotated compounds, which include solvents, radicals, or fragments¹¹. Such irrelevant and abundant molecules would have many occurrences in the database. However, in PatCID, 88% of molecules have fewer than 5 occurrences, with molecules counted only once per document (see Supplementary Fig. 8). Irrelevant and abundant compounds such as solvents, radicals, or fragments only represent a small fraction of molecules found in PatCID. A comparable analysis leads to the opposite conclusion for SureChEMBL visual and textual annotations¹¹.
It suggests that such irrelevant compounds are more often found in text, which is not a problem for PatCID, which only takes images into account. Further analysis of the distribution of the number of occurrences of molecules in PatCID is found in Supplementary Note 2. It is worth pointing out that for specific use cases, compound relevancy can be arbitrarily defined, and users may be looking for a way to identify specific subsets of compounds in the database^33,34,35.
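The occurrence statistic discussed above (molecules counted only once per document) can be computed as in the following sketch. The records and the function name `fraction_rare` are hypothetical; the threshold of 5 follows the text.

```python
from collections import Counter


def fraction_rare(records, threshold=5):
    """Fraction of unique molecules occurring in fewer than `threshold` documents.

    `records` is an iterable of (document_id, molecule) pairs; repeated
    pairs are collapsed so that a molecule counts once per document.
    """
    unique_pairs = set(records)
    occurrences = Counter(molecule for _, molecule in unique_pairs)
    if not occurrences:
        return 0.0
    rare = sum(1 for n in occurrences.values() if n < threshold)
    return rare / len(occurrences)


# Toy data: water ("O") appears in six documents, twice in the first one;
# the three other molecules appear in a single document each.
records = [(f"DOC-{i}", "O") for i in range(1, 7)]
records += [("DOC-1", "O"), ("DOC-1", "CCO"), ("DOC-2", "c1ccccc1"), ("DOC-3", "CC(=O)O")]

print(fraction_rare(records))
```

In this toy example, three of the four unique molecules fall below the threshold, while the abundant solvent does not.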

Patent documents contain critical information related to chemical structures, including measured properties, or synthesis paths, which are not necessarily published in research articles^2,3. To assess the exclusivity of molecules in PatCID, the overlap between molecules in PatCID and PubChem³⁶ is computed (see the Methods section). Only 7.0M molecules (out of 13.8M) in PatCID are found in PubChem, confirming that PatCID provides novel and exclusive information. A qualitative example of molecules exclusive to PatCID is shown in Supplementary Fig. 4. Additionally, as PatCID contains a large portion of the world’s patented molecules, its analysis can provide key elements for understanding patented organic chemistry. Enabling large-scale processing of patent data is not only critical in the patent space. Learning-based molecular generation methods can benefit from PatCID, as a large corpus of 13.8M unique chemical structures can be used for training models^37,38,39.

In conclusion, PatCID is a chemical-structure database sourced from patent publications. Leveraging state-of-the-art document understanding models, PatCID surpasses automatically-generated databases in terms of data quality by substantial margins. With its extensive document coverage, PatCID can even compete with gold-standard manually-curated databases. PatCID accelerates discoveries in chemistry by assisting with patent literature review, as well as providing a basis for training molecular generation models. In the future, PatCID will also aim to integrate chemical-structures information from text, polymers, and a subset of Markush structures. The processing pipeline used to build PatCID is published open-source, and the dataset, including molecules and their corresponding document locations, is freely accessible for download¹⁴.

Methods

Document ingestion pipeline

This section describes the document ingestion pipeline, illustrated in Fig. 2.

Document segmentation

For segmentation, the DECIMER-Segmentation model²³ was used. It uses Mask R-CNN⁴⁰ to predict an initial mask for each chemical structure in a PDF page and a deterministic mask expansion algorithm to refine the initial masks. The mask expansion algorithm was optimized and multi-processed to allow large-scale processing. The model was trained on pages of patent documents to improve its precision and recall for this application domain. For further details, we refer to the original publication²³.

Image classification

To classify segmented images, we introduce MolClassifier. The image classification module enables Markush structures to be filtered out, since, although recent attempts have been made⁴¹, there is as yet no reliable approach available for the automated recognition of Markush structures at scale. MolClassifier uses Mask R-CNN⁴⁰ with three output classes: ‘Molecular Structure’, ‘Markush Structure’, and ‘Background’. Here, using a segmentation network instead of a classification-only network allows the training to benefit from stronger supervision. In particular, small details at the image level, such as R-groups, are critical for distinguishing molecular structures from Markush structures. In this case, label annotations can be converted to mask annotations at no additional cost, given that the images are black and white and contain only the molecules. An alternative approach is to classify multiple patches of the chemical image⁴². To train MolClassifier, a dataset of 15,720 manually-labelled chemical images was created. The chemical images are randomly selected from the outputs of the segmentation module for documents from the USPTO. As the first classification dataset for molecules and Markush structures in patent documents, this set can aid future research in this domain¹⁴. The training images are augmented using standard image augmentations of scaling, rotation, blurring, and noising with pepper patches. Separating the image segmentation and classification modules decomposes the molecule segmentation problem into two simpler tasks. This classification step is particularly easy to train and supervise, using label annotations only.
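The conversion of an image-level label into a mask annotation can be sketched as follows. This is one plausible reading of the step, not the published procedure: it assumes a grayscale image given as rows of pixel values (0 for black, 255 for white), treats dark pixels as belonging to the structure, and uses hypothetical class identifiers and a hypothetical ink threshold.

```python
# Hypothetical class identifiers; 0 doubles as the "no ink" value in the mask.
CLASS_IDS = {"Background": 0, "Molecular Structure": 1, "Markush Structure": 2}


def label_to_mask(image, label, ink_threshold=128):
    """Assign the image-level class to every dark (ink) pixel.

    `image` is a list of rows of grayscale values in [0, 255]. Pixels
    darker than `ink_threshold` receive the class identifier; all other
    pixels are set to 0.
    """
    class_id = CLASS_IDS[label]
    return [
        [class_id if pixel < ink_threshold else 0 for pixel in row]
        for row in image
    ]


image = [
    [255, 0, 255],
    [0, 0, 255],
]
mask = label_to_mask(image, "Markush Structure")
print(mask)
```

Because the crop contains only the structure on a white background, no manual mask drawing is needed: the ink pixels themselves define the mask.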

Molecule recognition

The molecule recognition step is performed using MolGrapher²⁴. MolGrapher is a graph-based model for converting 2D molecular structure images to machine-readable molecular descriptions. The model comprises a deep keypoint detector and a graph neural network that classifies atoms and bonds. The model demonstrates a precision advantage over rules-based models and is competitive with other learning-based approaches. The model is trained on synthetic images generated using RDKit⁴³. Further details can be found in the original publication²⁴. This includes an extensive evaluation on standard benchmarks and an analysis of the model robustness. The model robustness is especially important for the low-resolution and unconventional images frequently found in documents from Asian Pacific patent offices. In addition, running MolGrapher at scale allowed us to compute the distribution of common superatoms, i.e. abbreviated substructures, in PatCID (see Supplementary Fig. 9). Such information is valuable to guide future developments of Optical Chemical Structure Recognition models.

Database evaluation

Benchmarks and metrics

Here, we characterize benchmarks and metrics used in the evaluation of the ingestion pipeline and the database.

Two benchmark datasets are introduced: D2C-RND (Document to Chemical structures, Random) and D2C-UNI (Document to Chemical structures, Uniform). D2C-RND contains 325 pages, 378 images, and 200 molecules, following a random distribution of chemical images. This dataset is intended to reflect the average quality of annotations in PatCID. Since the number of published patents has increased over time, this set contains mainly recent patents. D2C-UNI contains 375 pages, 375 images, and 164 molecules, following a uniform distribution over the year of publication and publishing office of chemical images. This dataset is intended to cover diverse molecules to assess the quality of annotations in a challenging scenario. Older patent documents and documents from non-U.S. offices generally display molecule images in a less standardized way or with lower resolution and are ultimately more difficult to automatically process with high accuracy. To create these two sets, an intermediate first set was selected by sampling 1400 random pages for each patent office. In these 7000 pages, the locations of 15465 chemical images (molecules, Markush structures, or polymers) were annotated with bounding boxes using Label Studio⁴⁴. For each office, the distribution of the number of images per slice of 10 years was computed. Additionally, the intermediate set can be used to estimate the number of images per office in PatCID by finding the number of images per page in the intermediate set and scaling it by the number of pages in PatCID. Next, the D2C-UNI dataset was created. One image per unique page was selected in the intermediate set. Given that molecules from the same page are likely to be similar, this strategy increases the diversity of molecules in the set. Then, images were sampled according to the uniform distribution over year slices and offices computed from the intermediate set. Finally, we built the D2C-RND dataset.
From the intermediate set, images were randomly sampled following the number of images per office estimated in PatCID. Given the limited number of manual annotations performed, the use of two different sampling strategies for the D2C-RND and D2C-UNI benchmarks allows the evaluation to be overall more representative of the full PatCID dataset.
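The two sampling strategies can be illustrated with a minimal Python sketch. It is not the authors' code: the record fields (`office`, `year`, `page`), the function names, the toy page counts, and the choice to draw a fixed number of images per (office, decade) stratum are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def estimate_images_per_office(images, pages_annotated, pages_in_database):
    """Scale the images-per-page rate of the annotated sample to the full page count."""
    counts = Counter(img["office"] for img in images)
    return {
        office: counts[office] / pages_annotated[office] * pages_in_database[office]
        for office in pages_annotated
    }


def one_image_per_page(images, rng):
    """Keep a single randomly chosen image per page to increase molecule diversity."""
    by_page = defaultdict(list)
    for img in images:
        by_page[img["page"]].append(img)
    return [rng.choice(group) for group in by_page.values()]


def sample_uniform(images, per_stratum, rng):
    """D2C-UNI-like draw: the same number of images from each (office, decade) stratum."""
    strata = defaultdict(list)
    for img in one_image_per_page(images, rng):
        decade = img["year"] // 10 * 10
        strata[(img["office"], decade)].append(img)
    picked = []
    for key in sorted(strata):
        pool = strata[key]
        picked.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return picked


def sample_by_office(images, images_per_office, total, rng):
    """D2C-RND-like draw: each office's share follows its estimated image count."""
    norm = sum(images_per_office.values())
    picked = []
    for office in sorted(images_per_office):
        pool = [img for img in images if img["office"] == office]
        k = min(round(total * images_per_office[office] / norm), len(pool))
        picked.extend(rng.sample(pool, k))
    return picked


# Toy data (hypothetical): 2 offices x 2 years x 5 pages, two images per page.
rng = random.Random(0)
toy = [
    {"office": o, "year": y, "page": f"{o}-{y}-{p}"}
    for o in ("US", "EP") for y in (1995, 2005) for p in range(5) for _ in range(2)
]
# Hypothetical page counts: annotated pages per office and pages in the full database.
estimates = estimate_images_per_office(toy, {"US": 10, "EP": 20}, {"US": 100, "EP": 100})
uniform_set = sample_uniform(toy, per_stratum=2, rng=rng)
random_set = sample_by_office(toy, estimates, total=8, rng=rng)
```

In this toy setting the uniform draw returns two images from each of the four strata, each from a distinct page, whereas the office-weighted draw favors the office with the higher estimated image count.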

In both datasets, each image is annotated with one of three labels: ‘Molecular Structure’, ‘Markush Structure’, or ‘Background’. For molecular structures, the molecular graph is annotated using the molecule editor Ketcher⁴⁵. An application was built to carry out these annotations efficiently (see Supplementary Fig. 10). In particular, it allows importing initial predictions from an Optical Chemical Structure Recognition model and editing them, rather than starting from scratch. Graph reconstruction models such as MolGrapher, which preserve atom locations, allow the annotator to quickly align the molecular graph with the image and thus work more efficiently.

To evaluate the ingestion pipeline, the precision and recall of each ingestion component are computed on D2C-RND and D2C-UNI. For the segmentation module, a predicted bounding box is considered a true positive if its intersection over union with the ground truth is higher than 95%. For the classification module, the precision and recall of the predicted ‘Molecular Structure’ class are computed. Finally, for the recognition module, the percentage of recognized molecular images is evaluated. In practice, a molecule is considered recognized if the prediction and the ground truth have identical InChIKeys, ignoring stereochemistry.
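As an illustration of the segmentation criterion, the following Python sketch computes the intersection over union of two boxes and derives precision and recall from it. The box format `(x_min, y_min, x_max, y_max)` and the greedy one-to-one matching of predictions to ground-truth boxes are assumptions; the text only specifies the 95% threshold.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predicted, truth, threshold=0.95):
    """Count a prediction as a true positive if its best unmatched ground-truth
    box has an IoU above the threshold (greedy matching, an assumption)."""
    unmatched = list(truth)
    true_positives = 0
    for box in predicted:
        best = max(unmatched, key=lambda gt: iou(box, gt), default=None)
        if best is not None and iou(box, best) > threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy boxes (hypothetical): one tight prediction, one loose prediction, one spurious box.
truth = [(0, 0, 100, 100), (200, 200, 300, 300)]
predicted = [(0, 0, 100, 99), (200, 200, 300, 280), (400, 400, 450, 450)]
precision, recall = precision_recall(predicted, truth)
```

Here the first prediction has an IoU of 0.99 and counts as a true positive, the second has an IoU of 0.80 and does not, which shows how strict the 95% threshold is.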

To compare the chemical patent databases, the annotated document and molecule recall are computed on D2C-RND and D2C-UNI. For each benchmark, which consists of pairs of molecules and associated patents, the annotated document recall is the percentage of retrieved patents with at least one chemical annotation attached. On the other hand, molecule recall is defined as the percentage of molecules that are retrieved and associated with the correct reference patent.
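The two recall metrics can be sketched in Python as follows. Several choices here are assumptions rather than the authors' procedure: the database is modeled as a mapping from patent identifier to a set of InChIKeys, the document recall is computed over distinct benchmark patents, and stereochemistry is ignored by comparing only the first (connectivity) block of the InChIKey. The patent identifiers are placeholders.

```python
def skeleton(inchikey):
    """First InChIKey block, which does not encode stereochemistry."""
    return inchikey.split("-")[0]


def annotated_document_recall(benchmark, database):
    """Share of benchmark patents that carry at least one chemical annotation."""
    patents = {patent for _, patent in benchmark}
    annotated = {patent for patent in patents if database.get(patent)}
    return len(annotated) / len(patents)


def molecule_recall(benchmark, database):
    """Share of (molecule, patent) pairs whose molecule is stored under that patent."""
    hits = 0
    for inchikey, patent in benchmark:
        stored = {skeleton(key) for key in database.get(patent, ())}
        hits += skeleton(inchikey) in stored
    return hits / len(benchmark)


# Toy benchmark: L-alanine in one patent, glycine in two others (placeholder patent ids).
benchmark = [
    ("QNAYBMKLOCPYGJ-REOHAHJXSA-N", "US-1"),
    ("DHMQDGOQFOQNFH-UHFFFAOYSA-N", "US-2"),
    ("DHMQDGOQFOQNFH-UHFFFAOYSA-N", "EP-3"),
]
# Toy database: stereo-free alanine stored under US-1 and US-2; EP-3 has no annotation.
database = {
    "US-1": {"QNAYBMKLOCPYGJ-UHFFFAOYSA-N"},
    "US-2": {"QNAYBMKLOCPYGJ-UHFFFAOYSA-N"},
}
doc_recall = annotated_document_recall(benchmark, database)
mol_recall = molecule_recall(benchmark, database)
```

In this example two of the three patents are annotated, but only one pair is a molecule hit: the alanine entry matches despite the missing stereochemistry, while glycine is not stored under either of its patents.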

Databases querying

This section describes the querying process for each chemical patent database: SciFinder, Reaxys, Google Patents, SureChEMBL, PatCID, and finally PubChem. The assessment is based on data available in January 2024.

For SciFinder, substances are manually searched in batches of 25 using the advanced search fields ‘InChIKey’ and ‘Patent identifier’. For each batch, SciFinder retrieves a list of molecules matching any of the query InChIKeys and referenced in any of the query patent identifiers. This batching may induce false positives to the advantage of SciFinder, but it also considerably accelerates the querying process. In SciFinder, ions are stored together with their counterion as one unique compound. Consequently, charged molecules cannot be matched using their InChIKey. Queries of charged molecules are instead done using the SciFinder molecule editor, which allows the import of SMILES strings and, ultimately, correctly retrieves charged molecules. SciFinder does not distinguish between molecules extracted from text and from images, and the matched patent documents can be any patent from the patent family of the query. For Korean patent documents, the patent identifier is the application number, while for other patent offices, it is the publication number. For each office, to obtain the year of publication of the oldest annotated patent reported in Table 1, we manually search CAS PatentPak for the oldest available patent that has substances attached.
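The following Python sketch illustrates why batched querying can produce false positives. It is a simplified model of the behavior described above, not SciFinder's interface: `records` stands for a hypothetical set of (InChIKey, patent) pairs held by the database, and the keys and patent identifiers are placeholders.

```python
def batches(items, size=25):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def hits_per_pair(pairs, records):
    """Strict matching: the molecule must be referenced in its own patent."""
    return {pair for pair in pairs if pair in records}


def hits_per_batch(pairs, records, size=25):
    """Batched matching: any query key found in any query patent of the batch counts."""
    hits = set()
    for batch in batches(pairs, size):
        keys = {key for key, _ in batch}
        patents = {patent for _, patent in batch}
        returned = {key for key, patent in records if key in keys and patent in patents}
        hits.update(pair for pair in batch if pair[0] in returned)
    return hits


# Placeholder data: KEY-A is queried for patent P1 but is only stored under P2.
pairs = [("KEY-A", "P1"), ("KEY-B", "P2")]
records = {("KEY-A", "P2"), ("KEY-B", "P2")}
strict = hits_per_pair(pairs, records)
batched = hits_per_batch(pairs, records)
```

With strict matching only `("KEY-B", "P2")` is a hit, whereas the batched query also counts `("KEY-A", "P1")`, because KEY-A is found in another patent of the same batch.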

For Reaxys, patent documents are manually searched using the ‘Query Builder’ capabilities. Retrieved patent documents can be downloaded together with their attached annotated molecules. Manual and automatic annotations are searched separately in batches of 50 samples. For a fair evaluation, manual annotations are searched using the search field ‘Common patent number’, which matches any patent from the patent family of the query. We consider that manually annotating any patent in a family indirectly annotates all of them, since they are linked. For automatic annotations, patent documents are searched using their ‘Patent number’, which only matches the exact query document. In this case, annotations are not shared with all documents in the patent family because the quality of automatic annotations depends on the display style of individual patents. Annotated MOL files are obtained from XML files downloaded from Reaxys. Molecules stored as IUPAC⁴⁶ names are disregarded. Discriminating between the manual and automatic annotations in Reaxys allows us to compare the PatCID and Reaxys automatic ingestion pipelines.

Annotations from Google Patents are retrieved using the BigQuery dataset ‘Google Patents Research Data’⁴⁷. The database is queried using the patent publication numbers to get lists of annotated SMILES with sources ‘text’, ‘mol’, ‘image’, or ‘pdf’ (treated as image). SMILES containing ‘*’ are disregarded, as they define Markush structures and not molecules. In each office, to get the year of publication of the first annotated patent reported in Table 1, patent publication numbers were ordered alphabetically.

For SureChEMBL, the publicly available bulk download⁴⁸ of the database is used. For each patent, a list of SMILES is obtained with their sources: ‘text’, ‘mol’, or ‘image’.

Patent identifier formats are adapted to match each database standard. For Reaxys, Google Patents, and SureChEMBL, we split salts into individual ions to match annotations in our benchmarks. Given that SureChEMBL and Google Patents also rely on MOL files of images directly provided by the USPTO, for a fair comparison, molecules stored in SureChEMBL and Google Patents with a source field ‘mol’ are counted as images.
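The normalization steps described for Google Patents and SureChEMBL can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: annotations are modeled as hypothetical `(smiles, source)` tuples, and salts are split on the SMILES component separator `.`, which assumes that no ring-closure bond spans two components.

```python
# Sources counted as image-derived: 'pdf' (Google Patents) and 'mol' (USPTO MOL files).
IMAGE_SOURCES = {"image", "pdf", "mol"}


def normalize(annotations):
    """Return a set of (fragment_smiles, source_group) pairs.

    - SMILES containing '*' are dropped, since they describe Markush structures.
    - 'mol' and 'pdf' sources are grouped with 'image'; everything else is 'text'.
    - Salts are split into their individual ions or components.
    """
    normalized = set()
    for smiles, source in annotations:
        if "*" in smiles:
            continue
        group = "image" if source in IMAGE_SOURCES else "text"
        for fragment in smiles.split("."):
            if fragment:
                normalized.add((fragment, group))
    return normalized


# Toy annotations for one patent.
annotations = [
    ("[Na+].[Cl-]", "mol"),
    ("*c1ccccc1", "image"),
    ("CCO", "text"),
    ("CCO", "pdf"),
]
result = normalize(annotations)
```

After normalization, sodium chloride contributes two image-derived ions, the Markush structure is dropped, and ethanol appears once per source group.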

For PubChem, the bulk download of SMILES⁴⁹ in the database is used. To compute the overlap between PubChem and PatCID efficiently, both databases are partitioned into batches, where each SMILES is assigned to a batch according to its numbers of carbon, nitrogen, and oxygen atoms. Then, we compute the percentage of SMILES from PatCID found in PubChem, checking for equality of canonical SMILES strings. We ensure this comparison is valid by computing PubChem and PatCID canonical SMILES using the same RDKit algorithm.
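The partitioning idea can be sketched in Python as follows. Two simplifications are assumed: the input SMILES are already canonicalized with the same algorithm, and atoms are counted with a regular expression instead of RDKit, which covers common SMILES but is not a full parser. The function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Bracket atoms such as [Na+] or [nH], then two-letter halogens, then single-letter atoms.
ATOM = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]|(Cl|Br|[BCNOPSFI]|[bcnops])")


def cno_key(smiles):
    """Partition key: numbers of carbon, nitrogen, and oxygen atoms."""
    symbols = Counter((a or b).capitalize() for a, b in ATOM.findall(smiles))
    return symbols["C"], symbols["N"], symbols["O"]


def partition(smiles_list):
    """Group unique SMILES by their (C, N, O) counts."""
    buckets = defaultdict(set)
    for smiles in smiles_list:
        buckets[cno_key(smiles)].add(smiles)
    return buckets


def overlap_percentage(patcid, pubchem):
    """Percentage of unique PatCID SMILES also present in PubChem, bucket by bucket."""
    ours, theirs = partition(patcid), partition(pubchem)
    found = sum(len(bucket & theirs.get(key, set())) for key, bucket in ours.items())
    total = sum(len(bucket) for bucket in ours.values())
    return 100.0 * found / total


# Toy lists, assumed to be canonical SMILES already.
overlap = overlap_percentage(
    ["CCO", "c1ccccc1", "CC(N)=O"],
    ["CCO", "CC(N)=O", "CCN"],
)
```

Because two identical canonical SMILES necessarily share the same atom counts, each bucket can be compared independently, which keeps memory bounded and lets the buckets be processed in parallel.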

Finally, PatCID stores molecules obtained by running MolGrapher on all segmented chemical images, ignoring the predictions from the classification module. This maximizes the recall of the pipeline, since images rejected by MolClassifier are still annotated by MolGrapher.

Computational considerations

Each processing step, i.e., the segmentation, classification, and recognition modules, is containerized. The OpenShift Container Platform is then used to allocate resources at scale. The ingestion is carried out on CPU-only nodes with AMD EPYC 7513 32-core processors at 2600 MHz and 528 GB of RAM. On each CPU node, 32 pods are instantiated, each running 4 threads. Using one pod of 4 threads, the average segmentation speed is 8.0 s per page, and the molecule recognition speed is 12.8 s per image. These speeds are measured on the USPTO document collection. Optimal resource allocation is achieved by minimizing the number of threads and adhering to memory constraints.
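These figures allow a rough back-of-the-envelope estimate of the compute involved, sketched below in Python. The estimate rests on assumptions not stated in the text: pods scale linearly and independently, the speeds measured on the USPTO collection hold for all offices, and the 81M chemical-structure images reported in the abstract are used as the recognition workload.

```python
PODS_PER_NODE = 32
SEGMENTATION_S_PER_PAGE = 8.0    # per pod of 4 threads
RECOGNITION_S_PER_IMAGE = 12.8   # per pod of 4 threads
SECONDS_PER_DAY = 86_400


def node_throughput(seconds_per_item, pods=PODS_PER_NODE):
    """Items processed per second on one node, assuming independent pods."""
    return pods / seconds_per_item


def node_days(n_items, seconds_per_item, pods=PODS_PER_NODE):
    """Days of single-node time needed to process n_items."""
    return n_items * seconds_per_item / pods / SECONDS_PER_DAY


pages_per_second = node_throughput(SEGMENTATION_S_PER_PAGE)    # about 4 pages/s per node
images_per_second = node_throughput(RECOGNITION_S_PER_IMAGE)   # about 2.5 images/s per node
recognition_node_days = node_days(81_000_000, RECOGNITION_S_PER_IMAGE)  # about 375 node-days
```

Under these assumptions, recognizing 81M images corresponds to roughly 375 node-days, which explains why the workload is distributed over many nodes.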

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The PatCID dataset is available on Zenodo⁵⁰. The benchmark datasets are available on Zenodo⁵¹. The training datasets are available: the image classification dataset can be downloaded from Zenodo⁵²; the molecule recognition model training dataset can be downloaded from Hugging Face⁵³. The model weights used in this study are available: the document segmentation model, DECIMER-Segmentation⁵⁴; the image classification model, MolClassifier⁵⁵; the molecule recognition model, MolGrapher⁵⁶. To help visualize the PatCID dataset for test purposes, readers can obtain access to a user interface currently deployed on IBM’s systems by contacting IBM’s Deep Search team at deepsearch-core@zurich.ibm.com. Source data are provided with this paper.

Code availability

Examples showing how to use the PatCID dataset to retrieve molecular structures or patent documents are available on GitHub¹⁴ and Zenodo⁵⁷. The code for the document segmentation model, DECIMER-Segmentation, is available on GitHub⁵⁸. The code for the image classification model, MolClassifier, is available on GitHub⁵⁵ and Zenodo⁵⁹. The code for the molecule recognition model, MolGrapher, is available on GitHub⁵⁶ and Zenodo⁶⁰. The code for the molecular graph annotation tool is available on GitHub⁶¹ and Zenodo⁶².

References

1. Ohms, J. Current methodologies for chemical compound searching in patents: a case study. World Patent Inf. 66, 102055 (2021).
2. Bregonje, M. Patents: A unique source for scientific technical information in chemistry related industry? World Patent Inf. 27, 309–315 (2005).
3. Southan, C., Varkonyi, P., Boppana, K., Jagarlapudi, S. A. & Muresan, S. Tracking 20 years of compound-to-target output from literature and patents. PLoS ONE 8, 1–13 (2013).
4. Magariños, M. P. et al. Illuminating the druggable genome through patent bioactivity data. PeerJ 11, e15153 (2023).
5. Lawson, A. J., Swienty-Busch, J., Géoui, T. & Evans, D. The Making of Reaxys—Towards Unobstructed Access to Relevant Chemistry Information Ch. 8, 127–148 (American Chemical Society, 2014).
6. Gabrielson, S. W. SciFinder. J. Med. Libr. Assoc. 106, 588 (2018).
7. Papadatos, G. et al. SureChEMBL: a large-scale, chemically annotated patent document database. Nucleic Acids Res. 44, D1220–D1228 (2015).
8. Eiblmaier, J., Mazenc, C., Geppert, D., Isenko, L. & Saller, H. Addition of chemical search capabilities to PATENTSCOPE: turning a full-text search system into a chemistry database. In Abstracts of Papers of the American Chemical Society, Vol. 253 (American Chemical Society, 2017).
9. Lelescu, A. et al. The Strategic IP Insight Platform (SIIP): a foundation for discovery. In 2014 Annual SRII Global Conference (eds Singh, K. et al.) 27–34 (IEEE, 2014).
10. Rajan, K., Brinkhaus, H. O., Agea, M. I., Zielesny, A. & Steinbeck, C. DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications. Nat. Commun. 14, 5045 (2023).
11. Senger, S., Bartek, L., Papadatos, G. & Gaulton, A. Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents. J. Cheminform. 7, 49 (2015).
12. Ohms, J. Validity of PubChem compounds supplied by Patentscope or SureChEMBL. World Patent Inf. 70, 102134 (2022).
13. Park, M., Leahey, E. & Funk, R. J. Papers and patents are becoming less disruptive over time. Nature 613, 138–144 (2023).
14. Morin, L., Weber, V., Meijer, G. I., Yu, F. & Staar, P. W. J. PatCID GitHub https://github.com/DS4SD/PatCID (2024).
15. Gadiya, Y., Shetty, S., Hofmann-Apitius, M., Gribbon, P. & Zaliani, A. Exploring SureChEMBL from a drug discovery perspective. Scientific Data 11, 507 (2024).
16. United States Patent and Trademark Office (accessed January 2024) http://uspto.gov.
17. European Patent Office (accessed January 2024) https://www.epo.org.
18. Japan Patent Office (accessed January 2024) https://www.jpo.go.jp.
19. Korea Intellectual Property Office (accessed January 2024) https://www.kipo.go.kr.
20. China National Intellectual Property Administration (accessed January 2024) https://www.cnipa.gov.cn.
21. LexisNexis TotalPatent One (accessed January 2024) https://www.totalpatentone.com.
22. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
23. Rajan, K., Brinkhaus, H. O., Sorokina, M., Zielesny, A. & Steinbeck, C. DECIMER-Segmentation: automated extraction of chemical structure depictions from scientific literature. J. Cheminform. 13, 20 (2021).
24. Morin, L. et al. MolGrapher: graph-based visual recognition of chemical structures. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) (eds Kosecka, J. et al.) 19552–19561 (IEEE, 2023).
25. Ebe, T., Sanderson, K. A. & Wilson, P. S. The Chemical Abstracts Service generic chemical (Markush) structure storage and retrieval capability. 2. The MARPAT file. J. Chem. Inf. Comput. Sci. 31, 31–36 (1991).
26. Dalby, A. et al. Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. J. Chem. Inf. Comput. Sci. 32, 244–255 (1992).
27. Heller, S. R., McNaught, A., Pletnev, I., Stein, S. & Tchekhovskoi, D. InChI, the IUPAC International Chemical Identifier. J. Cheminform. 7, 23 (2015).
28. Zhou, C., Liu, W., Song, X., Yang, M. & Peng, X. YoDe-Segmentation: automated noise-free retrieval of molecular structures from scientific publications. J. Cheminform. 15, 111 (2023).
29. Huang, Y., Lv, T., Cui, L., Lu, Y. & Wei, F. LayoutLMv3: pre-training for document AI with unified text and image masking. In Proc. 30th ACM International Conference on Multimedia (eds Magalhães, J. et al.) 4083–4091 (Association for Computing Machinery, 2022).
30. Xu, Y. et al. MolMiner: you only look once for chemical structure recognition. J. Chem. Inf. Model. 62, 5321–5328 (2022).
31. Ibison, P. et al. Chemical literature data extraction: the CLiDE project. J. Chem. Inf. Comput. Sci. 33, 338–344 (1993).
32. Xiong, J. et al. αExtractor: a system for automatic extraction of chemical information from biomedical literature. Sci. China Life Sci. 67, 618–621 (2023).
33. Hattori, K., Wakabayashi, H. & Tamaki, K. Predicting key example compounds in competitors’ patent applications using structural information alone. J. Chem. Inf. Model. 48, 135–142 (2008).
34. Tyrchan, C., Boström, J., Giordanetto, F., Winter, J. & Muresan, S. Exploiting structural information in patent specifications for key compound prediction. J. Chem. Inf. Model. 52, 1480–1489 (2012).
35. Akhondi, S. A. et al. Automatic identification of relevant chemical compounds from patents. Database (Oxford) 2019, baz001 (2019).
36. Kim, S. et al. PubChem 2023 update. Nucleic Acids Res. 51, D1373–D1380 (2022).
37. Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A. & Smit, B. Leveraging large language models for predictive chemistry. Nat. Mach. Intell. 6, 161–169 (2024).
38. Shimizu, Y. et al. AI-driven molecular generation of not-patented pharmaceutical compounds using world open patent data. J. Cheminform. 15, 120 (2023).
39. Subramanian, A., P. Greenman, K., Gervaix, A., Yang, T. & Gómez-Bombarelli, R. Automated patent extraction powers generative modeling in focused chemical spaces. Digit. Discov. 2, 1006–1015 (2023).
40. He, K., Gkioxari, G., Dollár, P. & Girshick, R. B. Mask R-CNN. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) (eds Ikeuchi, K. et al.) 2961–2969 (IEEE, 2017).
41. Wang, J. et al. Multi-modal chemical information reconstruction from images and texts for exploring the near-drug space. Brief. Bioinform. 23, bbac461 (2022).
42. Jurriaans, T. et al. One strike, you’re out: detecting Markush structures in low signal-to-noise ratio images. Preprint at arXiv https://arxiv.org/abs/2311.14633 (2023).
43. Landrum, G. et al. RDKit: Open-Source Cheminformatics Software http://www.rdkit.org/ (2006).
44. Tkachenko, M., Malyuk, M., Holmanyuk, A. & Liubimov, N. Label Studio: Data Labeling Software https://github.com/heartexlabs/label-studio (2020–2022).
45. EPAM. Ketcher https://github.com/epam/ketcher/ (2020).
46. Favre, H. A. & Powell, W. H. Nomenclature of Organic Chemistry (The Royal Society of Chemistry, 2013).
47. Google Patents Big Query (accessed January 2024) https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1spatents-public-data!2sgoogle_patents_research!3sannotations.
48. Papadatos, G. et al. SureChEMBL Bulk Download (accessed January 2024) https://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBL/data/map/.
49. Kim, S. et al. PubChem Bulk Download (accessed January 2024) https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/.
50. Morin, L., Weber, V., Meijer, I., Yu, F. & Staar, P. PatCID: An Open-access Database of Chemical Structures in Patent Documents https://doi.org/10.5281/zenodo.10572870 (2024).
51. Morin, L., Weber, V., Meijer, I., Yu, F. & Staar, P. Document to Chemical Structure Benchmarks https://doi.org/10.5281/zenodo.10978812 (2024).
52. Morin, L., Weber, V., Meijer, I., Yu, F. & Staar, P. MolClassifier Training and Validation Datasets https://doi.org/10.5281/zenodo.10978564 (2024).
53. Morin, L. et al. Molgrapher-synthetic-300k https://doi.org/10.57967/hf/2701 (2024).
54. Rajan, K., Brinkhaus, H. O., Sorokina, M., Zielesny, A. & Steinbeck, C. DECIMER-Segmentation: Automated Extraction of Chemical Structure Depictions from Scientific Literature. DECIMER-Segmentation-1.2.0 https://doi.org/10.5281/zenodo.7228582 (2024).
55. Morin, L., Weber, V., Meijer, G. I., Yu, F. & Staar, P. W. J. MolClassifier GitHub: Code, Model and Data https://github.com/DS4SD/MolClassifier (2024).
56. Morin, L. et al. MolGrapher GitHub: Code, Model and Data https://github.com/DS4SD/MolGrapher (2023).
57. Morin, L., Weber, V., Meijer, G. I., Yu, F. & Staar, P. W. J. PatCID Code: PatCID-1.0.0 https://doi.org/10.5281/zenodo.12687745 (2024).
58. Rajan, K., Brinkhaus, H. O., Sorokina, M., Zielesny, A. & Steinbeck, C. DECIMER-Segmentation GitHub https://github.com/Kohulan/DECIMER-Image-Segmentation (2022).
59. Morin, L., Weber, V., Meijer, G. I., Yu, F. & Staar, P. W. J. MolClassifier code: MolClassifier-1.0.0 https://doi.org/10.5281/zenodo.12687612 (2024).
60. Morin, L. et al. MolGrapher code: MolGrapher-1.0.0 https://doi.org/10.5281/zenodo.12687408 (2024).
61. Morin, L., Weber, V., Meijer, G. I., Yu, F. & Staar, P. W. J. MolAnnotator GitHub https://github.com/DS4SD/MolAnnotator (2024).
62. Morin, L., Weber, V., Meijer, G. I., Yu, F. & Staar, P. W. J. MolAnnotator code: MolAnnotator-1.0.0 https://doi.org/10.5281/zenodo.12687888 (2024).
63. Qian, Y. et al. MolScribe: robust molecular structure recognition with image-to-graph generation. J. Chem. Inf. Model. 63, 1925–1934 (2023).
64. Filippov, I. V. & Nicklaus, M. C. Optical structure recognition software to recover chemical information: OSRA, an open source solution. J. Chem. Inf. Model. 49, 740–743 (2009).

Acknowledgements

The authors thank Dr. Otto Brinkhaus, Dr. Kohulan Rajan, and Prof. Dr. Christoph Steinbeck for their fruitful interactions.

Author information

Authors and Affiliations

IBM Research, Säumerstrasse 4, 8803, Rüschlikon, Switzerland
Lucas Morin, Valéry Weber, Gerhard Ingmar Meijer & Peter W. J. Staar
Department of Information Technology and Electrical Engineering, ETH Zürich, Sternwartstrasse 7, 8092, Zürich, Switzerland
Lucas Morin & Fisher Yu

Contributions

L.M., G.I.M. and V.W. conceived the document ingestion pipeline and annotated benchmarks. L.M. performed the database evaluation. P.W.J.S. and F.Y. supervised the work. L.M. wrote the first draft. All authors revised and commented on the manuscript.

Corresponding authors

Correspondence to Lucas Morin or Peter W. J. Staar.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Morin, L., Weber, V., Meijer, G.I. et al. PatCID: an open-access dataset of chemical structures in patent documents. Nat Commun 15, 6532 (2024). https://doi.org/10.1038/s41467-024-50779-y

Received: 17 February 2024
Accepted: 19 July 2024
Published: 02 August 2024
DOI: https://doi.org/10.1038/s41467-024-50779-y
Springer Nature Limited
