Layout-Aware Semi-automatic Information Extraction for Pharmaceutical Documents
Pharmaceutical companies and regulatory authorities are also affected by the current digitalization process and are transforming their paper-based, document-oriented communication into a structured, digital information exchange. The documents exchanged so far contain a huge amount of information that needs to be transformed into a structured format to enable more efficient communication in the future. In such a setting, it is important that the information extracted from documents is very accurate, as the information is used in a legal, regulatory process and also for the identification of unknown adverse effects of medicinal products that might be a threat to patients’ health. In this paper, we present our layout-aware semi-automatic information extraction system LASIE, which combines techniques from rule-based information extraction, flexible data management, and semantic information management in a user-centered design. We applied the system in a case study with an industrial partner and achieved very satisfying results.
A significant amount of information in the domain of life science and healthcare is present only in unstructured documents. Discharge letters from hospitals contain important information about the disease and further treatment of a patient. Although many solutions and standards have been proposed for health data exchange, the discharge letter is still the main medium for communication between hospitals and practitioners in Germany. In the pharmaceutical industry, important information about products is described in company core data sheets (CCDS) or summaries of product characteristics (SmPC). These documents contain information about the usage of medicinal products, their risks and adverse effects, their ingredients, etc. The documents have to be maintained by the pharmaceutical company responsible for the manufacturing of the product and have to be provided to the various national authorities for licensing the product. They are also the basis for the package inserts that are provided to the users of the pharmaceutical product. Thus, the same documents have to be maintained in different languages.
Pharmaceutical companies now face the challenge of having to provide structured information about their products because of the upcoming ISO IDMP standard (Identification of Medicinal Products), which is currently also being implemented by the European Medicines Agency (EMA). The IDMP standard is essentially a huge data model in which information about medicinal products can be represented in a structured form.
Extracting this information from the existing documents poses several challenges:
- The content of the documents is multi-lingual, and text fragments in different languages cannot always be clearly separated.
- Relevant information is contained in text fragments with a specific layout, e.g., tables for adverse effects.
- Some documents, especially the manufacturing licenses, are only available as scanned paper documents; thus, OCR errors are likely to appear.
- Although the documents follow common guidelines or outlines, their structure may still be irregular to some degree. Furthermore, the system should be extensible to other document types containing new information items to be extracted.
- The extracted information needs to be consistent with all documents that have been submitted to the authorities.
- The information provided to EMA needs to have very high accuracy, as incorrect or incomplete information might have legal consequences.
Various approaches for information extraction from package inserts or similar documents have already been proposed [5, 8, 11, 15, 19, 20], but they usually rely on Natural Language Processing (NLP) techniques, as their focus is on extracting information from natural language text. However, these approaches ignore the fact that the documents have a highly regular structure and follow a certain layout to present information. Therefore, a layout-aware information extraction approach seems to be more promising in this context.
In addition, the terms used in these documents are often terms defined in a controlled vocabulary. For example, MedDRA® is a dictionary that is used by authorities and the pharmaceutical industry for adverse effects. It contains international medical terminology developed under the auspices of the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH). Terms extracted from the documents should be matched with the vocabulary terms.
Finally, the approach needs an integrated data quality management component, as high accuracy of the extracted information is required. The prototype system, which we developed for a pharmaceutical company, includes an interactive graphical user interface in which the extracted information can be easily verified.
In this paper, we present the overall architecture of our system LASIE (Layout-Aware Semi-automatic Information Extraction) that provides semi-automatic support, focus on the flexible and extensible rule-based extraction system, and discuss the design rationale in developing our prototype. It can process the above-mentioned documents (CCDS, Manufacturing Licenses), but it is not limited to these document types, as the information extraction can be easily adapted by modifying the extraction rules. For reasons of confidentiality, we cannot present the details of the datasets used during system development, but we have reconstructed some documents to present the main ideas of our approach.
The paper is structured as follows: the next section gives an overview of our approach, the system architecture, and some details on the datasets. Section 3 explains the main components of LASIE. The user interface for the integrated data quality management is presented in Sect. 4. Related work is discussed in Sect. 5, and Sect. 6 concludes the paper.
2 System Overview
The requirements stated in the introduction imply that the system needs to be extensible and flexible to handle various types of documents and to be able to extract several information items from a document. Therefore, flexibility and extensibility are inherent features of all components of our system architecture, which is shown in Fig. 2.
The workflow of the system according to Fig. 2 can be described as follows: The Data Management Service (DMS) lies at the heart of our system. All modules exchange information with the DMS, which relays the information as structured data objects into the database storage. The principal workflow consists of three steps: first, the uploaded input documents are parsed for their structure; then, domain-specific information is extracted with our rule engine; finally, the results are presented in the web-based user interface for interactive verification.
The next section will describe the components in more detail, but we will first describe the documents that we processed in the case study with our industrial partner.
2.1 Documents in Our Case Study
We evaluated our system using two sets of documents provided by our industry partner. The first document type comprises detailed medical leaflets from which adverse effects are to be extracted; these are mostly denoted using MedDRA® terms.
MedDRA® is hierarchically organized into five levels: The lowest level describes very specific medical terms as observed in practical use, while each higher level groups several lower-level terms into a more general description. An example of the MedDRA® hierarchy can be seen on their website. The highest level contains the system organ classes, which group disorders into different ‘functional’ areas of the human body, e.g., eye disorders, nervous system disorders.
Company Core Data Sheets (CCDS). The CCDS files are either stored in doc- or docx-file format and contain information about the contained substances, usage information as well as tables containing adverse effects among others. Our goal is to find all adverse effects and enrich the terms with further information like their MedDRA® ID, the system organ class they belong to as well as the frequency of their occurrence.
Manufacturing Licenses (ML). Although ML documents from different countries share similar information, including the license holder, place of issue and concerned product, the way the information is presented varies a lot. Some documents use tables to show the information, while others use regular text (such as the French document template shown in Fig. 1).
In addition, our dataset contained MLs issued in different European languages, sometimes two in one document, as mentioned in the introduction. The documents were often low-quality scans including image elements like logos, seals, and handwritten signatures that might obscure printed text. This results in subpar OCR output. Contained tables were of an implicit nature, i.e., they could not be directly recognized from the scanned PDF files.
Overall, we used 60 ML documents in almost ten different languages.
3 Main Components of LASIE
3.1 Document Parsing
The goal of document parsing is to understand a document’s layout and map it to a structured data model. Here, we map our semi-structured documents to individual document entities according to a previously defined data model. The extracted document elements are stored as data entities in the database described in Sect. 3.3.
Although there are several data models for documents (e.g., OOXML used by MS Word, ODF used by LibreOffice, or HTML for documents on the web), we apply a simpler model that focuses on the main elements of a document’s layout (e.g., section headings, tables). In addition to the basic elements of a document, our data model has a concept for bounding boxes. A bounding box is the virtual box that surrounds a word or another document fragment. It also provides the X-Y coordinates of the text element on the page.
This information is important for PDF documents that have been produced by scanning paper documents as such documents just have the information which words are on a page and where these words are located, but usually words are not grouped into paragraphs or tables in these documents. Another example are documents in which table-like structures have been created with tabs and spaces. In all these cases, it is important to know which words are next to, right of, or below other words.
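Such spatial relations between words can be expressed directly on the bounding boxes. The following is a minimal sketch of such a model; the class, field, and method names are our own illustration, not the system's actual data model:

```java
// Sketch of a bounding box with spatial predicates, as used to relate words
// on a scanned page. Coordinates assume an origin at the top-left corner.
public class BBox {
    final double x, y, w, h; // top-left corner, width, height (page coordinates)

    public BBox(double x, double y, double w, double h) {
        this.x = x; this.y = y; this.w = w; this.h = h;
    }

    // True if this box lies to the right of the other box on roughly the same line.
    public boolean isRightOf(BBox other) {
        return x >= other.x + other.w && verticalOverlap(other) > 0;
    }

    // True if this box starts below the other box (e.g., the next table row).
    public boolean isBelow(BBox other) {
        return y >= other.y + other.h;
    }

    private double verticalOverlap(BBox other) {
        return Math.min(y + h, other.y + other.h) - Math.max(y, other.y);
    }
}
```

With predicates like these, a rule can group words into lines and table cells even when the PDF only records word positions.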
The main elements of our document data model are shown in Fig. 4. We formalized the data model as an XML schema, but we use it only as a guideline for the JSON representation.
Company Core Data Sheets. Since MS Word documents follow a structure very similar to our XML schema, document object extraction is very straightforward. Most objects can be derived directly from the document format. Doc/docx documents are rendered on view, i.e., the files do not contain any relationships between text elements and pages. Furthermore, not all details, such as the section headers’ numbering, can be derived directly from the file; a section’s numbering can be determined through a simple counting loop, though. Since our data model requires the page numbers to match found terms to the containing page, we create PDF versions of our doc/docx files. We also use them to determine the exact word positions and bounding boxes on the pages for presentation of the results.
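The counting loop for section numbers can be sketched as follows; the input encoding (a sequence of heading levels) is our assumption for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the "simple counting loop" for deriving section
// numbers from a sequence of heading levels (1 = top-level, 2 = subsection, ...).
public class SectionNumbering {
    public static List<String> number(int[] levels) {
        List<String> result = new ArrayList<>();
        int[] counters = new int[16]; // one counter per nesting level
        for (int level : levels) {
            counters[level - 1]++;
            // A new heading resets the counters of all deeper levels.
            for (int i = level; i < counters.length; i++) counters[i] = 0;
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < level; i++) {
                if (i > 0) sb.append('.');
                sb.append(counters[i]);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

For the heading levels 1, 2, 2, 1, 2 this yields the numbers 1, 1.1, 1.2, 2, 2.1.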
Manufacturing Licenses. All MLs were scanned and OCR-processed documents of varying quality, including printed and handwritten text, stamps, and logos. The OCR processing was done automatically with commercial software before the documents were fed into our system. As already mentioned, some of them contained two languages alternating with each other, which the software did not always recognize correctly. As the aim of this project was to extract information as is, we did not work on improving the OCR quality. From other projects we know that current commercial tools provide a quality that is difficult to improve upon in general.
In contrast to Word files, we start with the extraction of words and their bounding boxes. In the next step, words are concatenated into lines and assigned to pages. Due to the complex and variable layout of our PDF files, the recognition of tables and table-like structures is handled by the rule engine described in Sect. 3.2.
3.2 Rule Engine
The rule engine of LASIE is based on the physical structure properties of the documents, where the layout properties are used for extracting information items. One key aspect of keeping the system modular and flexible is to separate the rules from the rest of the rule-based framework. The rules for extraction are expressed in a form of declarative programming, i.e., the task and the output of the process are defined rather than the control flow. We use DROOLS as rule engine. DROOLS provides a good object-oriented interface in Java (which we used as the main programming language in our system) with easy access to elements and properties of rules and facts in the rule engine. One very important feature is the ability to load rules at run-time; thus, changing the extraction rules does not require a recompilation of the software. This was an important requirement of our industrial partner.
Matching terms against the dictionary is needed in two cases:
- Mapping from text to semantics: here, a set of words that needs to be mapped is given, e.g., the terms vomiting, diarrhea, headache.
- Identifying dictionary terms in a paragraph: here, the goal is to find the longest possible concatenation of tokens from the paragraph that still matches an entry in the dictionary.
In the first case, the words are separated by commas and mapped to terms in the dictionary. If a word is not found in the dictionary, a string similarity search is used to propose suggestions to the user. An overview of different approaches for string similarity is given in [10]. There, two categories of algorithms are distinguished: character-based and token-based. We used one algorithm from each category, the Jaro-Winkler algorithm for the first and the Jaccard similarity for the second, as these algorithms performed best in our cases. Afterwards, a similarity index combining both algorithms is calculated. If this value is above a certain threshold, we assume a possible match.
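Both measures can be implemented compactly. The sketch below shows textbook versions of the two algorithms named above; note that the way the two scores are combined and the threshold value are not specified in the paper, so they remain assumptions:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Textbook implementations of the two string-similarity measures used for
// dictionary matching: Jaro-Winkler (character-based) and Jaccard (token-based).
public class Similarity {

    // Character-based measure: Jaro similarity.
    public static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;
        int matchDist = Math.max(len1, len2) / 2 - 1;
        boolean[] m1 = new boolean[len1], m2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - matchDist), hi = Math.min(len2 - 1, i + matchDist);
            for (int j = lo; j <= hi; j++) {
                if (!m2[j] && s1.charAt(i) == s2.charAt(j)) {
                    m1[i] = true; m2[j] = true; matches++; break;
                }
            }
        }
        if (matches == 0) return 0.0;
        int transpositions = 0, k = 0;
        for (int i = 0; i < len1; i++) {
            if (!m1[i]) continue;
            while (!m2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) transpositions++;
            k++;
        }
        double m = matches;
        return (m / len1 + m / len2 + (m - transpositions / 2.0) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for a common prefix (up to 4 chars).
    public static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int prefix = 0;
        for (int i = 0; i < Math.min(4, Math.min(s1.length(), s2.length())); i++) {
            if (s1.charAt(i) == s2.charAt(i)) prefix++; else break;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }

    // Token-based measure: Jaccard similarity on word sets.
    public static double jaccard(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        ta.retainAll(tb); // ta now holds the intersection
        return union.isEmpty() ? 0.0 : (double) ta.size() / union.size();
    }
}
```

For example, jaroWinkler("MARTHA", "MARHTA") is about 0.96, well above a plausible match threshold, while jaccard("abdominal pain", "abdominal pain upper") is 2/3.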
String similarity is also applied in the second case, where the longest possible concatenation matching the dictionary has to be found. There, in case of a high string similarity, the terms are added to the final result.
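A minimal exact-match sketch of this longest-match lookup is shown below; the actual system would replace the exact `contains()` test with the similarity comparison described above, and the method names are our own:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Longest-match dictionary lookup: starting at each token, extend the
// candidate phrase and remember the longest concatenation for which the
// dictionary still contains an entry.
public class LongestMatch {
    public static List<String> findTerms(String[] tokens, Set<String> dictionary) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            String best = null;
            int bestLen = 0;
            StringBuilder phrase = new StringBuilder();
            for (int j = i; j < tokens.length; j++) {
                if (phrase.length() > 0) phrase.append(' ');
                phrase.append(tokens[j]);
                if (dictionary.contains(phrase.toString())) {
                    best = phrase.toString();
                    bestLen = j - i + 1;
                }
            }
            if (best != null) {
                found.add(best);
                i += bestLen; // skip past the matched phrase
            } else {
                i++;
            }
        }
        return found;
    }
}
```

Given the tokens "severe abdominal pain upper and nausea" and a dictionary containing both "abdominal pain" and "abdominal pain upper", the longer term wins.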
The rules for the rule engine are defined in separate files, which can be grouped into preprocessing, extraction, and output rules. When defining the rules, it is important that the order in which the rules are applied does not affect the result, due to the pattern matching principles usually employed by rule engines. Moreover, the rules have to respect the hierarchy of elements in the document; i.e., we created constructive rules that do not break the current hierarchy but only improve or enrich it.
Processing a document with the rule engine comprises the following steps:
1. Prepare a session: the rule engine finds all rule files for a given session name (e.g., the type of documents to be processed).
2. Load the rule files.
3. Compile the rules; this is performed by the DROOLS framework.
4. Initialize the working memory of the session with the data, i.e., the document structure created by the document parser is inserted as facts into the rule engine.
5. Fire the rules. This process is based on the conditions of each rule. Rules can be fired multiple times, and the process continues until no rule can be fired anymore.
In the last step, the result from the working memory is obtained (i.e., the derived facts or extracted information) and transformed into a data object for the data management service. Here, it is important to note that the structure of the resulting data object cannot be defined in general beforehand. This depends on the type of information to be extracted. As the rules can be modified by the users of the system, they can also change the structure of the resulting data object. Thus, the schema of the data object needs to be flexible, which we will discuss in the next subsection.
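Independent of DROOLS, the fire-until-no-rule-applies behavior amounts to a fixed-point loop over the working memory. The following hand-rolled illustration (with hypothetical string-encoded facts) shows the idea; DROOLS implements the same principle far more efficiently via the Rete algorithm and proper condition/action rules:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Minimal forward-chaining loop: each rule derives new facts from the current
// working memory, and firing continues until no rule adds anything new.
public class FixpointDemo {
    public interface Rule extends Function<Set<String>, Set<String>> {}

    public static Set<String> fireAll(Set<String> facts, Iterable<Rule> rules) {
        Set<String> memory = new HashSet<>(facts);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Rule rule : rules) {
                // addAll returns true iff the rule derived at least one new fact.
                if (memory.addAll(rule.apply(memory))) changed = true;
            }
        }
        return memory;
    }
}
```

A chain of two rules (heading recognized, then section-specific extraction triggered) reaches its fixed point after two iterations.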
For interpreting the results of the rules, a candidate score approach is used: each piece of information extracted by the rules is considered a candidate answer, which receives a value between 0 and 100 based on its properties or the rule conditions. The final result is then a collection of several candidate items with corresponding scores, which is used in the user interface to present the results.
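Selecting among scored candidates then reduces to a simple threshold-and-maximum step; the class name and threshold below are illustrative assumptions, not the system's actual values:

```java
import java.util.List;

// Candidate scoring sketch: every extraction rule emits candidates with a
// score in [0, 100]; the UI can show all of them or pre-select the best one.
public class Candidates {
    public static class Candidate {
        public final String value;
        public final int score; // 0..100, derived from rule conditions
        public Candidate(String value, int score) { this.value = value; this.score = score; }
    }

    // Returns the highest-scoring candidate at or above the threshold, or null.
    public static Candidate best(List<Candidate> candidates, int threshold) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (c.score >= threshold && (best == null || c.score > best.score)) best = c;
        }
        return best;
    }
}
```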
3.3 Data Management Service
Our system consists of separate independent modules that exchange data via the central data management service (DMS). As stated above, the system should be extensible and adaptable for other use cases and not limited to the specific document types considered initially in the project.
Therefore, the aim was to provide an easy-to-use, extensible common data structure that is able to represent different types of data objects. For example, the data structure needs to be able to represent the document structure as described above by the XML schema as well as the result objects of information extraction. In the case of CCDS documents, this is a list of adverse effects with links to their corresponding representation in the MedDRA dictionary.
One possible approach could have been to define an XML schema (or JSON data structure) for each of the relevant data objects. However, as discussed above, this is not possible because the rules for information extraction can be changed by the user, and the structure of the extracted data can vary between different rule sets. Therefore, we decided to design a generic structure for data objects that can represent any kind of data model. A data object consists of the following components:
- ID: the ID of the object,
- type: a string denoting the type of the object,
- version: a version number,
- relationships: a map that relates a relationship name with a list of related objects, and
- properties: a map that relates a property name with a list of values.
The DMS provides storage and retrieval mechanisms for data objects. Each module only uses the interface of the DMS for data access. In our prototype system, we decided to use MongoDB as the underlying storage system, but the generic data structure could be also mapped to other data management systems.
The data objects in our model are immutable; that means each change to an object creates a new version of that object. Thus, the database might contain objects with the same ID but with different versions. One reason for this is to enable traceability of all changes that have been applied to data objects. This is especially important for the interaction in the user interface; it should be possible to trace the changes and to see who is responsible for which change.
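The generic data object and its copy-on-write versioning can be sketched as follows; field and method names are our own illustration of the components listed above:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the generic data object: ID, type, version, plus open-ended maps
// for properties and relationships. Objects are immutable; "changing" a
// property yields a new object with an incremented version, so the store
// retains the full change history.
public class DataObject {
    public final String id;
    public final String type;
    public final int version;
    public final Map<String, List<String>> properties;    // property name -> values
    public final Map<String, List<String>> relationships; // relationship name -> related object IDs

    public DataObject(String id, String type, int version,
                      Map<String, List<String>> properties,
                      Map<String, List<String>> relationships) {
        this.id = id; this.type = type; this.version = version;
        this.properties = Map.copyOf(properties);
        this.relationships = Map.copyOf(relationships);
    }

    // Returns a new version of this object with one property replaced.
    public DataObject withProperty(String name, List<String> values) {
        Map<String, List<String>> props = new HashMap<>(properties);
        props.put(name, new ArrayList<>(values));
        return new DataObject(id, type, version + 1, props, relationships);
    }
}
```

Because every edit produces a fresh object under the same ID, queries can retrieve either the latest version or the full history of an information item.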
To access the data, the DMS provides several query functions, e.g., retrieving an object with a specific ID, retrieving all objects of a certain type, or retrieving objects with certain properties or relationships. These query functions are wrapped as REST services such that the web-based user interface can easily access the data that has been generated by the document parser and the rule engine.
4 Web-Based User Interface
On the left-hand side of the UI, the original document is displayed, with the adverse effects that were found framed in different colors indicating their annotation status. The annotations are shown on the right side and can be changed by the user. Both parts of the UI are linked, i.e., when the user clicks on a frame on the left side, the corresponding extracted information item is shown on the right, and vice versa. This enables easy verification of the extracted information. Once a document has been checked by two experts and marked as ‘accepted’, its terms are permanently accepted into the database.
5 Related Work
Work very similar to our overall approach of layout-aware information extraction has been presented in [1]. The described system automatically determines posology, side effects, and indications from Portuguese medical leaflets by NLP methods. The design is based on six steps: text preprocessing, a document reader, a general natural language processing module, named entity recognition, relation extraction, and an information consumer step.
However, a well-working automatic classification tool usually requires a huge amount of pre-labeled training data. In the case of low-quality data, the accuracy of such systems usually decreases drastically. Since in our application the accuracy and understandability of the extraction rules are key points of the system, we focused on a rule-based system with user interaction to enable very high accuracy independent of the quality or availability of training data.
In the following, we will briefly discuss related work to the different components of our system.
Information Extraction. Information extraction (IE) has been extensively studied since the late 1980s. It describes the process of automatically retrieving structured information from machine-readable, semi-structured or unstructured documents. Extracted structures include (named) entities, relationships between entities, layout entities of documents, and ontologies.
Typically, the extraction of information is done in two steps: first extract domain-unspecific structure from the documents and then use the results to extract domain-specific elements. We also follow this general pattern and divide the process in document parsing and rule-based extraction.
Document Parsing. Document parsing is domain-independent, and once implemented it can be applied to new sources with similar layout without major changes. Partially structured documents, like newspaper articles, journal papers, technical documentation, and medical leaflets, often follow a predefined rough format. With file formats like doc/docx and HTML, it is relatively easy to extract structures like sections, section headers, paragraphs, or tables, as the file format already incorporates tags describing them. More refined methods are needed to recognize such structures in PDF documents and plain text files.
Hierarchical document parsing via top-down approaches, often using image processing methods, was first proposed in [6, 16] and more recently in [7]. Table understanding has been studied independently as part of general document recognition; a comparison of various techniques can be found in [9].
Rule-Based Information Extraction. In contrast to statistical methods for information extraction, rule-based methods are easy to formulate and to understand for a human reader.
There are two possibilities to obtain rules: manual definition in rule-based systems or automatic rule learning. Early systems relied on manually defined rules; nowadays, more and more learning systems appear. Statistical methods transform the task into a classification task using, for example, Hidden Markov Models, Conditional Markov Models, or Conditional Random Fields; see [14]. However, well-working learning systems usually require sufficient training data. Moreover, even though it might be hard to define rules for every case, rule-based approaches can be implemented easily. A further advantage is that the goal, and not the procedure to achieve it, is encoded, which enables an easier expression of the solution. Furthermore, rule-based systems are fast and can be easily optimized. Examples of rule-based systems are GATE [4], TextMarker [13], and SystemT [3].
Data Management. For efficient data management, it is especially important to use a suitable data model that efficiently supports queries. The most crucial point is to use a generic data model, as we cannot fix the schema of the extracted data beforehand. Work on generic model management has developed a generic data model that serves as an abstraction of particular metamodels and preserves as much of the original features of the modeling constructs as possible. As the goal there is to support model management operations, that generic data model focuses on a detailed representation of features of different modeling languages. Here, in contrast, we aim at providing a flexible and extensible representation of data objects and do not deal with the details of modeling languages.
6 Conclusion and Discussion of Results
In this paper, we presented a complete system to extract specified information from medical documents, using efficient methods for extraction, data storage and retrieval, and reviewing. We applied the system to a real use case of our industrial partner.
The results from this use case and the feedback from our industrial partner were very promising. In contrast to a complete manual extraction process, the proposed system provides a repeatable and traceable extraction procedure. Especially, the web-based UI with an integrated visualization of source document and extracted information was considered as an important component of the system. Such a link between extracted information and source data cannot be easily established in a manual approach.
We have shown that information extraction from documents is possible with adaptable methods. Data quality in our approach is high as the extracted data can be matched with controlled vocabularies.
The rule-based extraction process was also able to reveal inconsistencies in the source documents. For example, some documents contained inconsistent information (value X for a certain property was stated on one page, whereas another value Y was stated for the same property on another page). Another example of a problem revealed in the source data was the use of outdated terminology in the documents, as controlled vocabularies also evolve.
With the data management service and the generic data model, we developed a flexible framework for data processing. This service is also a core component of other projects in our group, as it provides an easy-to-use yet efficient way to manage data. Furthermore, we intend to extend the data management framework with a common and flexible query mechanism and with metadata enrichment.
An interesting direction for future work is to use the user input to learn rules for the extraction process: if a user always edits the extracted information in the same way, this pattern might be expressed as a rule.
The example has been taken from http://agence-tst.ansm.sante.fr/html/pdf/3/expor.pdf which is actually an export license of the French authority (ANSM). The manufacturing licenses which we considered in our use case had a similar structure; due to reasons of confidentiality, we cannot show the documents which we processed.
Optical character recognition.
MedDRA® trademark is owned by IFPMA on behalf of ICH. There are other medical terminology systems (or ontologies) available, but we have to use MedDRA® as it is the terminology required by the authorities.
The final layout of such documents depends on many factors, especially the settings of the selected printer. Thus, the layout of a certain page is not stored in the file but only created when the document is rendered on a screen or printer.
This work has been partially funded by the German Federal Ministry of Education and Research (BMBF) (project HUMIT, http://humit.de/, grant no. 01IS14007A).
- 1. Aguiar, B.L., Mendes, E., Ferreira, L.: Information extraction from medication leaflets. Master thesis, FEUP, Porto (2012)
- 2. Bakiu, B.: Layout-aware semantic information extraction from semi-structured documents. Master thesis, RWTH Aachen University (2015)
- 3. Chiticariu, L., Krishnamurthy, R., Li, Y., Raghavan, S., Reiss, F.R., Vaithyanathan, S.: SystemT: an algebraic approach to declarative information extraction. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL 2010, pp. 128–137. Association for Computational Linguistics, Stroudsburg, PA, USA (2010)
- 4. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I., Gorrell, G., Funk, A., Roberts, A., Damljanovic, D., et al.: Developing language processing components with GATE version 7 (a user guide). University of Sheffield, UK (2013). https://gate.ac.uk/sale/tao/index.html
- 5. Duke, J.D., Friedlin, J.: ADESSA: a real-time decision support service for delivery of semantically coded adverse drug event data. In: AMIA Annual Symposium Proceedings, vol. 2010, pp. 177–181 (2010)
- 6. Ejiri, M.: Knowledge-based approaches to practical image processing. In: Industrial Applications of Machine Intelligence and Vision (MIV-89), Tokyo, 10–12 April 1989, p. 1 (1989)
- 7. Gao, L., Tang, Z., Lin, X., Liu, Y., Qiu, R., Wang, Y.: Structure extraction from PDF-based book documents. In: Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, pp. 11–20. ACM (2011)
- 8. Ge, C., Zhang, Y., Duan, H., Li, H.: Identification of adverse drug events in Chinese clinical narrative text. In: Park, J.J.J.H., Pan, Y., Chao, H.-C., Yi, G. (eds.) Ubiquitous Computing Application and Wireless Sensor. LNEE, vol. 331, pp. 605–612. Springer, Dordrecht (2015). doi:10.1007/978-94-017-9618-7_62
- 9. Göbel, M., Hassan, T., Oro, E., Orsi, G.: ICDAR 2013 table competition. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 1449–1453. IEEE (2013)
- 10. Gomaa, W.H., Fahmy, A.A.: A survey of text similarity approaches. Int. J. Comput. Appl. 68(13), 13–18 (2013)
- 11. Iqbal, E., Mallah, R., Jackson, R.G., Ball, M., Ibrahim, Z.M., Broadbent, M., Dzahini, O., Stewart, R., Johnston, C., Dobson, R.J.B.: Identification of adverse drug events from free text electronic patient records and information in a large mental health case register. PLoS ONE 10(8), e0134208 (2015)
- 13. Kluegl, P., Atzmueller, M., Puppe, F.: TextMarker: a tool for rule-based information extraction. In: Chiarcos, C., de Castilho, R.E., Stede, M. (eds.) Proceedings of the Biennial GSCL Conference 2009, 2nd UIMA@GSCL Workshop, pp. 233–240. Gunter Narr Verlag (2009). http://ki.informatik.uni-wuerzburg.de/papers/pkluegl/2009-GSCL-TextMarker.pdf
- 14. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML 2001, pp. 282–289. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2001)
- 16. Nagy, G., Seth, S.: Hierarchical representation of optically scanned documents. In: International Conference on Pattern Recognition, vol. 1, pp. 347–349 (1984)
- 19. Thompson, C.A., Califf, M.E., Mooney, R.J.: Active learning for natural language parsing and information extraction. In: ICML, pp. 406–414 (1999)