Protégé: A Tool for Managing and Using Terminology in Radiology Applications
- 1.7k Downloads
The development of standard terminologies such as RadLex is becoming important in radiology applications, such as structured reporting, teaching file authoring, report indexing, and text mining. The development and maintenance of these terminologies are challenging, however, because there are few specialized tools to help developers to browse, visualize, and edit large taxonomies. Protégé (http://protege.stanford.edu) is an open-source tool that allows developers to create and to manage terminologies and ontologies. It is more than a terminology-editing tool, as it also provides a platform for developers to use the terminologies in end-user applications. There are more than 70,000 registered users of Protégé who are using the system to manage terminologies and ontologies in many different domains. The RadLex project has recently adopted Protégé for managing its radiology terminology. Protégé provides several features particularly useful to managing radiology terminologies: an intuitive graphical user interface for navigating large taxonomies, visualization components for viewing complex term relationships, and a programming interface so developers can create terminology-driven radiology applications. In addition, Protégé has an extensible plug-in architecture, and its large user community has contributed a rich library of components and extensions that provide much additional useful functionalities. In this report, we describe Protégé’s features and its particular advantages in the radiology domain in the creation, maintenance, and use of radiology terminology.
Key wordsOntologies terminologies vocabulaires RadLex software tools
The language of biomedicine is rich and diverse, and many domains recognizing the need to standardize the language they use have been developing terminologies and ontologies. Terminologies and ontologies are related. Terminologies (also called “vocabularies” and “lexicons”) are collections of labels or terms (or “entities”) particular to a subject field or domain of human activity, developed for the purpose of documenting and promoting correct usage. Ontologies are similar to terminologies, but generally contain rich knowledge about their terms (through attributes and relations) with the goal of describing the entities that exist in a domain. Many terminologies and ontologies have been appearing throughout biomedicine, including genomics,1 molecular biology,2 anatomy,3 clinical research,4 and clinical care.5
The radiology community has also recently recognized the need for developing a standardized terminology. Hospitals contain a diversity of computerized information systems, and they often refer to the same procedures, findings, and diagnoses using different terminologies. The lack of standards in terminology conventions creates a barrier for comparing studies across enterprises and among health organizations, and it also hampers research because it is difficult to retrieve indexed case material in a consistent manner.
RadLex is a project to create a comprehensive medical imaging terminology for capturing, indexing, and retrieving a variety of radiology information resources.6 RadLex is being developed by acquiring the terms relevant to the radiology subdisciplines (abdominal, cardiovascular, musculoskeletal, pediatric, thoracic, and neuroradiology). The ultimate goal is to distribute RadLex widely in electronic form so that it can be incorporated into applications such as structured reporting, teaching file authoring, and indexing research data.
RadLex was initially developed using word processors and spreadsheets. However, it soon became evident that these tools were too limited for accessing and managing RadLex. First, as RadLex is a large taxonomy, it is difficult to visualize and navigate RadLex using conventional tools such as word processors and spreadsheets. Second, RadLex contains many attributes associated with each term, and it is difficult to maintain this information in correct structured form as RadLex grows. Third, although text documents and spreadsheets were a simple initial format for collecting the RadLex terms, they are not suitable for distributing RadLex on the Web or for making it accessible to applications. A tool tailored to the needs of terminology development and dissemination in radiology is needed.
Protégé (http://protege.stanford.edu)7 is an open-source tool for editing and managing terminologies and ontologies. It is the most widely used domain-independent, freely available, platform-independent technology for developing and managing terminologies, ontologies, and knowledge bases in a broad range of application domains. The community of registered Protégé users exceeds 70,000. In addition, the large and active Protégé user community is highly engaged in Protégé code development, regularly contributing enhancements to the software (http://protege.stanford.edu/community/wiki.html), as well as participating in online discussion groups devoted to modeling questions, technical-support issues, and requests for new features (http://protege.stanford.edu/community/archives.html).
Protégé has been used as the primary development environment for many ontologies in the life sciences. These projects include the Foundational Model of Anatomy,3 Cerner’s Clinical Bioinformatics ontology,8 the DICE TS,5 and the MGED Ontology.2 Recently, the RadLex project decided to adopt Protégé, and RadLex is currently distributed in Protégé XML format (http://radlex.org/radlex/docs/downloads.html).
In this report, we introduce Protégé and highlight its value to managing terminology in the radiology domain. Besides providing a suite of functionality that helps curators edit and manage terminologies such as RadLex, Protégé provides an extensible architecture to which custom functionality specific to radiology needs can be added, and it also provides a platform for deploying radiology applications that exploit terminologies such as RadLex.
Protégé Knowledge Model
Protégé is a suite of tools for ontology development and use. Before describing the tool itself, it is important to understand how terminologies and ontologies are built and stored in computers. Briefly, terminologies comprise lists of terms (also called “entities”), and they may also specify attributes (also called “slots”) for those terms (for example, RadLex contains a term called supraaortic valve area, which has an attribute Version Number). Ontologies are similar to terminologies, but contain rich relationships amongst terms, enabling them to represent knowledge in a domain (for example, a relationship SegmentOf linking supraaortic valve area to thoracic arota represents the fact that the supraaortic valve area is a segment of the thoracic aorta).
Axioms specify additional constraints, but will not be described further here as they are not used in RadLex at this time. An instance is a frame built from at least one class (“instantiation”) that carries particular values for the slots. A “knowledge base” includes the ontology (classes, slots, facets, and axioms) as well as instances of particular classes with specific values for slots. The distinction between classes and instances is not an absolute one; however, because terminologies generally do not contain instances, we can simplify and limit our discussion to classes.
Classes, slots, and facets are the basic building blocks of terminologies. A tutorial on using these elements and Protégé for creating terminologies and ontologies is available10 and will not be repeated here. Instead, we will focus on the features of Protégé that are particularly relevant to accessing, editing, and sharing ontologies and using them in applications.
Architecture of Protégé
In addition, Protégé has a plug-in architecture, permitting developers to extend Protégé’s core functionality in many ways without needing to modify the Protégé source code. There are more than 90 Protégé plug-ins providing advanced capabilities such as import/export, validation, and visualization of large ontologies (http://protege.cim3.net/cgi-bin/wiki.pl?ProtegePluginsLibraryByTopic), some of which we discuss below.
Import and export of ontology content is a critical feature of any tool, because there are many formats for storing ontologies and terminologies. There are several plug-ins for Protégé, enabling it to import and export ontologies in many different formats, including XML, RDF, OWL, and the native Protégé format.
If the ontology is stored in a file, the entire ontology is read into memory. Protégé also provides a relational database backend, which is useful when working with ontologies too large to reside in memory. Finally, through Protégé’s plug-in mechanism, developers can create their own custom import/export plug-ins to work with custom or specialized formats.
Protégé Application and Graphical User Interface
Protégé can be run as a stand-alone application or through a Protégé client in communication with a remote server (the latter is particularly useful for collaborative ontology development). When the Protégé desktop application is launched, the user can create a new ontology, open an existing ontology, or import an ontology from a variety of formats. Protégé creates a project file that records display-related information and the location of the ontology source file; the project file simplifies the process of reopening ontologies, and it maintains user GUI customizations.
The tree view of RadLex can facilitate the process of curating the terminology to find problems to be corrected, because the various hierarchies can be easily expanded and browsed to detect inconsistencies. In this tree view, for example, we see that the supraaortic value area is a child of thoracic aorta (Fig. 4; “supraaortic value area is-a thoracic aorta”). As it is not correct to say that “supraaortic valve area is-a thoracic aorta,” the RadLex ontology should be changed in a future release (what was likely intended was to say “supraaortic value area part-of thoracic aorta,” which could be reflected by changing the Part-of value).
The classes tab permits full editing of the ontology. In the tree view of the ontology on this tab, users can drag and drop classes to reorganize the hierarchy, create and rename classes using intuitive GUI paradigms. The values for attributes of classes can also be directly edited. For example, we could change the name of the class, left coronary sinus, by simply selecting this class and changing the value of the “Preferred Name” slot in the GUI (Fig. 4, right panel).
The other tabs accessible from the Protégé GUI are the slots, forms, and instances tab, which are less often used by those curating terminologies. The slots tab enables users to view and edit all slots in an ontology; the instances tab displays all instances associated with the classes; and the forms view permits users to customize the layout of the elements on display forms.
Ontologies change over time. They are developed iteratively, with incremental changes contributed by people working collaboratively on them. RadLex will soon be releasing a new version, adding new terms as well as changes to the original terminology based on suggestions from users of the first version. There is a need to maintain and compare versions of ontologies, similar to the need for document version management in large projects. Document versioning systems enable users to compare versions, examine changes, and accept or reject changes. However, a versioning system for ontologies must compare and present structural changes rather than changes in the text of a document.
PROMPT is a plug-in to Protégé providing a suite of tools for aligning and comparing two ontologies. PROMPT contains PROMPTDIFF, a versioning environment for ontologies, allowing developers to track changes in ontologies over time.11,12 PROMPTDIFF is a version-comparison algorithm that produces a structural difference between two versions of an ontology. The results are presented to the users enabling them to view classes that were added, deleted, and moved. The users can then act on the changes, accepting or rejecting them.
supraaortic valve area was renamed to supraaortic area
supraaortic area was moved from being under thoracic aorta to being under anatomic location.
noncoraonary sinus is a new class, under supraaortic area
thoracic aorta was added as a value for the Part-of relationship on the supraaortic area class.
In addition to viewing versions of ontologies, PROMPT also includes a tool to align two different ontologies to locate classes that are the same or similar to each other. Such functionality could be useful in RadLex for finding related terms in other vocabularies such as SNOMED and the ACR index.
Ontology Visualization Components
OntoViz is a plug-in to Protégé that creates graph visualizations of ontologies (http://protege.cim3.net/cgi-bin/wiki.pl?OntoViz). OntoViz produces its graphs using the Graphviz software package.13 The types of visualizations are highly configurable and include ability to (1) visualize part of an ontology by selecting specific classes and instances, (2) display slots and slot edges, (3) customize the graph by specifying colors for nodes and edges, and (4) visualize the neighborhood of classes for particular classes or instances by applying various operators (e.g., subclasses, superclasses).
Collaborative Ontology Development
A key challenge for maintaining a large terminology such as RadLex is to collect and process the feedback from the large community of users. The current model for feedback on terminologies and ontologies is for users to contact the developers or to post comments to online discussion forums. The problem with this approach is that the user feedback is disconnected from the ontology under discussion; it is difficult for the ontology developer to keep track of these discussions, to prioritize the feedback, and to decide which parts of the ontology need updated first.
The Collaborative Protégé components provide search and filtering of user annotations based on different criteria. There are also two types of voting mechanisms that can be used for voting of change proposals. All annotations made by one user are seen immediately by other users, enabling a community-based discussion about ontologies to be directly linked to the applicable portions of the ontologies.
Despite the variety of plug-ins available for Protégé, many users will have custom needs related to how they work with ontologies, which will require direct programmatic access to the ontology content. For example, a RadLex curator may wish to find the classes that have missing values for slots (currently, many SNOMED and UMLS identifiers have not yet been collected for RadLex terms). To find these classes in RadLex, a program would be needed to traverse all classes in the RadLex ontology, looking for those that have missing values for the slots in question.
In addition to the Script Tab, Protégé provides an API. System developers can use the Protégé API, a Java interface to access and programmatically manipulate Protégé ontologies—this is the same API that the Protégé GUI uses to access ontologies (Fig. 3). Developers can access the Protege Java API from their Java programs by calling the appropriate methods to access the ontology in Protégé and to perform the desired operations. Documentation for guiding developers in using the Protégé API for application and plug-in development is available (http://protege.stanford.edu/doc/dev.html).
While Protégé is a useful tool for working with ontologies and terminologies, the ultimate reason to be creating these artifacts is to use them in applications. For example, if one wanted to use RadLex in a structured reporting application to enforce use of controlled terminology, it would be necessary for that application to access terms in the ontology. Protégé can be integrated with many applications, connecting external programs to ontologies via the Protégé API (as described above). In this section, we highlight an example application relevant to RadLex that was created in this manner—WebProtege.
A large terminology such as RadLex needs to be both widely available and locally editable. A common method for distributing a terminology is via the Web; however, once a user downloads the terminology, it will be become out of date as soon as a new version is posted. An alternative method for disseminating RadLex is to provide a Web-based program for viewing the terminology. In fact, RadLex is currently made available through a dedicated Web application (http://www.radlex.org/radlex/). However, because the version of RadLex that is deployed in the Web application is separate from the working draft, it is challenging to keep the development and production versions of RadLex tightly synchronized.
An approach whereby a single version of RadLex could be edited by curators and disseminated on the Web could be advantageous to make a current working draft available. We undertook to implement this strategy by adopting WebProtege, an ontology-enabled application built on the Protégé platform. The goal was to distribute Radlex in a Web application while providing the means to keep it synchronized with the version in development.
The Protégé architecture allows multiple applications as well as Protégé itself to access the same ontology (see Fig. 3). Thus, the maintenance and coordination of the terminology is greatly facilitated, because there need be only one copy of the terminology residing in a central location that all applications access through the Protégé API. A RadLex developer can use the Protégé GUI to edit the terminology, while applications that need RadLex use the Protégé API, accessing the same version that the developer has edited, and users browsing RadLex through WebProtege see the same version of the terminology as well.
In addition to the WebProtege application that we have described here, other applications that use ontologies can be built using similar principles. The common denominator is that in order for an application to use an ontology, the appropriate methods in the Protégé API are called to get references to classes (terms) and their attributes (slots).
The need for terminologies and ontologies in radiology is becoming increasingly apparent. Variation in language used in reporting the results of imaging studies is a recognized challenge,16 which can be mitigated by adopting standard terminologies. Radiology educators need a comprehensive lexicon for indexing online educational materials.6 Radiology researchers need their data to be indexed with terminologies so that search and data mining can be more effective. Ontologies can be used to create new intelligent Picture Archiving and Communication System applications.17
Visualization of terminologies: Terminologies can be large. Users need to be able to browse terminologies compactly, collapsing entire trees to hide the complexity from users. Different paradigms for visualization (tree view, graph view, etc) are desirable because different ways of viewing large terminologies could be preferable under different circumstance (tree views give a good overview of taxonomies; graphs are suitable for displaying multiple relationships among terms).
Terminology versioning: It is useful to show how terminologies changed between versions. Traditional “document diff” methods that show text differences are not useful for terminologies because of their complex structure. Specialized tools are needed to show how terminologies changed between versions (term moving to a different place, being split into two or more new terms, etc).
User feedback management: User feedback drives the evolution of terminologies. It is difficult to keep track of all the requested changes to a terminology, especially as the volume of feedback expands. Methods are needed to track the type of comments being made, the terms to which they apply, and whether and how feedback was handled. It is desirable for the feedback to be publicly accessible so that the details of the evolution of the terminology are transparent to the community.
Terminology dissemination: Applications need to be able to access the current version of the terminology as soon as new versions are produced. It can be cumbersome for developers of applications that use a terminology such as Radlex to keep their version of the terminology current as RadLex evolves; it would be preferable to avoid the need to modify of update applications when new versions of the terminology are released.
Application development: Ultimately, a terminology is only useful if it can be easily incorporated into applications. An application should be able to reference terms in the terminology using simple methods, and application developers should not need to understand the arcane details of terminology storage formats.
In this report, we have described Protégé, an open-source tool that can help users of terminologies and ontologies to develop them and use them in applications. Protégé provides tools to support the functional requirements for terminology developers and consumers as described above. The Protégé tools range from creating the terminology (GUI, visualization, and versioning support) to deployment and application development (user feedback management, API, and application support). We have illustrated these functions using RadLex as an example terminology. In fact, the RadLex project has adopted Protégé for managing the terminology, and Protégé has been useful in meeting the terminology curation needs of the RadLex project to date.
Although Protégé provides functionality that meets many needs of the user community, there are still some unmet challenges. First, terminologies evolve over time, and terms that are used in applications may be retired in future versions of the terminologies. When a term is deleted from a newer version of the terminology, applications that use an older version of the terminology need to be provided an appropriate replacement term.
A second challenge is that the compact visualizations provided by Protégé may not be sufficient in cases of very large terminologies. Although there are some visualization paradigms for large ontologies, it could be helpful to filter the terminology or produce a compact “view” of the ontology specific to context of the user. A final challenge is that users new to terminologies and ontologies may experience a learning curve in becoming acquainted with the concepts and learning how to create terminology-enabled applications. The Protégé web site contains tutorials and introductory material that can help the community get up to speed with this tool and its use in applications.
This work was supported by the National Center for Biomedical Ontology, under roadmap-initiative grant U54 HG004028 from the National Institutes of Health. This work was also supported by the Protégé resource, under grant LM007885 from the U.S. National Library of Medicine. We acknowledge the highly talented Protégé research and development team, Tim Redmond, Samson Tu, Tania Tudorache, and Jennifer Vendetti, and Ray Fergerson for developing the Protégé system. We also thank the Protégé user community for their outstanding contributions to the Protégé platform.
- 8.Hoffman M, Arnoldi C, Chuang I: The clinical bioinformatics ontology: a curated semantic network utilizing RefSeq information. Pac Symp Biocomput:139–150, 2005Google Scholar
- 9.Noy NF, Fergerson RW, Musen MA: The knowledge model of Protege-2000: Combining interoperability and flexibility. Lect Notes Artif Int 1937:17–32, 2000Google Scholar
- 10.Noy NF, McGuinness DL: Ontology Development 101: A Guide to Creating Your First Ontology. Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880, 2001Google Scholar
- 11.Noy NF, Kunnatur S, Klein M, Musen MA: Tracking changes during ontology evolution. Semantic Web—Iswc 2004, Proceedings 3298:259–273, 2004Google Scholar
- 14.Alani H: TGVizTab: An Ontology Visualisation Extension for Protégé. Proc. Proceedings of Workshop on Semantic Authoring, Annotation & Knowledge Markup (SAAKM’02), the 15th European Conference on Artificial Intelligence, (ECAI’02): Lyon CityGoogle Scholar
- 15.Tudorache T, Noy NF: Collaborative Protégé. Proc. Workshop on Social and Collaborative Construction of Structured Knowledge: Banff CityGoogle Scholar