INTRODUCTION

Typical DAM systems contain primarily visually rich files such as images, logos and line art, designed documents like QuarkXPress, Adobe InDesign and Illustrator files, audio, video, animation and more.1 These images may be used in a variety of business content including advertising, marketing materials, multimedia press kits, sales kits, training materials, corporate presentations, published content such as books, magazines and other business documents that are rich in image content.

When searching for images, there are no textual clues for retrieval. The quality of retrieval depends strictly on the quality of the metadata applied, especially the keywords that describe the images. The keywords provide the “ofness” and “aboutness” of the images. Inconsistent keywording and a limited understanding of the vocabulary typically used for image searching result in a repository of images that cannot be retrieved and are therefore lost in the DAM system.

Relying on social tagging,2 where designated user/indexers randomly assign any words they feel are appropriate (see Flickr3 or del.icio.us4), leads to imprecise and ambiguous descriptions of images.5 For example, there is no immediate way of telling whether a photo tagged with “apple” shows a fruit or a computer. In addition, a search for “apple” will miss relevant images tagged as “GrannySmith.”6 Plural and singular forms, conjugated words and compound words may be used, as well as specialized tags and “nonsense” tags designed as unique markers that are shared among a group of friends or co-workers. The result is an uncontrolled and chaotic set of tagging terms that does not support searching as effectively as more controlled vocabularies do.7

This paper will focus on developing a controlled vocabulary in the form of a thesaurus, specifically for use in applying keywords to images. The principles here can be applied to textual documents and also to other media contained in DAM systems. The intention is to acquaint readers who are not experienced in the disciplines of library science and taxonomy development with basic concepts, terminology and the applications of thesaurus construction.

TAXONOMY OR THESAURUS?

The terms taxonomy and thesaurus tend to be used inconsistently, even by professionals in the library science, taxonomy, and ontology fields. After all, there is an inherent difficulty in defining a list of words with words. Since an industry standard is a good place to start, we will first look at the following definitions from the “Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies,” an American National Standard developed by the National Information Standards Organization (ANSI/NISO).8

Taxonomy: “A collection of controlled vocabulary terms organized into a hierarchical structure. Each term in a taxonomy is in one or more parent/child (broader/narrower) relationships to other terms in the taxonomy.”9

Thesaurus: “A controlled vocabulary arranged in a known order and structured so that the various relationships among terms are displayed clearly and identified by standardized relationship indicators….”10

The classic use of the term taxonomy is the classification of biological organisms, as shown in Figure 1.

Figure 1

The classic use of the term “Taxonomy” as applied to biological organisms, with “Kingdom” being the broadest or least specific term and “Species” being the narrowest or most specific term.12

A thesaurus is a type of taxonomy focusing specifically on the relationships between the terms. It provides a standardized terminology or controlled vocabulary for a particular area of knowledge. The hierarchical arrangement, or parent/child (broad/narrow) relationship between the terms, provides a way to group or categorize information into logical subjects or topics.

According to the Merriam-Webster dictionary, the definition of thesaurus is:

“…a book of words or of information about a particular field or set of concepts; especially: a book of words and their synonyms b: a list of subject headings or descriptors usually with a cross-reference system for use in the organization of a collection of documents for reference and retrieval.”11

The above definition refers to “a book of words and their synonyms” as seen in association with Roget's Thesaurus. This reference work is used to find synonyms, providing cross-references to terms and spelling out relationships between them. A thesaurus applied to an electronic information retrieval system basically does the same.

So, to summarize, a thesaurus:

  • Provides a controlled vocabulary for the description and retrieval of information. (A controlled vocabulary limits the use of words to an agreed-upon unambiguous set of terms.13)

  • Is in a hierarchical arrangement with broad to narrow relationships between terms.

  • Defines other relationships between terms including synonyms (equivalence) and related terms (associative).

  • Encompasses the vocabulary used for a particular field.

  • Clarifies word meanings in the case of homonyms or ambiguous terms.

  • In a DAM system, can be used to describe assets and to aid in their retrieval.

ANATOMY OF A THESAURUS

Every field of knowledge has its vocabulary (thus, thesauri) and thesaurus construction and maintenance is no exception. Here is a brief introduction to some of the terminology used to express relationships between the terms in a thesaurus. Terms in a thesaurus are referred to as:

  • Preferred Term (PT): The term that has been selected for inclusion in the controlled vocabulary. In the Preferred Term's entry, UF (USED FOR) appears next to each Non-Preferred Term (see below) that it replaces.

    Example:

      Jewelry (Preferred Term)
      UF Jewels (Non-Preferred Term)

    This means: Use the term “Jewelry” to find “jewels.” In other words, if “jewels” is entered into a search, assets with the keyword “jewelry” will appear in the search results.

  • Non-Preferred Term (NP): A term that is equivalent to the Preferred Term. This term has “USE” next to it as a cross-reference to the Preferred Term. Non-Preferred Terms are often referred to as synonyms.

    Example:

      Jugs
      USE Pitchers

    This means: Use the term “Pitchers” to find jugs. In other words, if “jugs” is entered into a search, assets with the keyword “pitchers” will appear in the search results.

  • Broad Term (BT): A term that is more general. For example, in animal:mammal, “animal” is the broad term.

  • Narrow Term (NT): A term that is more specific. For example, in animal:mammal, “mammal” is the narrow term.

  • Related Term (RT): A term that has an associative relationship to another term. It points users to information they might be interested in if they are interested in a particular term.

    Example:

      Food
      RT Nutrition
         Cooking
         Eating & Drinking

    This means: If you’re interested in food, you might also be interested in nutrition, cooking or eating & drinking.
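The relationship indicators described above map naturally onto one small record per preferred term. The following Python sketch is only an illustration and not the data model of any particular DAM product: the `Term` class, its field names and the `resolve` helper are assumptions introduced here, and the entries simply reuse the Jewelry, Pitchers, Food and animal:mammal examples.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    """One preferred term and its relationships (hypothetical structure)."""
    name: str
    used_for: List[str] = field(default_factory=list)  # UF: non-preferred terms
    broader: List[str] = field(default_factory=list)   # BT
    narrower: List[str] = field(default_factory=list)  # NT
    related: List[str] = field(default_factory=list)   # RT


# Entries taken from the examples in this section.
TERMS = [
    Term("Jewelry", used_for=["Jewels"]),
    Term("Pitchers", used_for=["Jugs"]),
    Term("Food", related=["Nutrition", "Cooking", "Eating & Drinking"]),
    Term("Animal", narrower=["Mammal"]),
    Term("Mammal", broader=["Animal"]),
]

# Preferred terms indexed by lowercase name.
PREFERRED = {term.name.lower(): term for term in TERMS}

# USE cross-references: non-preferred term -> preferred term.
USE = {alt.lower(): term.name for term in TERMS for alt in term.used_for}


def resolve(word: str) -> Optional[str]:
    """Return the preferred term for a word, following USE references."""
    key = word.strip().lower()
    if key in PREFERRED:
        return PREFERRED[key].name
    return USE.get(key)  # None when the word is not in the vocabulary


if __name__ == "__main__":
    for word in ["jugs", "Jewels", "food", "teapot"]:
        print(word, "->", resolve(word))
```

In this sketch a search for “jugs” is redirected to “Pitchers” before any assets are looked up, which is the behavior the USE and UF indicators describe.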

A picture is worth a thousand words, and in the case of thesauri this is particularly true. See the sample from the “Library of Congress Thesaurus for Graphic Materials”15 for the term “Storms” (Figure 2).

Figure 2

Library of Congress Thesaurus for Graphic Materials entry for “Storms”

In the example in Figure 2, the term “Storms” is the Preferred Term (PT). The terms “Natural Disasters” and “Thunderstorms” are Non-Preferred Terms (NP, see Used For next to these terms), “Weather” is the Broader Term (BT), “Blizzards,” “Cyclones,” “Dust Storms,” etc are Narrower Terms (NT) and “Disasters,” “Floods,” “Hail,” etc are Related Terms (RT). See Figure 3.

Figure 3

Broad to narrow terms in the example of the entry for “Storms” in the Library of Congress Thesaurus for Graphic Materials

WHY DO WE NEED A THESAURUS?

One of the main problems to be solved in DAM projects is that insufficient, inaccurate or incomprehensible information is held about the digitized materials. Simply digitizing video and audio does not turn them into assets: the value arises from their use and the relevance of the associated data.16 Textual materials can be described and retrieved using metadata elements such as title, table of contents, index, abstract and, of course, the full text. This is not the case for images. Using captions is helpful and is recommended as a searchable field in an image database, but the what? who? when? where? and why? of a digital image are found in the keywords used to describe it.

Let us look at an example of a photograph and describe it using keywords (Figure 4). For the photo below, ask yourself “What is this a picture of?” (Ofness) and “What is this picture about?” (Aboutness).17 What ideas, concepts and terms would users have in mind when searching for this type of photo? What are the possible uses for this photo within your organization?

Figure 4

Example photograph

Here are some keywords that might be applied to the photo shown in Figure 4:

Ofness:

Girl

Two girls

Kindergarten

Seven–eight years old, seven years old, eight years old

Aboutness:

Running

Outdoors, outside

Playing, playful

Fun

Happy, happiness, joy, joyful, enjoying, enjoy

Laughing, laughter

Friendship, friends

Summer

You may have thought of other terms that describe this photograph, but for the purpose of this exercise we will use the keywords listed here. When doing this exercise with a live group, we have found that people use different words to express similar ideas, concepts and even things. Therefore, ambiguity is inevitable. This ambiguity makes a controlled vocabulary in the form of a thesaurus essential to any image-retrieval system. For example, some users may search using the keyword “outdoors” while others will use “outside.” Some may use “friend” or “friends” while others use “friendship.” If all of these words are not somehow incorporated into the metadata for this image, many users will not find it.

Let us create a mini-thesaurus using the keywords we just applied to this photograph. (To simplify, we are using “synonym” instead of “non-preferred term” and not all possible synonyms and related terms are included):

This mini-thesaurus is hierarchical, is a controlled vocabulary, provides a standardized terminology that can be used to retrieve assets, clarifies meanings where there is a possibility for ambiguity, and includes variations of words and keyword phrases.
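One way to encode a mini-thesaurus of this kind is sketched below in Python, using only keywords from the exercise. It is an illustration under stated assumptions: the dictionary layout, the `to_preferred` helper and the choice of which word in each group is the preferred term are all made up for this sketch, and “Emotions” is used as a broad term only because it appears as one later in this paper.

```python
# Hypothetical mini-thesaurus built from the keywords listed above.
# Which word is "preferred" in each group is an assumption of this sketch.
MINI_THESAURUS = {
    "Outdoors": {"synonyms": ["outside"], "broader": None},
    "Friendship": {"synonyms": ["friend", "friends"], "broader": None},
    "Emotions": {"synonyms": [], "broader": None},
    "Happiness": {
        "synonyms": ["happy", "joy", "joyful", "enjoying", "enjoy"],
        "broader": "Emotions",
    },
    "Laughing": {"synonyms": ["laughter"], "broader": None},
}

# Index every preferred term and synonym by its lowercase form.
LOOKUP = {}
for preferred, entry in MINI_THESAURUS.items():
    LOOKUP[preferred.lower()] = preferred
    for synonym in entry["synonyms"]:
        LOOKUP[synonym.lower()] = preferred


def to_preferred(word):
    """Map a search word or keyword to its preferred term, if one exists."""
    return LOOKUP.get(word.strip().lower())


# The indexer stores only preferred terms on the photograph.
photo_keywords = {"Outdoors", "Friendship", "Happiness", "Laughing"}

if __name__ == "__main__":
    for query in ["outside", "friends", "joyful", "indoors"]:
        term = to_preferred(query)
        print(query, "->", term, "| matches photo:", term in photo_keywords)
```

With this arrangement the indexer applies one preferred term per concept, and users who type “outside” or “friends” still reach the photograph.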

APPLYING THIS TO A DAM SYSTEM

To build a thesaurus that can be used effectively within a DAM system, the DAM software needs to include thesaurus maintenance functionality. There are a few systems on the market with this functionality, such as Artesia TEAMS and Quark DMS. It is highly recommended that this be considered before purchasing a DAM system, especially if most of the content is visually rich.

In an organization where users need immediate access to assets in a DAM system, a standard and controlled vocabulary approach will lead to more accurate image retrieval. The implementation of a solid keyword vocabulary in combination with an indexing and keywording policy will greatly enhance retrieval of images in a DAM system.

IMPLICATIONS FOR INDEXING AND RETRIEVAL

What is the advantage of having a hierarchical arrangement of terms (broad to narrow relationships)? For one, it makes indexing easier by having all the words in a particular category grouped together. For example, all emotions (happiness, sadness, fear, embarrassment, etc) that might apply to images may be narrow terms with the broad term being “emotions.” The indexer can browse all the “emotions” in the thesaurus while applying keywords to the image, instead of hunting around for appropriate terms in an alphabetical or another arrangement.

These keyword suggestions are invaluable to indexers, especially the top or the broadest term. For example, the following broad terms or groupings of words would help an indexer describe an image of a person or people: action, age, emotion, number of people, concept, perspective (ie close-up), gender and ethnicity. A thesaurus enables groupings of this type and makes the assignment of keywords less random. A good indexer is always looking for suggestions to help in the description of an image. The different levels or groupings within a thesaurus can assist in finding terms that might have otherwise been forgotten.
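The way groupings can prompt an indexer is sketched below in Python. This is a minimal illustration, not a feature of any named product: the `GROUPINGS` table and both helper functions are hypothetical, the “Emotions” terms come from the example above, and the narrow terms listed under the other two groupings are invented placeholders.

```python
# Broad terms (groupings) mapped to their narrow terms.
# "Emotions" follows the example in the text; the other narrow terms
# are placeholders assumed for this sketch.
GROUPINGS = {
    "Emotions": ["Happiness", "Sadness", "Fear", "Embarrassment"],
    "Number of people": ["One person", "Two people", "Group"],
    "Perspective": ["Close-up", "Wide shot"],
}


def narrow_terms(broad_term):
    """Return the narrow terms an indexer can browse under a broad term."""
    return sorted(GROUPINGS.get(broad_term, []))


def unused_groupings(assigned_keywords):
    """Return the groupings from which no keyword has been assigned yet."""
    assigned = {keyword.lower() for keyword in assigned_keywords}
    return [
        broad
        for broad, narrow in GROUPINGS.items()
        if not any(term.lower() in assigned for term in narrow)
    ]


if __name__ == "__main__":
    print(narrow_terms("Emotions"))
    # Remind the indexer which groupings still have no keyword applied.
    print(unused_groupings(["Happiness", "Two people"]))
```

The second helper reflects the point made above: a list of groupings that have not yet been used can remind the indexer of terms that might otherwise be forgotten.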

As we have already seen, there are many advantages to using a thesaurus on the search and retrieval side:

  1. It enhances the accuracy of search results by providing a controlled vocabulary that accounts for variations in terms, including synonyms, plurals and possibly even misspellings.

  2. Related terms can be used to suggest other images that may also be of interest (eg, a link to “More Like This” or “Similar Images”).

  3. If the thesaurus is visible to users and searchable, they can browse keywords and brainstorm for search terms they might want to use.

  4. Assets assigned with more specific keywords (narrow terms) can be retrieved when searching for a general keyword (broad term). Image searches should have the flexibility to find both the general and the specific. For example, you should be able to find all the birds in the database (general search) and also an image of an eagle (specific search). If possible, search engines or queries should be configured to search narrow terms associated with the keyword entered by the user. By doing this, a user can enter the keyword “bird” and retrieve images of all of the birds in the DAM system. See Figure 5.

    Figure 5

    Thesaurus-enabled search: broad to narrow term relationships and Open Vocabulary keyword search: no term relationships

The alternative to using a thesaurus for keywords is open vocabulary keywording, where there is no relationship between the words. As seen in Figure 5, when an open vocabulary is used for keywording, indexers must remember to include not only the specific terms in the keyword field (like “eagle” or “sparrow”) but also the more general term, “bird.” Otherwise a search for “bird” will yield no results, despite the fact that there are images of birds in the DAM system. Open vocabulary keywording also requires the manual entry of all variants, plurals and synonyms. This scenario is not practical for the efficient and standardized application of keywords, the most important piece of metadata for finding images.
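The difference between the two approaches can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the query logic of any actual DAM system: the `NARROWER` table, the asset file names and both search functions are hypothetical, and each image is assumed to carry only its most specific keyword.

```python
# Broad term -> narrow terms (hypothetical thesaurus fragment).
NARROWER = {
    "animal": ["bird", "dog"],
    "bird": ["eagle", "sparrow"],
}

# Each asset carries only its most specific keyword (hypothetical data).
ASSETS = {
    "img_001.jpg": {"eagle"},
    "img_002.jpg": {"sparrow"},
    "img_003.jpg": {"dog"},
}


def expand(term):
    """Return the term together with all of its narrow terms, at any depth."""
    expanded = set()
    pending = [term]
    while pending:
        current = pending.pop()
        if current not in expanded:
            expanded.add(current)
            pending.extend(NARROWER.get(current, []))
    return expanded


def search_with_thesaurus(term):
    """Match assets tagged with the term or any of its narrow terms."""
    wanted = expand(term)
    return sorted(name for name, keywords in ASSETS.items() if keywords & wanted)


def search_open_vocabulary(term):
    """Match only assets tagged with exactly the term entered."""
    return sorted(name for name, keywords in ASSETS.items() if term in keywords)


if __name__ == "__main__":
    print("thesaurus:", search_with_thesaurus("bird"))
    print("open vocabulary:", search_open_vocabulary("bird"))
```

In the open vocabulary case the search for “bird” finds nothing unless an indexer has also typed “bird” on every eagle and sparrow image, which is the manual burden described above.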

THE NEED FOR STANDARDS AND GUIDELINES

Most enterprises employing a DAM system will have many people of varying levels of skill and aptitude applying keywords to and indexing visual materials. Therefore, it is important to make the process as simple and streamlined as possible. Applying keywords to images is more art than science and is very subjective. What is seen and described is often in the eye of the beholder. Therefore, it is important to have indexing guidelines in place, especially for applying keywords. The Picture Agency Council of America (PACA), New Technology Committee published “PACA Keywording Guidelines” in November of 1996.18 This document is helpful to use as a guide. Also, Getty Images and Corbis publish “Keyword Guides” to assist users in searching their collections. These guides offer insight into the keywording guidelines for these leading stock photo companies. Another source is the Library of Congress documentation from the “Thesaurus for Graphic Materials I: Subject Terms (TGM I).” There is a section called “Indexing Images: Some Principles” available at http://loc.gov/rr/print/tgm1.iihtml.

As mentioned above, the industry standard for construction and maintenance of thesauri is “Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies,”19 developed by the National Information Standards Organization (NISO). A free download is available at http://www.niso.org/standards/resources/Z39-19-2005.pdf. This is an excellent source to consult for thesaurus construction, maintenance and use. Although it is mostly intended for the library and academic community, it can serve as a good starting point for development of a thesaurus standard for your organization. The abstract to the standard offers a summary of its content:

“The Standard presents guidelines and conventions for the contents, display, construction, testing, maintenance, and management of monolingual controlled vocabularies. The Standard focuses on controlled vocabularies that are used for the representation of content objects in knowledge organization systems including lists, synonym rings, taxonomies, and thesauri….”20

REAL-WORLD CONSIDERATIONS: KNOW YOUR USERS

A thesaurus of image keywords can be fairly simple, with a hierarchical structure that reaches only from two levels (eg, dog: beagle) to four levels (eg, animal: mammal: dog: beagle). For most corporations and organizations, the simpler the structure, the better. Consider the searching audience. Who are your users? What are their searching needs? For example, it may not be necessary to include the level of “mammal” in the example above if your images are primarily used for popular media and advertising. If, however, the DAM system serves a magazine that specializes in animal behavior, the word “mammal” may be included to add flexibility to the search.
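Leaving out a level such as “mammal” amounts to attaching its narrow terms directly to its broad term. The Python sketch below shows this for the four-level example; the `PARENT` table and the `path` and `drop_level` helpers are hypothetical names used only for illustration.

```python
# Each term points to its broad term; the top term has none.
PARENT = {
    "beagle": "dog",
    "dog": "mammal",
    "mammal": "animal",
    "animal": None,
}


def path(term, parent):
    """Return the hierarchy from the broadest term down to the given term."""
    chain = []
    while term is not None:
        chain.append(term)
        term = parent[term]
    return list(reversed(chain))


def drop_level(parent, level):
    """Return a copy of the hierarchy without `level`.

    Narrow terms of the removed level are attached to its broad term.
    """
    grandparent = parent[level]
    return {
        term: (grandparent if broad == level else broad)
        for term, broad in parent.items()
        if term != level
    }


if __name__ == "__main__":
    print(" : ".join(path("beagle", PARENT)))
    simpler = drop_level(PARENT, "mammal")
    print(" : ".join(path("beagle", simpler)))
```

A popular media collection might use the shorter hierarchy, while the animal behavior magazine in the example would keep all four levels.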

It is important to develop standards and policies as stated above, but above all, these measures need to be appropriate for your organization and the people in it. The vocabulary should consist of terms that are used by the people who search and interface with the DAM system. In a non-technically oriented corporate setting, an academic approach to a controlled vocabulary will detract from its usefulness, and people are not likely to use it. For example, a “glass” (for drinking) is not a “drinking vessel.” Although in other settings this may be an appropriate term for the object, users would not search using this term, so it should not be part of the vocabulary.

A standard for a corporation involved in a technical area, an academic institution, museum, archive or historical collection needs to adhere to terminology from the specific area of knowledge. A sampling of thesauri or controlled vocabularies from various universes of knowledge, organized by discipline and topic, is available at Taxonomy Warehouse (http://www.taxonomywarehouse.com).

If your DAM serves a popular media organization such as a newspaper or magazine, take a good look at popular stock photo websites like Corbis (http://www.corbis.com) and Getty Images (http://www.gettyimages.com). Stock photos are those typically used for editorial and advertising purposes. You can learn a lot about the vocabulary used by photo researchers and the typical image user by viewing the keywords applied to images on these sites. Also, be sure to talk to image researchers in your organization. They offer the key to understanding how people search for images and the type of terms used when searching for this type of asset.

ARE WE THERE YET?

Work on a thesaurus is never complete. There will be a need for new terms and word relationships on an ongoing basis. As more topics and media are introduced, the vocabulary will grow and new term relationships will be introduced. It is important to have a skilled person on board to manage the thesaurus and its growth. Establishing a standard both for thesaurus development and for entry of keywords is highly recommended.

BUILD IT YOURSELF OR HIRE A CONSULTANT?

There may be (and probably is) at least one list of keywords within your organization that is currently being used to index images in either a DAM or other database system. One option is to hire a freelance taxonomist/thesaurus developer to organize it all into a coherent controlled vocabulary with relationships between the words, including synonyms and word variants. This person would consult with staff regarding their search strategies, retrieval needs and the vocabulary they use to find images. If your company has a library, you might be able to enlist a librarian on staff to assist with this task or help you find a consultant.

If there is no keyword list in place, you might consider purchasing a thesaurus or downloading a free vocabulary. These “pre-packaged” controlled vocabularies may not meet your every need, but will provide an excellent framework around which an appropriate list can be built.

If your DAM system does not have thesaurus management functionality, thesaurus management software is available for this purpose. Some of these products can be integrated with a DAM system.

There are also a few keywording software products on the market that provide a vocabulary and metadata structure appropriate for stock photo collections.

SUMMARY AND CONCLUSION

Construction and maintenance of a keyword thesaurus is highly recommended for effective retrieval of images in a DAM system. The hierarchical arrangement of terms and the relationships between terms provide a controlled vocabulary that can be used to index and retrieve images. The use of a thesaurus eliminates ambiguity, clarifies word meanings and can serve to offer suggestions for terms that can be used to index images. Using a controlled vocabulary in the form of a hierarchical thesaurus has strong implications for search and retrieval of images. It improves the quality of results by enabling searches for narrow terms when a broad term is entered and by accounting for variations in search terms.

The use of standards and a keywording policy reduces the subjectivity involved in indexing images. The NISO standard (see references) is a good place to start for developing a standard for your organization.

The vocabulary should meet the needs of the organization it serves. Photo researchers and stock photo sites are important sources of information about the vocabulary used for both indexing and retrieval of images.

Try to acquire a DAM system with a thesaurus construction and maintenance module. If you already have a system in place that does not have this feature, look into thesaurus maintenance software. Some of these packages can be integrated into a DAM system. Hire a trained professional to assist in organizing an already existing keyword list, or start from scratch by acquiring a thesaurus from a free or paid source. Thesaurus maintenance is an ongoing process, but the investment of time and money will enable an organization to effectively leverage visual assets in a DAM system.

ADDITIONAL RESOURCES