Background

Building a test collection is a major undertaking, and test collections are therefore scarce. But the long process of building one yields many insights into the field of information retrieval. This article describes the process from collecting the documents of the underlying corpus, through the creation of search topics, to the instructions given to the relevance judges, including the choice of categories for assessing documents for relevance and for intended reader group. The article also presents the structure of the recall bases and the representation of the collection documents in the two indexes, with and without split compounds. To show how a test collection such as MedEval can be used, the article presents a selection of substantial differences between documents written for professionals and documents written for laypersons, and finally presents experimental runs for the study of retrieval of documents aimed at the two target reader groups.

When the decision was made to build a new test collection, the Department of Swedish at the University of Gothenburg was involved in research projects in medical language processing. There was also a growing interest in information retrieval research. As no Swedish medical test collection existed, creating one seemed to be a good investment in knowledge and resources, even though it would occupy a team of people for many months.

One existing medical test collection, albeit in English, is OHSUMED [1]. It is built from nearly 350,000 MEDLINE references, and the documents it contains thus have medical professionals as their intended readers. The OHSUMED documents are assessed on a three-grade scale: definitely, possibly, and not relevant. OHSUMED contains 106 topics generated by physicians from authentic situations. Each topic consists of information about a specific patient and an information request concerning this patient.

Methods

The collection documents

The MedEval test collection differs from OHSUMED in several ways. It is built on documents from the MedLex medical corpus [2] and contains documents intended both for medical professionals and for laypeople. MedLex consists of scientific articles from medical journals, teaching material, guidelines, patient FAQs, health care information, etc. The set of documents used in MedEval is a snapshot of MedLex in October 2007, approximately 42,200 documents or 15 million tokens (see Table 1). The documents are stored in the trectext format [3].

Table 1 The genres of the documents in the MedEval document collection

Indexes

The MedEval test collection, in its present form, has two indexes: one where the documents are converted to lower case, tokenized, and lemmatized, and one where the compounds are additionally decomposed. In the second index, each compound term is indexed as a whole together with its constituents. For instance, the compound saltkoncentration (English: ‘salt concentration’) is indexed as saltkoncentration, salt, and koncentration. This makes it possible to find matches when a simplex term in a query is used in a document only as a compound constituent, or when a query contains a compound while a document contains only one or both of the corresponding simplex constituent terms. Dealing with compounds and their constituents is important in languages, such as Swedish, where the process of compounding is very productive. As the ratio of compounds in Swedish texts is around 10% (see Table 2), a major part of written information is stored in these compound terms [4].
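
As an illustration of the second index, a minimal sketch of compound-aware term expansion is given below. The splitting method is hypothetical: MedEval’s actual decompounding tool is not described here, so a toy lexicon lookup stands in for it, and the lexicon contents are invented.

```python
# Toy lexicon of known simplex lemmas; a real decompounder would use a full
# lexicon and handle linking elements such as the Swedish -s-.
LEXICON = {"salt", "koncentration"}

def split_compound(term: str) -> list[str]:
    """Return the first two-part split whose parts are both known simplexes."""
    for i in range(2, len(term) - 1):
        head, tail = term[:i], term[i:]
        if head in LEXICON and tail in LEXICON:
            return [head, tail]
    return []

def index_terms(term: str) -> list[str]:
    """Index the compound as a whole together with its constituents."""
    return [term] + split_compound(term)

print(index_terms("saltkoncentration"))
# ['saltkoncentration', 'salt', 'koncentration']
```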

Table 2 Type and token frequencies of terms

Topics

For the creation of the MedEval information needs, also called topics, two medical students in their fourth year of studies were hired. Their instructions were to create information needs that would be plausible in real medical situations, whether experienced by doctors or by patients. Guided by explorative searches, the topic creators were asked to adjust the complexity of the topics so that the plausible number of relevant documents for each topic would be at least five but not much more than 50. In the first stage, 100 topics were created; 62 of these were used in the collection. The process of creating topics was inspired by [5] and is described in more detail in [6].

A topic consists of a title, a description and a narrative. The title is a short phrase summarizing the information need. The description is concise information about the topic, usually in the form of a question or a request. The narrative is a few sentences long and stipulates what makes a document relevant to the topic; it serves as the guideline for the assessors when judging the relevance of the documents in the next stage. This conforms to the format of the TREC topics [3]. An example of a topic is given in Figure 1 together with an English translation.

Figure 1

Sample of information need. An example of an information need, Topic 51, with ID, title, description, and narrative. The information need is first given in Swedish, as in the collection, thereafter in an English translation.

Selecting documents to assess

An ideal test collection would have a complete set of relevance judgments, with every document assessed for relevance to every information need. With a collection of over 42,000 documents and 62 information needs, as in MedEval, and an estimated average of 8 minutes to assess each document, it would take four persons working 40 hours a week more than 42 years to finish the assessments.
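
The estimate can be verified with a back-of-the-envelope calculation. The figures come from the text; the assumption of 52 working weeks per year is mine.

```python
documents = 42_200
topics = 62
minutes_per_assessment = 8

total_hours = documents * topics * minutes_per_assessment / 60
person_years = total_hours / (40 * 52)  # 40 hours a week, 52 weeks a year
print(round(person_years / 4))          # years for four persons: 42
```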

Instead of assessing all documents for all topics, subsets of documents with a high probability of being relevant to each topic were extracted. These subsets were selected in a series of different runs using basic queries. Since the time and economic resources for creating MedEval were limited, the extraction of documents was done on a small scale with only one search engine, namely Indri/Lemur [7].

Four different search methods were used in the extraction, that is, four runs for every information need. For each run, the 100 documents ranked most likely to be relevant were extracted, provided that many were retrieved. Two searches were done in each index, with and without decomposed compounds: one intended to be broad and one more specific. The number of documents assessed for each information need was between 115 and 358.

For each topic, the result of the extraction was four lists of document IDs. These were merged into one file per topic; the IDs were sorted in alphanumerical order and duplicates were removed. This is important to avoid bias, as the assessors must not know how the documents were ranked in the initial runs or in how many searches each document was retrieved. The documents corresponding to the extracted IDs were printed on paper and collated into separate bundles for each topic. The papers were printed on one side only, to avoid a negative bias against short documents ending up on the left-hand page of a spread. The method for selecting documents to assess was based on methods described in [3] and [8].
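
The merge step can be sketched as follows. The run contents are invented for illustration; only the principle of pooling, deduplicating, and sorting alphanumerically is taken from the text.

```python
# Four ranked runs for one topic (invented IDs): broad and specific queries,
# each in the plain and the decompounded index.
runs = [
    ["doc-104", "doc-017", "doc-290"],
    ["doc-017", "doc-351", "doc-104"],
    ["doc-290", "doc-104"],
    ["doc-351", "doc-008"],
]

# Pool the runs, remove duplicates, and sort alphanumerically, so that the
# assessors can infer neither the original ranks nor the retrieval counts.
ids_to_assess = sorted({doc_id for run in runs for doc_id in run})
print(ids_to_assess)
# ['doc-008', 'doc-017', 'doc-104', 'doc-290', 'doc-351']
```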

Relevance judgments

The extracted documents were assessed for relevance with respect to the corresponding information needs. Four medical students were hired to do the assessments; these were not the same students who created the MedEval topics. Domain knowledge is essential for understanding the topics and the contents of the documents, and also for consistency in judging [9].

It may be expected that the greater the judges’ subject knowledge, the higher will be their agreement on relevance judgments. Subject knowledge seems to be the most important factor affecting the relevance judgment as far as human characteristics are concerned. [9], p. 341

For each of the 62 topics, an assessor read through the documents to be assessed and decided, for each document, the intended group of readers and the degree of relevance to the topic. The documents for each individual need were assessed by one and the same assessor for reasons of consistency. It is not unusual for assessors to disagree on the relevance of a certain document. However, when considering two documents, assessors tend to agree on which one is the more relevant. As research in information retrieval according to the Cranfield paradigm [10] is based on relative relevance scores, not absolute relevance scores, this makes the judging sufficiently consistent [11]. This has been concluded in several studies, as early as in [9].

It is most significant to note that the relative relevance score of documents in a group [...] may be expected to be remarkably consistent even when judges with differing backgrounds make the relevance judgments. Thus, it may be more profitable to compare the relative position of documents in a set than to compare the relevance ratings assigned to individual documents. [9], p. 341

The findings of Saracevic are supported by later studies conducted by Voorhees [12]. She claims that the important question is not how well assessors agree with one another, but how the results change with these differences in assessment. Her conclusion is that despite differences in assessments between assessors, the evaluation behavior remains the same. Supported by [9], [11], and [12], the creators of the MedEval test collection came to the conclusion that one assessor per topic would be sufficient. More important for obtaining a consistent test collection was not to split any set of documents assessed for a certain topic between different assessors.

The MedEval relevance assessments were made on a four-grade scale, 0–3, where 0 is ‘Not at all relevant’ and 3 is ‘Highly relevant’ [13]. This scale is easily turned into a binary scale by stating that documents with the lower grades are to be considered non-relevant and those with higher grades relevant. An impatient user, who is satisfied with one or a few documents, could have only documents with relevance score 3 considered relevant, while a user who is willing to take her time, and who wants as many documents as possible, could let all documents with relevance 1–3 be considered relevant.
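
A minimal sketch of this binarization is given below; the function and threshold names are illustrative and not part of the collection.

```python
def binarize(grade: int, threshold: int) -> bool:
    """A document counts as relevant if its grade reaches the threshold."""
    return grade >= threshold

grades = [3, 1, 0, 2, 1]                                # assessor grades
impatient = [binarize(g, threshold=3) for g in grades]  # only grade 3 counts
patient   = [binarize(g, threshold=1) for g in grades]  # grades 1-3 count
print(impatient)  # [True, False, False, False, False]
print(patient)    # [True, True, False, True, True]
```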

The relevance judged by the assessors was topical relevance, that is, how well a document corresponds to a topic. The assessors were instructed not to let user relevance influence this score. Each document was judged on its own merits; the novelty of the contents of a document was not to be taken into account.

Target groups

In addition to topical relevance, the assessors judged each document for target reader group, that is, which group of readers was the intended one: Patients, if a document was written for laypersons, or Doctors, if it was written for medical professionals. This assessment was not based on any statistical or formal factors, only on the assessors’ judgments. Some documents were difficult to classify as they were not clearly aimed at a certain group. A number of these documents were labeled with different target groups when assessed for different topics (see Table 2).

For a classification of documents according to intended reader group to be useful, there must be a measurable difference between the document classes. Table 2 shows statistics for different categories of terms in different subsets of the collection. In each set, duplicates were removed in cases where a document had been assessed for more than one topic. The subsets considered are described below. Full form types are the original terms of the documents before lemmatization (with inflections), and lemma types are the same terms after lemmatization (reduced to base form).

Entire collection: All documents of the MedEval collection.

Assessed documents: All documents that have been assessed for any topic.

Doctors assessed: All documents that for at least one topic have been assessed to have target group Doctors.

Patients assessed: All documents that for at least one topic have been assessed to have target group Patients.

Common files: All documents that for at least one topic have been assessed to have target group Doctors and for another to have target group Patients.

Doctors relevant: All documents that for at least one topic have been assessed to have at least relevance grade 1 and to have target group Doctors.

Patients relevant: All documents that for at least one topic have been assessed to have at least relevance grade 1 and to have target group Patients.

Before counting frequencies, the files were cleaned of tags, IDs, dates (in the date tag, not in the actual text), web information, and punctuation marks. As the tokens were counted after this cleaning, the number of tokens in this table is not consistent with the number of tokens in Table 1.

The number of tokens per document is significantly smaller for the entire collection than for any subset. This means that there is a large number of short documents that were not retrieved by any query when the documents were extracted. This is not surprising, since short documents contain few terms that can match the queries. The finding that unjudged documents on average tend to be shorter than judged documents, both relevant and non-relevant, is consistent with the results of experiments described in [14]. One reason, according to Karlgren, is that non-retrieved items often contain tables and numerical information. He also concludes that longer documents have a greater chance of touching on relevant subjects, but unfortunately also on confusingly similar subjects that are non-relevant.

The documents in the set ‘Patients assessed’ had only 57% as many tokens per document as the documents in ‘Doctors assessed’. Even though there were over 1,000 more documents in ‘Patients assessed’ than in ‘Doctors assessed’, there were over 50,000 more lemma types in the doctor documents and almost 30,000 more lemma compound types. The type-token ratio, as used here, measures the average number of times each type, or word form, is used; this measure grows as the size of the set of documents considered grows. This makes it even more noteworthy that the type-token ratio for the patient documents is significantly higher than for the doctor documents, even though the doctor documents contain more tokens. What this signifies is that the lay texts contain fewer distinct word forms, but each form is used a larger number of times.
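
A short sketch of the ratio as described here, that is, tokens divided by types, may clarify the measure. The sample phrase is invented for illustration.

```python
def uses_per_type(tokens: list[str]) -> float:
    """Average number of times each distinct word form (type) is used."""
    return len(tokens) / len(set(tokens))

sample = "ont i magen och ont i huvudet".split()  # 7 tokens, 5 types
print(round(uses_per_type(sample), 2))            # 1.4
```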

The average word length in ‘Doctors assessed’ was 6.29 characters, compared to 5.73 for ‘Patients assessed’. The ratio of compound tokens was also higher in the doctor documents: 0.128, compared to 0.098.

Additional file 1 illustrates the fact that the doctor documents contain more and longer terms and more compounds than the patient documents. The file shows frequencies of all full form types of strings beginning with the randomly chosen term förmak ‘atrium’ in ‘Patients assessed’ and ‘Doctors assessed’ respectively. The patient documents have 18 full form types beginning with förmak, while the doctor documents have 75, more than four times as many.

Looking at all instances of strings beginning with förmak in the two sets of documents, for professionals and laypeople, there is a significant difference. In the patient documents, 66 tokens of 372, or 17.7%, are nouns in the definite form, while the corresponding numbers for the doctor documents are 89 of 932 tokens, or 9.6%. A hypothesis for why this is so is that medical professionals often discuss matters from a generic point of view, while laypeople discuss specific cases.

Not only Swedish nouns but also adjectives are inflected for definiteness and number. When comparing the word forms of adjectives in the doctor and patient documents, it is evident that the indefinite non-neuter singular form has relatively higher frequencies in the patient documents. This is the form a patient uses when speaking about him- or herself, or the form a doctor uses when addressing a patient, but also the form a physician uses when describing ‘a patient’. The definite and the plural adjective inflectional forms are identical to each other, but differ from the indefinite singular forms for most adjectives. This shared form is used, for example, when talking about ‘the patient’, ‘the patients’, or ‘patients’. Table 3 shows how the frequencies differ for a few adjectives.

Table 3 Frequencies of adjectives

The conclusion that professionals discuss generic cases while laypeople discuss specific cases is supported by a difference that can be seen in frequency tables of multi-word expressions in doctor and patient documents [6]. High frequencies are found for phrases with meanings such as in patients with or of patients with. Frequencies are also high for indefinite noun phrases such as in treatment of or for treatment of. The patient documents, on the other hand, contain phrases describing specific patients or specific cases, for example phrases that contain the pronoun you and noun phrases in the definite form: when the treatment is completed.

Overall, the documents written for the doctor target group tend to be written in a more detached way compared with the patient documents, which are more interactive in their approach, addressing the reader directly. While the professional documents tend to discuss research results or cases in general, the lay documents often discuss specific cases. This difference in approach manifests itself, for example, in the features described above: the patient documents contain more nouns and adjectives in the definite form and more pronouns in the first or second person, while the doctor documents predominantly have nouns and adjectives in the indefinite form and pronouns in the third person. The professional documents also tend to be written in a more formal way, with many multi-word phrases recurring with high frequencies. As there is an apparent difference between the documents written for the professional and layperson target groups, these differences could be used for a precategorization of documents according to genre. Such a categorization could be stored in a separate field in the document representations.

An interesting research question for future projects could be to study the benefit of lemmatizing inflected words while keeping the inflectional information in tags, or of recording the tendency of a text in terms of generic vs specific reference. This could be a way to keep the higher recall gained by lemmatization, but still use inflectional information for discrimination [6].

User groups

The MedEval test collection allows the user to state a user group: None (no specified group), Doctors, or Patients. This choice directs the user to one of three scenarios. The None scenario contains the topical relevance grades as made by the assessors. The Doctors scenario contains the same grades, with the exception that the grades of documents marked with the Patients target group are downgraded by one. In the same way, the Patients scenario has the documents marked with the Doctors target group downgraded by one. This means that, for a doctor user, patient documents given relevance 3 by the assessor are graded 2, documents given relevance 2 are graded 1, and documents given relevance 1 are graded 0. The same is done with the doctor documents in the Patients scenario. The idea is that a document that is written for a reader from one target group but retrieved for a user from the other group is not non-relevant, but less useful than a document from the correct target group. Put differently, a document intended for patients would contain information that doctors (hopefully) already know. On the other hand, documents intended for doctors, even though they might be topically relevant to a patient’s need, run a great risk of being written in such a way that a patient will have problems grasping the whole content. This is a way of introducing utility without performing user studies.
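
A minimal sketch of the scenario adjustment is given below; the parameter names are illustrative, not taken from the collection format.

```python
def scenario_grade(grade: int, target_group: str, user_group: str) -> int:
    """Downgrade by one if the document targets the opposite reader group."""
    if user_group == "None" or target_group == user_group:
        return grade
    return max(grade - 1, 0)

print(scenario_grade(3, target_group="Patients", user_group="Doctors"))  # 2
print(scenario_grade(1, target_group="Doctors", user_group="Patients"))  # 0
print(scenario_grade(2, target_group="Doctors", user_group="Doctors"))   # 2
```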

Adjusting relevance in the manner described affects the scenario recall bases. Since relevance grades are downgraded for documents of the opposing target group, there will be fewer relevant documents in the Doctors and Patients scenarios than in the None scenario. This is demonstrated in Figure 2, where the ideal cumulated gain for the three scenarios of Topics 28, 36, and 92 is shown. The ideal cumulated gain is the maximum score of retrieved information possible at each position in a ranked list of documents [15]. The score for each position is the sum of all relevance scores so far in the ranked list.
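
A sketch of the ideal cumulated gain computation, following the description above and [15]: the relevance grades of the recall base are sorted in descending order and summed position by position.

```python
def ideal_cumulated_gain(recall_base: list[int]) -> list[int]:
    """Cumulated gain of the best possible ranking of the recall base."""
    out, total = [], 0
    for grade in sorted(recall_base, reverse=True):
        total += grade
        out.append(total)
    return out

print(ideal_cumulated_gain([1, 3, 0, 2, 2, 1]))  # [3, 5, 7, 8, 9, 9]
```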

Figure 2

Recall bases for three topics and three scenarios. The recall bases of Topics 28, 36, and 92 represented as ideal cumulated gain for the three scenarios: None, Doctors, and Patients. For Topic 28, most of the highly relevant and fairly relevant documents were assessed as having target group Doctors. Topic 36 had the relevant documents spread fairly evenly between the Doctors and Patients target groups. Topic 92 had no documents of any relevance grade marked with target group Doctors. Thus the None and Patients ideal gain vectors coincide fully, while the cumulated gain for the Doctors scenario is very low, originating from the downgraded patient documents.

The three topics of Figure 2 show different characteristics with respect to the number of relevant doctor and patient documents. Topic 36 has fairly similar cumulated gain curves for the Doctors and Patients scenarios. Topic 28 has a majority of doctor documents, while Topic 92 has no documents of any relevance grade marked with target group Doctors; thus the None and Patients ideal gain vectors coincide fully, while the cumulated gain for the Doctors scenario is very low, originating from the downgraded patient documents.

Results

To demonstrate the effectiveness of search terms from the different styles of language of the two target groups, a number of synonym pairs were used as search keys for corresponding topics. Each synonym pair consisted of one neoclassical term, belonging to the expert register, and one lay term. The terms of each pair were run separately as single search key queries, and also combined in one query. All queries, three for each topic, were run in the doctors scenario and in the patients scenario. Note that for each query the resulting ranked list of documents is the same for both scenarios; it is the recall bases, and thus the relevance grades of the retrieved documents, that differ.
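
A small sketch of how the three queries per topic could be composed is given below. The synonym pair is an invented example, and the use of Indri’s #combine operator for the combined query is an assumption; the article does not specify the exact query syntax used.

```python
def topic_queries(expert_term: str, lay_term: str) -> dict[str, str]:
    """One query per term, plus a combined query in Indri-style syntax."""
    return {
        "expert":   expert_term,
        "lay":      lay_term,
        "combined": f"#combine({expert_term} {lay_term})",  # assumed syntax
    }

# Invented example pair: a neoclassical expert term and its lay synonym.
print(topic_queries("hypertoni", "högt blodtryck"))
```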

As MedEval, to the authors’ knowledge, is the first medical test collection with user groups, there are no earlier equivalent tests. However, [16] address the fact that medical experts and non-experts express themselves in different ways, and that this affects search results. The authors are motivated by the empowerment of laypersons and discuss how to exchange information across user groups. The goal is that a search using non-expert terms should retrieve all types of documents written on the topic. They see the problem as a question of automatic alignment between specialized terminology and general terminology and enrich the information retrieval system with a set of links between corresponding concepts in lay and professional language.

The contrast between Swedish professional medical language and Swedish lay language is addressed in [17]. The authors selected documents concerning cardiovascular disorders from the MedLex corpus [2]. Their findings may be used as a basis for future studies on how to differentiate searches with the purpose of retrieving documents for the different user groups. The findings have also inspired the choice of entries in Table 2, which shows differences between the sets of doctor and patient documents.

Measures of effectiveness

The effectiveness of the queries described above was measured as recall after 10, 20, and 100 retrieved documents, representing the impatient, the slightly less impatient, and the patient user. The effectiveness was also measured as normalized discounted cumulated gain, nDCG [15]. The nDCG is based on the cumulated gain described earlier, but uses a discounting factor which reduces the amount of relevance score added for each document in the ranked list. The relevance score is discounted by a logarithmic function of the position number; the assumption is that the later in the list a document is found, the less it is worth to the user. The normalization means that the discounted cumulated gain is compared to the ideal discounted cumulated gain at each position. Thus the nDCG value summarizes the effectiveness at all earlier positions in the ranked list, and compares this summarized effectiveness to the maximum value possible at each position. As the nDCG value is relative to the maximum value possible, it varies between 0 and 1 and gives no bias to topics with small or large recall bases.
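
A sketch of the measure as described above is given below, in the spirit of the DCG formulation of [15]: the first position is left undiscounted and later positions are discounted by the base-2 logarithm of their rank. Exact formulations of the discount vary in the literature, so this should be read as one reasonable instantiation; it also assumes the recall base contains at least one relevant document and is at least as long as the ranked list.

```python
import math

def dcg(grades: list[int], b: int = 2) -> list[float]:
    """Discounted cumulated gain at each position, log base b discount."""
    out, total = [], 0.0
    for i, g in enumerate(grades, start=1):
        total += g if i < b else g / math.log(i, b)
        out.append(total)
    return out

def ndcg(ranked: list[int], recall_base: list[int]) -> list[float]:
    """DCG of the ranked list, normalized by the ideal DCG at each position."""
    ideal = sorted(recall_base, reverse=True)[:len(ranked)]
    return [d / i for d, i in zip(dcg(ranked), dcg(ideal))]
```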

Even though recall and nDCG both measure effectiveness, there is no absolute correlation between them. Recall is calculated on a binary scale; in this case, documents with relevance score 1 are considered non-relevant. The nDCG, on the other hand, is calculated on the four-grade scale, 0–3, and all scores from 1 to 3 are included in the measure. This entails that the nDCG value can seem high compared to the recall value if the ranked list includes documents with relevance score 1. On the other hand, the recall value can seem high compared to the nDCG value if there are relevant documents late in the ranked list.
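
Reusing the ndcg sketch above, an invented example illustrates the divergence: a ranked list containing many grade-1 documents gets a clearly non-zero nDCG, while recall, which here counts only grades 2–3 as relevant, stays at zero.

```python
ranked = [1, 1, 0, 1, 1, 1, 1, 0, 0, 0]        # grades of 10 retrieved docs
recall_base = [3, 3, 2, 2, 1, 1, 1, 1, 1, 1]   # all assessed grades (invented)

found = sum(1 for g in ranked if g >= 2)
recall_at_10 = found / sum(1 for g in recall_base if g >= 2)
print(recall_at_10)                             # 0.0: no grade 2-3 retrieved
print(round(ndcg(ranked, recall_base)[-1], 2))  # ~0.35: grade-1 docs count
```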

The runs

Two of the more striking results are the runs for Topics 51 and 66, shown in Tables 4 and 5. For both of these topics the lay terms have very low effectiveness in the doctors scenario: there is no gain in using the lay term, either as a single search key or in combination with the neoclassical term. In the patients scenario there is less difference between the expert and the lay terms used as single search key queries. The best result is achieved by using both terms in combination.

Table 4 Runs for Topic 51
Table 5 Runs for Topic 66

Topic 63, in Table 6, shows low recall for the lay term in the doctors scenario, as well as low recall and nDCG for the expert term in the patients scenario. In the doctors scenario the recall does not improve by adding the lay term to the expert term query; however, the nDCG value improves, which means that relevant documents now appear earlier in the list. In fact, the nDCG value for the lay term is surprisingly high. A closer look at the ranked list shows no fewer than six documents with relevance score 1 among the first ten, giving a high nDCG value. For this topic there is no gain in the patients scenario in combining the two terms, most likely because the neoclassical term has very low effectiveness.

Table 6 Runs for Topic 63

For Topic 48, in Table 7, we again see low results for the lay term in the doctors scenario, while there is less difference between the terms in the patients scenario. In both scenarios we see a significant improvement when the two terms are used in combination.

Table 7 Runs for Topic 48

For Topic 7, in Table 8, the neoclassical term gives the best results in both the doctors and the patients scenarios, while for Topic 83, in Table 9, the lay term gives the best results in both cases. The patients scenario for Topic 7 shows no gain in combining the search keys; the best result is still obtained with the expert term alone. Topic 83, which has an effective lay term even in the doctors scenario, shows improved effectiveness when the terms are combined.

Table 8 Runs for Topic 7
Table 9 Runs for Topic 83

Topic 68, in Table 10, has reasonable results for all single search key queries in both scenarios. The best results are for the expert term in the doctors scenario and for the lay term in the patients scenario. In both scenarios the recall improves when the terms are combined, but the nDCG value does not.

Table 10 Runs for Topic 68

In most cases, and not surprisingly, the expert terms are most effective in the doctors scenario and the lay terms in the patients scenario, but there are both expert and lay terms that achieve the best results in both scenarios. The expert terms tend to give better results in the patients scenario than the lay terms do in the doctors scenario. However, more extensive studies, including comparisons of search results with relative frequencies of lay and expert terms, are needed before definite conclusions can be drawn.

Conclusions

This article describes the process of building a test collection for information retrieval purposes. The process includes the collection of a corpus, the creation of search topics, and decisions about relevance assessments, such as selecting documents to assess and deciding on the assessment categories for the judges. Furthermore, the process includes how to represent the recall bases and how to represent the documents in the collection indexes.

The article goes on to show a number of aspects of medical information retrieval that can be studied utilizing the MedEval test collection. The main novelty of the collection is the marking of document target groups, Doctors and Patients, together with the possibility of choosing a user group. This opens up new areas of research in Swedish information retrieval, such as how to retrieve documents suited to different groups of users. As was shown in the example runs, search keys from different registers behave differently in the doctors and the patients scenarios.

A number of differences between the documents written for experts and for non-experts are presented along with the suggestion that these differences may be utilized in future studies of document retrieval for the different user groups.

Not least important is that MedEval is a Swedish domain-specific test collection. A test collection in a language other than English allows a new range of research possibilities for studying the impact of natural language processing on information retrieval.

The Department of Swedish at the University of Gothenburg is in the process of making the MedEval test collection available to academic researchers.