In the next three subsections, we address the three research questions from Sect. 1 with a series of experiments:
1. What is the influence of the collection size? (Sect. 5.1)
2. What is the influence of the background collection? (Sect. 5.2)
3. What is the influence of multi-word phrases? (Sect. 5.3)
In each subsection, we address two of the four evaluation collections. Table 4 shows an overview. Each subsection is concluded with a discussion of the experimental results in the light of the hypotheses in Sect. 3.5.
Table 4 Overview of experiments per research question
What is the influence of the collection size?
Table 5 shows the sizes of the four document collections. The Author Profiling and QUINN collections are large, whereas the other two are relatively small in terms of number of words. QUINN has a large number of documents, but since we only have access to the abstracts of the news articles, the documents are short (63 words on average). In Personalized Query Suggestion, the number of documents is reasonable, but the documents are also relatively short, since they consist of metadata or the first 200 words of a PDF. The collections in Medical Query Expansion are the smallest, with only one document per topic, of 609 words on average.
Table 5 Sizes of the four document collections
We address two collections in this section: the Author Profiling collections, where we evaluate term scoring for increasing word counts, and discharge summaries for Medical Query Expansion, where we investigate how different methods perform on collections with a small number of words.
The influence of collection size on the effectiveness of term scoring
We investigate the effect of the collection size by manipulating the Author Profiling collections as follows: we split all documents from the collection into paragraphs, randomize the order of the paragraphs, and then create subcorpora with increasingly more paragraphs from the collection, up to {100, 500, 1000, 5000, 10,000, 20,000, 30,000, 40,000, 50,000} words. We then evaluate term extraction for each subcorpus. We increase the size of the corpus by paragraph rather than by document because documents are relatively long and each covers a single topic; as a result, the presence or absence of a complete document would strongly influence the presence or absence of topics in the list of extracted terms, especially in the smaller collections. The randomized sampling of paragraphs ensures a smoother curve. Because of the randomization component, we run each experiment five times and report averages over these five runs.
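To make the sampling procedure concrete, the sketch below gives a minimal Python version of it; the paragraph splitting on blank lines and all names are illustrative assumptions, not the authors' implementation. Because the paragraph order is fixed after shuffling, the subcorpora are nested prefixes of the same shuffled paragraph sequence.

```python
import random

# Hypothetical sketch of the subcorpus construction described above.
# `documents` is assumed to be a list of document strings for one Author
# Profiling collection; splitting paragraphs on blank lines is an assumption
# about the source format.
SIZES = [100, 500, 1000, 5000, 10000, 20000, 30000, 40000, 50000]

def build_subcorpora(documents, sizes=SIZES, seed=None):
    paragraphs = [p for doc in documents for p in doc.split("\n\n") if p.strip()]
    random.Random(seed).shuffle(paragraphs)   # randomize the paragraph order
    subcorpora = {}
    for size in sizes:
        words, collected = 0, []
        for paragraph in paragraphs:          # add paragraphs up to the word budget
            collected.append(paragraph)
            words += len(paragraph.split())
            if words >= size:
                break
        subcorpora[size] = " ".join(collected)
    return subcorpora
```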
We evaluate all five term scoring functions for increasing collection sizes.Footnote 11 For PLM, we set \(\lambda =0.1\), which was suggested as optimal in the original paper (Hiemstra et al. 2004). PLM, FP and KLIP (KLI) require a background collection. We used a corpus of generic English for this, the Corpus of Contemporary American English (COCA) (Davies 2009), which contains 450 million words. The owners of this corpus provide a word frequency list and n-gram frequency lists that are free to download.Footnote 12
Figure 1 shows mean average precision scores over the users in the Author Profiling data for increasing collection sizes. For CB, we evaluated both \(|G|=10\) and \(|G|=100\) for the reference set of top-frequent terms G; they give almost the same results. Apparently, the distribution of co-occurrence frequencies does not change much when we use a larger reference set of top-frequent terms in the collection. Therefore, we only show the results for \(|G|=10\) here. Of the informativeness methods, PLM, KLI and FP give better results than CB. The results also show that KLI and FP reach their maximum effectiveness at a collection size of 20,000 words and do not improve further with larger collections. PLM and CB reach their maximum earlier: PLM does not improve after 10,000 words, and CB’s effectiveness improves only slightly after 1000 words and not at all after 5000 words. This is not surprising given the original purpose of the methods: PLM and CB were designed for term extraction from a single document.
The phraseness methods show an interesting pattern: both KLP and C-Value perform better than any of the informativeness methods for collections larger than 20,000 words. There are two reasons for this: first, multi-word terms are important in the scientific domain and are judged as better terms by human assessors; second, multi-word terms are less sparse in larger collections.
The graph also shows that KLP performs better than C-Value. This is an interesting finding because the two methods use different criteria for selecting terms: both favor longer terms over shorter terms, but in C-Value, the score for a term is discounted if the term is nested in frequent longer terms; in KLP, the frequency of the term as a whole is compared to the frequencies of the unigrams that it contains. Thus, KLP prefers frequent multi-word terms consisting of lower-frequency unigrams, while C-Value prefers terms that are not nested in longer terms. Table 6 shows example output for KLP and C-Value to illustrate this difference. For completeness, the example output for the informativeness methods is also added to the table.
Table 6 Example output of each of the term scoring methods for one of the Author Profiling collections: the top-10 terms of the expert profile generated from the collection of scientific articles authored by one person, who has obtained a PhD in Information Retrieval
The lists for KLP and C-Value are similar, showing largely the same terms, although their ranks differ. Terms that are selected by KLP but not by C-Value are ‘new york’ and ‘entity ranking topics’; terms that are selected by C-Value but not by KLP are ‘category information’ and ‘target categories’. ‘new york’ is probably the clearest example of the difference between the methods: in this corpus, the term ‘new york’ is almost as frequent as the unigram ‘york’. In other words, ‘york’ almost only occurs together with ‘new’, which makes ‘new york’ a very tight n-gram and therefore a strong phrase according to the KLP criterion. For C-Value, however, the phrase is not very strong because it is nested in a number of frequent longer phrases such as ‘new york university’ and ‘new york ny’.
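To make the contrast between the two criteria concrete, the following minimal sketch (not the authors' implementation) computes both scores for a candidate term, following the descriptions above; the data structures `tf` (frequencies of candidate terms and unigrams), `N` (total token count) and `nested_in` (longer candidate terms containing a term) are hypothetical.

```python
import math

def klp_score(term, tf, N):
    """Pointwise KL phraseness: multi-word terms that are frequent as a whole
    but composed of lower-frequency unigrams (e.g. ('new', 'york')) score high."""
    p_term = tf[term] / N
    p_independent = 1.0
    for w in term:                      # probability under unigram independence
        p_independent *= tf[(w,)] / N
    return p_term * math.log(p_term / p_independent)

def c_value_score(term, tf, nested_in):
    """C-Value for a multi-word term: its frequency is discounted by the average
    frequency of the longer candidate terms it is nested in."""
    freq = tf[term]
    longer = nested_in.get(term, [])
    if longer:
        freq -= sum(tf[t] for t in longer) / len(longer)
    return math.log2(len(term)) * freq
```

Under this sketch, ('new', 'york') receives a high KLP score because 'york' rarely occurs outside the bigram, while its C-Value score is pushed down by frequent longer phrases such as 'new york university'.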
Comparing methods for small data collections
As shown in Table 5, the Medical Query Expansion data collection is small (one document per topic, of 609 words on average). Therefore, we use this collection to evaluate the performance of the term scoring methods for small data collections. Medical Query Expansion is a use case with an extrinsic evaluation measure: nDCG for the set of retrieved documents (see Sect. 4.4). In order to evaluate the term scoring methods, we extract terms from the discharge summary belonging to the topic and add an increasing number of top-ranked terms (0, 2, 5, 10, 20) to the query. Table 7 shows an example query with expansion terms.
Table 7 Example query from the CLEF eHealth data for the Medical Query Expansion collection with the top-5 terms extracted from the discharge summary using five different term scoring methods
We experiment on the training set provided by CLEF (5 topics) with the following settings for query expansion:
(a) the length of the original query: only the words from the title of the topic, or the words from the title and the description of the topic;
(b) the operator for multi-word terms: #1, #2 or #uw10;Footnote 13
(c) the weights for the expansion terms: uniform (each term gets weight \(1/T\), where T is the number of expansion terms) or the score that each term received from the term scoring algorithm.
For PLM, we optimize the parameter \(\lambda\) on the training set, investigating values ranging from 0.0001 to 1.0; 0.01 turned out to be optimal. For KLIP, we set \(\gamma =0.5\). We found that title-only gave better results than title+description, that the operator #2 was slightly better than the other two, and that term scores as weights were slightly better than uniform weights. Below, we show the results obtained on both the training set (5 topics) and the test set (50 topics) for these settings. The bottom row of Table 7 shows an example of an expanded Indri query.
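As an illustration of how such an expanded query could be assembled (the authors' actual query is shown in the bottom row of Table 7), the sketch below uses the settings that worked best: title-only query text, the #2 ordered-window operator for multi-word expansion terms, and the term scores as weights. The relative weight of 1.0 for the original query and the function name are assumptions.

```python
# Illustrative sketch, not the authors' code.
def build_indri_query(title, expansion_terms):
    """`expansion_terms` is a hypothetical list of (term, score) pairs
    produced by one of the term scoring methods."""
    parts = []
    for term, score in expansion_terms:
        words = term.split()
        # single words are used as-is; multi-word terms are wrapped in #2(...)
        expr = words[0] if len(words) == 1 else "#2({})".format(" ".join(words))
        parts.append("{:.4f} {}".format(score, expr))
    return "#weight( 1.0 #combine({}) {} )".format(title, " ".join(parts))

# Example:
# build_indri_query("hip pain", [("blood pressure", 0.12), ("mg", 0.30)])
# -> '#weight( 1.0 #combine(hip pain) 0.1200 #2(blood pressure) 0.3000 mg )'
```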
The results are in Fig. 2. Surprisingly, we seem to obtain positive results on the training set that are not replicated on the larger test set. The mean nDCG for the test queries without expansion terms is very close to the mean nDCG for the training queries, but adding terms from the discharge summary does not have the seemingly positive effect that it has on the training set. Since the training set is small (only 5 topics), we suspect that the different behavior on the training and test sets is due to individual differences between topics. The graphs in Fig. 2 represent averages over all topics; the standard deviations are relatively large: between 0.20 and 0.23 for each point in the graphs. For some topics the expansion terms have a positive effect, for some a negative effect, and for others no effect at all. A closer look at the top-10 extracted terms for each of the term scoring functions shows that the 20 most frequently occurring terms are the following:
mg | tablet | right | blood pressure | sig one | one | mg tablet sig | admission date | mg po | sex | tablets | tablet sig | patient | sig | po | day | mg tablet | discharge | one tablet | tablet sig one
These are all generic terms in the medical domain. If we look at the frequencies for the top term ‘mg’, we see that it occurs dozens (\(>30\)) of times in each of the discharge summaries in our set. Although it is also frequent in the background collection of discharge summaries (1,266 occurrences on a total term count of 194,406), its high frequency in the foreground collection still makes it a good term according to the term scoring functions, which all have term frequency as their most important component. More specific terms, such as medication names (e.g. glipizide, risperidone), occur lower in the term lists; their absolute frequencies are much lower: below 5. It seems that all methods are hampered by the small collection size (609 words on average per discharge summary), combined with the semi-structured nature of the texts, in which there is a lot of repetition of technical phrases such as ‘mg po’ and ‘sig one’.
Discussion: What is the influence of the collection size?
In Sect. 5.1.1 we studied the effect of collection size for a use case with a human-defined ground truth: Author Profiling. We found that larger collections lead to better terms. PLM gives the best results for collections smaller than 5,000 words, while both KLP and C-Value perform better than any of the informativeness methods for collections larger than 20,000 words. KLI and FP reach their maximum effectiveness at a collection size of 20,000 words; PLM at 10,000 words and CB at 5,000 words. The poorest performing method is CB. This is the only informativeness method that does not exploit a background collection for calculating the informativeness of terms, but instead uses the set of frequent terms in the foreground collection as a proxy for a background collection. A method that does not require a background collection could be appealing, because it eliminates the need to choose a background collection, but apparently the set of frequent terms from the foreground itself is a weak background model. This confirms our hypothesis:
Hypothesis: We expect that larger collections will lead to better terms for all methods, because the term frequency criterion is harmed by sparseness. In addition, we expect that PLM is best suited for small collections, because the background collection is used for smoothing the (sparse) probabilities for the foreground collection. Although CB was designed for term extraction from small collections without any background corpus, we do expect it to suffer from sparseness, because the co-occurrence frequencies will be low for small collections. We expect KLIP and C-Value to be best suited for larger collections because of the sparseness of multi-word terms. The same holds for FP, which is similar to KLIP, and was developed for corpus profiling.
In Sect. 5.1.2, we found that all methods are hindered by small collection sizes (a few hundred words): the absolute frequencies of specific terms are low, and one or two additional occurrences of a term make a large relative difference.
In order to provide more insight into the effect of corpus size on term extraction performance, we investigated the type-token ratios for the Author Profiling and Medical Query Expansion collections. Type-token ratio (TTR) is a measure of lexical variety: it gives the ratio between the number of unique words (types) and the total number of words (tokens) in a corpus. It has been reported before that TTR is related to corpus size: the larger the corpus, the lower the TTR. A high type-token ratio indicates that many terms occur only once, as a result of which the frequency criterion bears little relevance. Since the frequency criterion is central to all term scoring methods, we would expect the methods to perform poorly on collections with a high TTR. Figure 3 shows the TTR as a function of the corpus size for both collections. The TTR graphs confirm the relation between TTR and corpus size. They show that the Medical Query Expansion collections have a high type-token ratio, 0.59 on average, at an average corpus size of 609 words. The TTR for the Author Profiling collections at this corpus size is similar: the gray line is very close to the black dots. In Fig. 1, we see that at this corpus size all term scoring methods perform poorly relative to their performance at the maximum corpus size: between 0.05 and 0.20, compared to between 0.18 and 0.53 at their maximum.
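The TTR computation itself is straightforward; a minimal sketch, assuming whitespace tokenization of lowercased text and measuring the ratio on increasingly large prefixes of a collection (as in Fig. 3):

```python
def type_token_ratio(tokens):
    # ratio between unique words (types) and total words (tokens)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def ttr_curve(tokens, sizes):
    # TTR at increasing corpus sizes, e.g. sizes = [100, 500, ..., 50000]
    return {n: type_token_ratio(tokens[:n]) for n in sizes if n <= len(tokens)}
```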
This analysis confirms our finding that the term scoring methods all perform poorly on small corpora. We speculate that this is caused by the prominence of the frequency criterion in all methods: for small collections, term frequency is a weak variable, since most terms occur only once or a few times.
What is the influence of the background collection?
The choice of the background collection depends on the language and domain of the foreground collection, and on the purpose of the term extraction. In this section, we evaluate the effect of the background corpus in three informativeness methods (PLM, KLIP (KLI) and FP), for two collections: Personalized Query Suggestion, where we compare a generic and a domain-specific background corpus, and QUINN, where we compare the use of an external background corpus (a Dutch news corpus) and the use of a topic-specific collection: an older subcollection of documents for the same query.
Comparing methods with different background corpora in the personalized query suggestion collection
We first investigate the effect of the parameter \(\lambda\) in the PLM method. \(\lambda\) determines how strongly the background collection influences the smoothing of the term probabilities for the foreground collection: the lower \(\lambda\), the larger the influence of the background frequencies. We extract terms from the subcollection of relevant documents using PLM, with two different background collections: the iSearch collection (which would be the ‘natural’, domain-specific background corpus for this collection) and COCA (which is an external corpus of general language).
We use topics 001–031 from the iSearch data to optimize the parameter \(\lambda\), investigating values ranging from 0.0001 to 1.0. The results are in Fig. 4. Note that \(\lambda =1.0\) is the setting in which the background corpus frequencies are not used at all and the algorithm does not change the initial values of P(t|D). The plot shows that (a) Mean Average Precision is low for this collection, because the ground truth is very strictly defined (we did not collect relevance assessments for all returned terms); (b) iSearch as background corpus seems to give better results than COCA, but this difference is not significant (for the \(\lambda\)-value with the largest difference, \(\lambda =0.01\), a paired t test on the AP scores for individual topics gives \(p=0.263\) for the difference between COCA and iSearch); (c) the effect of \(\lambda\) is almost negligible for COCA, but shows a peak for iSearch at 0.01.
We investigated the output of the EM algorithm over the iterations in order to find out why \(\lambda\) has little effect on these data. We see that for most topics, only two or three iterations are needed for the estimated probabilities to converge. We speculate that, since the most informative terms converge very quickly, the contrast between their foreground and background frequencies is apparently large enough for them to receive a high probability, independent of the weight of the background corpus.
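For reference, the sketch below shows the EM estimation as we understand it from Hiemstra et al. (2004); the variable names and convergence check are illustrative assumptions. With \(\lambda =1.0\) the background is ignored and the initial estimates remain unchanged, which matches the description above.

```python
from collections import Counter

def plm(fg_tokens, p_bg, lam=0.01, max_iters=50, tol=1e-6):
    """fg_tokens: foreground collection as a token list (hypothetical input);
    p_bg: background probabilities per term."""
    tf = Counter(fg_tokens)
    total = sum(tf.values())
    p_fg = {t: c / total for t, c in tf.items()}           # initial P(t|D)
    for _ in range(max_iters):
        # E-step: expected term counts attributed to the foreground model
        e = {t: tf[t] * (lam * p_fg[t]) /
                ((1 - lam) * p_bg.get(t, 1e-9) + lam * p_fg[t])
             for t in tf}
        # M-step: renormalize to obtain the new P(t|D)
        norm = sum(e.values())
        new_p = {t: v / norm for t, v in e.items()}
        converged = max(abs(new_p[t] - p_fg[t]) for t in tf) < tol
        p_fg = new_p
        if converged:               # in our data, typically after 2-3 iterations
            break
    return p_fg
```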
In the remainder of this section, we use \(\lambda =0.01\) for PLM. For KLIP, we set \(\gamma =1.0\) because we evaluate only the informativeness component and do not use the phraseness component. We use topics 032–066 from the iSearch data to compare the methods. The results are in Table 8.
Table 8 The effect of the background corpus in three different informativeness methods, for the Personalized Query Suggestion collection, in terms of Mean Average Precision
Table 8 shows that the domain-specific iSearch corpus gives better results than the generic COCA for all three methods. For FP, this difference is significant at the 0.05 level. The differences between the three methods PLM, FP and KLIP are not significant at the 0.05 level: a paired t test for the largest difference (between KLIP and PLM with iSearch as background collection) gives \(p=0.111\). Table 9 illustrates the output of the FP method with the two different background corpora. Many terms overlap, although their ranking is different.
Table 9 Example output of FP with iSearch and COCA as background corpus for the Personalized Query Suggestion collection: the top-10 terms extracted from the relevant documents in the iSearch collection for one topic (045), “Models of emerging magnetic flux tubes”
In Sect. 5.2.3 we come back to these results and provide some more insight on the effect of the background collection.
Comparing methods with different background corpora in the QUINN collection
For the QUINN collection, we compare two different background corpora for extracting potential query terms from news articles of the last 30 days for a given query:
(a) an older result set for the same query: all news articles matching the query that were published between 60 and 30 days ago;
(b) a generic news collection: since the QUINN collection is Dutch, we use the newspaper section of the SoNaR corpus (Oostdijk et al. 2008), 50 million words in total, for this purpose.Footnote 14
Of these two corpora, (a) is topic-related and thereby highly domain-specific, even more than the iSearch corpus was for Personalized Query Suggestion in academic search (see the previous section), and (b) is more generic but from the same genre as the foreground collection (Dutch newspaper texts).
We use both background corpora for extracting terms with PLM, FP and KLIP (\(\gamma =0.5\)) and evaluate the quality of the extracted terms using two user-based evaluation measures: the percentage of searches with a term from the top-5 selected by the user, and the percentage of searches with at least one relevant term (a relevance rating \(\ge 4\) on a 5-point scale) in the top-5. The results are in Figs. 5 and 6.
The figures show consistently better results for the generic newspaper background corpus than for the topic-related background corpus. A McNemar’s test for paired binary samplesFootnote 15 shows that the difference between the two corpora is significant at the 0.01 level for PLM (\(p=0.0036\)) and at the 0.05 level for FP (\(p=0.037\)) and KLIP (\(p=0.034\)). It is surprising that the generic background corpus gives better results than the domain-specific corpus, considering the results in the previous subsection, where the domain-specific iSearch corpus seemed to give better results than the generic COCA. We had a detailed look at the terms generated using each of the two background corpora. Two example queries with their term suggestions are shown in Table 10.
In the example on Biodiversity, the terms generated with the two background corpora show considerable overlap, but in the example on ICT policy, the two term lists are completely different. In both cases, the terms generated with the topic-related background corpus are more specific than the terms generated with the generic background corpus. In other words, comparing the news from the last 30 days to a generic newspaper corpus leads to terms that are relevant for the topic in general, while comparing it to the news on the same topic from 60 to 30 days ago leads to terms that are very specific for the most recent developments on the topic. For example, the second topic contains a few names of places (Westrozebeke, Moorslede) that were in the news during the last 30 days. This leads us to the conclusion that a domain-specific background corpus is good, but that the domain should not be too narrow (such as a corpus covering a single news topic).
Table 10 Generated terms for two example topics using PLM with two different background corpora
Discussion: what is the influence of the background collection?
Since the term scoring methods were designed for different purposes, the choice of background corpus and the term scoring method are expected to be interdependent. Specifically, PLM was designed for modelling a single document in the context of a larger collection, while KLIP and FP were designed for contrasting two collections. Hence our hypothesis:
Hypothesis: Three methods use a background collection: PLM, FP and KLIP. Of these, we expect PLM to be best suited for term extraction from a foreground collection (or document) that is naturally part of a larger collection, because the background collection is used for smoothing the probabilities for the foreground collection. FP and KLIP are best suited for term extraction from an independent document collection, in comparison to another collection. KLIP is expected to generate better terms than FP because KLIP’s scoring function is asymmetric: it only generates terms that are informative for the foreground collection.
With term extraction for query suggestion in the scientific domain (the Personalized Query Suggestion collection, Sect. 5.2.1), we had relatively small collections (2250 words on average per topic) that are part of the background collection. For this type of collection, we would expect PLM to outperform FP and KLIP. The Mean Average Precision results (Table 8) seem to indicate that PLM is indeed slightly better than the other methods, but these differences are not significant. This is probably due to the strictly defined ground truth (a small set of human-formulated query terms). Throughout all experiments we have seen that FP and KLIP perform similarly. We already noted in Sect. 3.2 that the two methods are similar to each other. The asymmetry of the KLIP function explains why its performance is slightly better than FP in Fig. 1. This confirms the second part of our hypothesis.
We investigated the effect of the domain-specificity of the background corpus by comparing two background collections of different specificity for two tasks: for Personalized Query Suggestion we compared a domain-specific background collection of scientific literature (iSearch) to a background collection of general English (COCA), and for QUINN we compared a topic-specific background collection to a more general background collection of the same genre (the Dutch-language newspaper collection SoNaR). In the first case we found that the domain-specific background collection gave better results than the general-domain collection; in the second case we found that the more general background collection gave convincingly better results than the highly specific corpus. This suggests that a background collection in the same language and genre as the foreground collection (such as English scientific articles or Dutch newspaper articles) gives good results, but that a topic-specific background corpus is a step too far in terms of domain-specificity.
In order to provide more insight into the effect of the background collection on the generated terms, we analyzed the coverage of the background collections for the generated terms. Coverage is relevant for term weighting because terms that do not occur in the background corpus are scored on the frequency criterion only: the absence of a term from the background collection implies that the term is highly specific to the foreground collection. For relevant terms that are highly specific to a topic, this is desirable. However, if the coverage of the background corpus is too low, less relevant terms also receive high scores merely because of their specificity relative to the background collection. We investigated the coverage of the background collections for the two tasks in this section.
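The coverage analysis amounts to checking, per topic, which of the top-ranked generated terms occur in the background word list; a minimal sketch with hypothetical inputs (`top_terms` mapping topics to ranked term lists, `bg_vocab` a set built from the background word list):

```python
def coverage(top_terms, bg_vocab, k=10):
    per_topic = {}
    for topic, terms in top_terms.items():
        top_k = [t.lower() for t in terms[:k]]
        per_topic[topic] = sum(t in bg_vocab for t in top_k) / len(top_k)
    return per_topic   # averaging over topics gives figures such as those reported below
```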
In the case of Personalized Query Suggestion (Sect. 5.2.1) with iSearch as background corpus, all candidate terms are part of the background collection, since the foreground collection (set of retrieved documents) is a subset of the background collection. In the case of COCA as background corpus, not all candidate terms are part of the background collection, because COCA is an independent corpus. We compared the COCA word list with the top-ranked (top-10) terms that were generated by KLIP, FP and PLM for each of the topics. We found that on average, 76 % of the generated terms are present in COCA. Examples of these terms are electron, mirror, cavity and pressure. Examples of terms that are not in COCA are ferromagnetic, waveguides, excitons and nanoclusters. Figure 7 shows the proportion of the top-10 terms generated for the topics 032–065 that are present in the background collection COCA. For some topics, the terms are highly specific, e.g. topic 062 with terms such as magnetohydrodynamic, while for other topics the generated terms are much more frequent in general language, e.g. topic 042 with terms such as electricity and energy.
In the case of QUINN (Sect. 5.2.2), we first investigated the coverage of the SoNaR corpus: we compared the SoNaR word list with the top-ranked (top-10) terms that were generated by KLIP, FP and PLM for each of the topics. We found that on average 71 % of the generated terms are present in SoNaR. Examples of these terms are rotterdam, studenten (‘students’) and lachgas (‘laughing gas’). The terms that are not in SoNaR are very specific terms such as schaliegas (‘shale gas’), spurious multi-word phrases such as vooral kool (‘mainly cole’), and proper names such as robin batens and tsipras. We then investigated the coverage of the topic-specific background collections (for each query, a set of news articles retrieved for the same query but published earlier than the articles in the foreground collection). We found that on average only 51 % of the terms are present in this specific background collection. Examples are again terms such as rotterdam, studenten (‘students’) and ziekenhuis (‘hospital’). The terms that are not in the topic-specific background collection are in some cases again proper names such as annelies and spurious multi-words such as greenpeace roept (‘greenpeace calls’), but also terms that are more general yet did not occur in the small background collection for the topic, such as zee (‘sea’), wall street and goede doelen (‘charity funds’). Figure 8 compares the coverage of both background corpora used for QUINN for 15 topics. It shows a large variation between topics, just as in the case of Personalized Query Suggestion, but for all topics the coverage of the SoNaR corpus is larger than or equal to that of the topic-specific corpus.
A summary of the coverage of the background corpora and their quality (for one example method, PLM) is shown in Table 11. The table shows that for both tasks, the background collection with the highest coverage gives the best results.
Table 11 Summary of the coverage of the background corpora and their quality (for one example method, PLM)
What is the influence of multi-word phrases?
We investigate the balance between informativeness and phraseness for the two collections for which ground truth terms are available: Author Profiling and Personalized Query Suggestion. We run KLIP on both collections. In Personalized Query Suggestion, we use the iSearch collection as background corpus. We evaluate values of \(\gamma\) in Eq. 13 ranging from 0.0 (phraseness only) to 1.0 (informativeness only) in steps of 0.1. The results are in Fig. 9.
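A minimal sketch of this interpolation, assuming that Eq. 13 linearly combines the informativeness (KLI) and phraseness (KLP) components with weight \(\gamma\); terms are represented as tuples of words and the probability dictionaries are hypothetical (terms missing from the background need a fallback, cf. the coverage analysis in Sect. 5.2.3):

```python
import math

def kli(term, p_fg, p_bg):
    """Informativeness: pointwise KL of the foreground vs the background."""
    return p_fg[term] * math.log(p_fg[term] / p_bg.get(term, 1e-9))

def klp(term, p_fg):
    """Phraseness: pointwise KL of the n-gram vs its component unigrams."""
    p_independent = math.prod(p_fg[(w,)] for w in term)
    return p_fg[term] * math.log(p_fg[term] / p_independent)

def klip(term, p_fg, p_bg, gamma=0.5):
    # gamma = 1.0: informativeness only; gamma = 0.0: phraseness only
    return gamma * kli(term, p_fg, p_bg) + (1 - gamma) * klp(term, p_fg)
```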
Again, we see that mean average precision is much lower for the Personalized Query Suggestion collection than for the Author Profiling collection. This is because the ground truth is very strictly defined in the Personalized Query Suggestion collection. More interestingly, the effect of \(\gamma\) is very different between the two collections: the phraseness component should be given much more weight in the Author Profiling collection than in the Personalized Query Suggestion collection. This is surprising because the proportion of multi-word phrases in the ground truth set is very similar for both collections. We had a more detailed look at the output of KLIP for both collections to see what causes this difference. Table 12 shows the top-5 terms for one user in the Author Profiling collection and one topic in the Personalized Query Suggestion collection, ranked using KLIP with different values of \(\gamma\).
Table 12 Example output of KLIP with different values of \(\gamma\) for one user in the Author Profiling collection and for one topic in the Personalized Query Suggestion collection
The table shows that in the Author Profiling collection, multi-words have already disappeared from the top-5 when \(\gamma =0.3\), while in the Personalized Query Suggestion collection, three out of five terms are still multi-words for the same value of \(\gamma\). Even if we set \(\gamma =1.0\) (informativeness only), the top-10 terms for the example topic still contain three multi-words.Footnote 16 A more detailed look at the output for both collections reveals that, over all users and topics, more multi-words are extracted from the Personalized Query Suggestion data than from the Author Profiling data (also with other term scoring methods). The most probable explanation is that each topic in the Personalized Query Suggestion data covers a very narrow domain. We extract terms from the documents that are relevant to this narrow domain, and in these documents some multi-word terms (e.g. ‘biological cells’) are highly frequent, not only compared to other multi-word terms but even compared to single-word terms.
Discussion: What is the influence of multi-word phrases?
In Sect. 5.1.1, we showed that the phraseness methods outperform the informativeness methods for author profiling. The reason is that in this collection, the human-defined ground truth has a large proportion of multi-word terms. The results confirm our hypothesis:
Hypothesis: We expect C-Value and KLIP to give the best results for collections and use cases where multi-word terms are important. CB, PLM and FP are also capable of extracting multi-words but the scores of multi-words are expected to be lower than the scores of single-words for these methods. On the other hand, C-Value cannot extract single-word terms, which we expect to be a weakness because single-words can also be good terms.
When comparing informativeness methods and phraseness methods for a given collection, two aspects play a role: multi-word terms are often considered to be better terms than single-word terms (see Sect. 5.3), but multi-word terms have lower frequencies than single-word terms (see Sect. 3.1), which makes them sparse in small collections. In the case of a small collection, consisting of one or a few documents, the frequency criterion will select mostly single-word terms; for that reason, KLIP performs better than C-Value. In addition, we saw in Sect. 5.1.1 that KLP without the informativeness criterion also outperforms C-Value. As we pointed out in Sect. 3.3, the two methods select terms on the basis of different criteria: in C-Value, the score for a term is discounted if the term is nested in frequent longer terms (e.g. the score for ‘surgery clinic’ would be discounted because it is embedded in the relatively frequent term ‘plastic surgery clinic’). In KLP, on the other hand, the frequency of the term as a whole is compared to the frequencies of the unigrams that it contains; the intuition is that relatively frequent multi-word terms that are composed of relatively low-frequency unigrams (e.g. ‘ad hoc’, ‘new york’) are the strongest phrases. We found that the KLP criterion tends to generate better terms than the C-Value criterion.
In Sect. 5.3 we saw that if we combine informativeness and phraseness in one term scoring method, the optimal weight of the two components depends on the collection at hand. In general, the importance of multi-word phrases depends on three factors:
- Language. In compounding languages such as Dutch and German, noun compounds are written as a single word, e.g. boottocht ‘boat trip’. In English, these compounds are written as separate words. As a result, the proportion of relevant terms that consist of multiple words is higher for English than for a compounding language such as Dutch. For example, the proportion of multi-words in the user-formulated Boolean queries for the Dutch collection QUINN is only 16 %, whereas the proportions of multi-word phrases in the ground truth term lists for Author Profiling and Personalized Query Suggestion are 50 and 57 % respectively. This implies that (a) we cannot generalize the results in this paper to all languages, and (b) although tuning the \(\gamma\) parameter is recommended for any new collection, it is even more important in the case of a new language.
- Domain. In the scientific domain (in our case the Author Profiling and Personalized Query Suggestion collections), more than half of the user-selected terms are multi-word terms. A method with a phraseness component (KLIP with a low \(\gamma\), or C-Value) is therefore the best choice for collections of scientific English documents.
- Use case and evaluation method. For Author Profiling, multi-word terms are highly important if the profile is meant for human interpretation (such as keywords in a digital library or on an author profile): human readers prefer multi-word terms because of their descriptiveness. This implies that when terms are meant for human interpretation, a method with a phraseness component (KLIP with a low \(\gamma\), or C-Value) is the best choice. On the other hand, when terms are used as query terms, single-word terms might be more effective, and PLM or FP would be preferable.