The uncertain representation ranking framework for concept-based video retrieval
Abstract
Concept-based video retrieval often relies on imperfect and uncertain concept detectors. We propose a general ranking framework to define effective and robust ranking functions, through explicitly addressing detector uncertainty. It can cope with multiple concept-based representations per video segment and it allows the reuse of effective text retrieval functions which are defined on similar representations. The final ranking status value is a weighted combination of two components: the expected score of the possible scores, which represents the risk-neutral choice, and the scores’ standard deviation, which represents the risk or opportunity that the score for the actual representation is higher. The framework consistently improves the search performance in the shot retrieval task and the segment retrieval task over several baselines in five TRECVid collections and two collections which use simulated detectors of varying performance.
Keywords
Representation uncertainty · Concept-based representation · Video retrieval
1 Introduction
Concept-based video retrieval has many advantages over other content-based approaches (Snoek and Worring 2009). In particular, it is more straightforward to define ranking functions on concept-based representations than for most other content-based representations (Naphade et al. 2006). For example, the definition of a ranking function for the query “Find me tigers” is intuitively more straightforward based on the concept Animal in a (video) segment ^{1} than based on the color distribution in an example image. As the current state-of-the-art in automatic concept detection is not mature enough for ranking functions directly using the binary concept labels occurs/absent (Hauptmann et al. 2007), concept-based search engines use the confidence score of a detector that the concept occurs. However, the uncertainty introduced by the use of confidence scores makes the definition of effective and robust ranking functions again more difficult. This paper presents a general framework for the definition of concept-based ranking functions for video retrieval that fulfill these requirements.
Research in concept-based retrieval currently focuses on the retrieval of video shots, which are segments of roughly five seconds in length. According to Kennedy et al. (2008), the main problem here is the definition of query-specific ranking functions, which are often modeled as weighted sums of confidence scores. But the estimation of weights based on the semantic distance of a concept to the query or on relevance feedback has proven difficult, which leads to poor performance (Aly et al. 2009). Another approach learns weights for a set of query classes based on relevance judgments for training queries (Yan 2006). However, gathering relevance judgments for training queries is expensive and it is unclear how to define a suitable set of query classes. Additionally, although de Vries et al. (2004) find that users do not only search for shots but also for longer segments, concept-based search engines do not support this retrieval task. A likely reason is that a single confidence score per segment does not sufficiently discriminate relevant from non-relevant segments. However, it is not straightforward to define a more discriminative document representation based on confidence scores. Therefore it is an important challenge to come up with a framework to define ranking functions for varying retrieval tasks that are effective for arbitrary queries.
The performance of detectors changes significantly with the employed detection technique and the considered collection (Yang and Hauptmann 2008). If a ranking function strongly depends on a particular distribution of confidence scores, its performance varies, which is clearly undesirable. For example, the confidence scores of the concept Animal in relevant shots for the query “Find me tigers” can be high in one collection and low in another collection. Now, if a ranking function assumes that confidence scores for the concept Animal in relevant shots are high, its performance will be poor for the second collection. Because current ranking functions are weighted sums of confidence scores, they rely on the weight estimation to adapt the weights according to the score distribution of the considered collection. However, how could we estimate these weights for arbitrary detectors and collections? Therefore it is also an important challenge to define ranking functions that are robust over detectors of varying performance.
In summary, we address the challenge of defining ranking functions such that:
1. they are effective for arbitrary queries, and
2. they are robust over detector techniques and collections.
To demonstrate that the framework produces effective and robust ranking functions, we show that this is the case for the shot retrieval task and the segment retrieval task. Note that the ranking functions used for these tasks originate from ideas which we proposed earlier. In Aly et al. (2008) we propose to rank shots by the probability of relevance given the confidence scores, marginalizing over all possible concept occurrences. The ranking function obtained through marginalization is equal to the expected score used in the URR framework. The expected score allows us to additionally model the risk of choosing a certain score. Furthermore, in Aly et al. (2010) we propose a ranking function for segment retrieval, where the idea of ranking by the expected score and the scores’ standard deviation is used for the first time, for a concept language model ranking function and a document representation in terms of concept frequencies. The URR framework generalizes this idea to arbitrary ranking functions and representations.
The remainder of this paper is structured as follows. First in Sect. 2 related work on treating uncertainty in information retrieval is presented. In Sect. 3 we describe the proposed URR framework. In Sects. 4 and 5 the framework is applied to shot and segment retrieval respectively. Then Sect. 6 describes the experiments which we undertook to evaluate the URR framework. Section 7 discusses the experimental results. Finally, Sect. 8 presents the conclusions.
2 Related work
In this section we describe how related work approaches uncertainty, both in concept-based video retrieval and in text retrieval. Note that there are significant bodies of research on the storage of uncertainties in databases, see for example Benjelloun et al. (2006), and on the exploitation of uncertain knowledge representations for the inference of new knowledge, see for example Ding and Peng (2004), which lie outside the scope of this paper.
2.1 Concept-based video ranking functions
Most concept-based video ranking functions use confidence scores of detectors built from support vector machines. To ensure comparability of confidence scores among concepts, confidence scores are usually normalized. Platt (2000) provides a method to transform a confidence score into a posterior probability of concept occurrence given the confidence score, which we refer to as probabilistic detector output.
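Platt’s method fits a sigmoid that maps a raw SVM confidence score s to a posterior \(P(C|s) = 1/(1+\exp(A s + B))\). The sketch below, a simplification under stated assumptions, fits A and B by plain gradient descent on the negative log-likelihood of held-out labeled scores; Platt’s original procedure uses a Newton-style solver and smoothed target labels, which we omit here.

```python
import math

def fit_platt(scores, labels, lr=0.01, iters=2000):
    """Fit P(C|s) = 1 / (1 + exp(A*s + B)) by gradient descent on the
    negative log-likelihood. Simplified sketch: Platt's paper uses a
    Newton-style solver and regularized target probabilities."""
    A, B = 0.0, 0.0
    for _ in range(iters):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            # gradient of the negative log-likelihood w.r.t. A and B
            gA += (y - p) * s
            gB += (y - p)
        A -= lr * gA / len(scores)
        B -= lr * gB / len(scores)
    return A, B

def platt_probability(s, A, B):
    """Probabilistic detector output for a raw confidence score s."""
    return 1.0 / (1.0 + math.exp(A * s + B))
```

After fitting on scores with known concept labels, high raw scores map to posteriors near one and low scores to posteriors near zero, making outputs comparable across concepts.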
In uncertainty class UC1, ranking functions (indicated by score) take confidence scores as arguments. Most ranking functions are weighted sums or products of confidence scores, where the used weights carry no particular interpretation (Snoek and Worring 2009). Yan (2006) proposes the Probabilistic Model for combining diverse knowledge sources in multimedia. The proposed ranking function is a discriminative logistic regression model, calculating the posterior probability of relevance given the observation of the confidence scores. Here the confidence score weights are the coefficients of the logistic regression model. The ranking functions of uncertainty class UC1 mainly have the problem that they require knowledge about the confidence score distributions in relevant shots, which is difficult to infer. Additionally, if a concept detector changes, the distribution of confidence scores changes, making existing knowledge obsolete.
In uncertainty class UC2, ranking functions are based on the (inverse) rank of the confidence scores within the collection (McDonald and Smeaton 2005; Snoek et al. 2007). As only the ranks of confidence scores are taken into account, estimating weights for this uncertainty class only requires knowledge over the distribution of confidence scores in relevant shots relative to other shots. Otherwise UC2 suffers from the same drawbacks as UC1.
In uncertainty class UC3, ranking functions take a vector of the most probable concept representation as arguments. To the best of our knowledge, no method of this class has been proposed in concept-based video retrieval so far, most likely due to the weak performance of concept detectors. Nevertheless, we include this uncertainty class in our discussion because methods of this class have been used in spoken document retrieval, where the most probable spoken sentence is considered (Voorhees and Harman 2000), and once concept detectors improve, ranking functions from this class might become viable.
In uncertainty class UC4, ranking functions use a particular concept representation, not necessarily the most probable, together with its probability. Zheng et al. (2006) propose the pointwise mutual information weight (PMIWS) ranking function. As we showed in Aly et al. (2008), the PMIWS can be seen to rank by the probability of relevance given the occurrence of all selected concepts multiplied by the probability that these concepts occur in the current shot. The main problem of instances of uncertainty class UC4 is that concepts which only occur sometimes in relevant shots cannot be considered. To see this, let us assume perfect detection, a concept that occurs in 50 % of the relevant shots, and a ranking function that only rewards shots in which this concept occurs. Here, relevant shots, in which the concept does not occur, receive zero score.
In uncertainty class UC5, ranking functions take the expected values of concept occurrences as parameters. Li et al. (2007) propose an adaptation of the language modeling framework (Hiemstra 2001) to concept-based shot retrieval. We show in (Aly 2010, p. 32) that the ranking function by Li et al. (2007) can also be interpreted as using the expected concept occurrence in the language modeling framework where concepts (terms) either appear or not. Instead of focusing on one representation, as done by UC3 and UC4, this uncertainty class combines all possible representations into the expected values of a representation, which is then used in a ranking function. The ranking functions of uncertainty class UC5 are limited to arguments of real numbers because they are defined on expectations, which are real numbers. But some existing effective probabilistic ranking functions, for example the binary independence model (Robertson et al. 1981), are defined on binary arguments, and therefore cannot be used. Furthermore, the ranking functions in uncertainty class UC5 result in a single score, which abstracts from the uncertainty that is involved by using this result.
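A further limitation of scoring the expected representation is that, for a non-linear ranking function, the score of the expectation differs from the expected score. The toy calculation below illustrates this with a single concept and a language-model-style score over a two-shot segment; the smoothing weight, collection prior, and scoring form are illustrative stand-ins, not the exact formulas of the cited papers.

```python
import math

# A segment of two shots; one concept with P(occurrence | o) = 0.9 per shot.
p, lam, p_coll = 0.9, 0.1, 0.05
n_shots = 2

def lm_score(cf):
    """Illustrative smoothed language-model score for concept frequency cf."""
    return math.log(lam * cf / n_shots + (1 - lam) * p_coll)

# Distribution over possible concept frequencies (shots assumed independent):
pmf = {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}

# UC5-style: apply the ranking function to the *expected* frequency.
expected_cf = sum(cf * pr for cf, pr in pmf.items())
score_of_expectation = lm_score(expected_cf)

# URR-style: the *expected score* over all possible representations.
expected_score = sum(pr * lm_score(cf) for cf, pr in pmf.items())
```

Because `lm_score` is concave in `cf`, Jensen’s inequality makes the expected score strictly smaller than the score of the expected frequency, so the two approaches can rank documents differently.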
The URR framework proposed in this paper can be seen as a general ranking framework of a new uncertainty class (UC6) of ranking functions that are defined on the distribution of all possible concept-based representations of a document. The URR framework uses a basic ranking function to calculate a score for each possible representation. The final ranking status value of a document is then calculated by combining the expected score and the scores’ standard deviation according to the probability distribution over the possible representations for this document. This procedure has the following advantages. Compared to the uncertainty classes UC1 and UC2, the basic ranking function of the URR framework does not require knowledge about the distribution of confidence scores in relevant segments. In contrast to the uncertainty classes UC3 and UC4, which both only use a single concept-based representation, the URR framework takes into account all possible representations, which reduces the risk of missing the actual representation of a document. Finally, compared to uncertainty class UC5, the basic ranking functions in the URR framework are defined on concept-based representations, which allows us to reuse existing, effective ranking functions from text retrieval. Additionally, the scores’ standard deviation in the URR framework can be seen as a measure of the riskiness of a score, which we show can be used in ranking.
2.2 Uncertainty in text retrieval
We are not the first to address uncertainty in information retrieval; it has been addressed before in text retrieval, for example in probabilistic indexing and in the recently proposed mean-variance analysis framework for uncertain scores, as well as in several other areas. We describe the former two approaches in the following.
2.2.1 Probabilistic indexing
In probabilistic indexing for text retrieval, the assignment of an index term to a document is only probabilistically known. Croft (1981) approaches this uncertainty by ranking documents according to the expected score of the binary independence ranking function (Robertson et al. 1981). However, Fuhr (1989) shows that, although the binary independence ranking function is a rank-preserving simplification of the probability of relevance function, the expected binary independence score is not rank-preserving to the expected probability of relevance score. Instead, Fuhr (1989) ranks by the probability of relevance given the confidences of indexers as a ranking function, marginalizing over all possible index term assignments. This marginalization is equivalent to ranking by the expected probability of relevance, which we use as a ranking component of our URR framework in Sect. 4.
Note that there is a difference in interpretation between the marginalization and the expected score used in the URR framework, which we discuss in the following. The marginalization approach considers for each document the probability of relevance of any document with the same indexer confidences, which are similar to confidence scores in concept-based video retrieval. On the other hand, the URR framework uses the expected score of a particular document. This allows us to consider the scores’ standard deviation, which represents the risk or opportunity of ranking a document by its expected score. Additionally, Fuhr assumes that the true index term assignments of a document are always unknown, whereas in the URR framework concept occurrences are only uncertain because of the uncertainty of detectors. Indeed, the URR framework could be extended to handle the case where the occurrences of some concepts are known, which we propose for future work. In addition to the expected score, the URR framework considers a component to represent the risk inherent to a retrieval model when ranking a document.
2.2.2 Meanvariance analysis
Wang (2009) proposes the mean-variance analysis framework for managing uncertainty in text retrieval, which is based on the Portfolio Selection Theory (Markowitz 1952) in finance. We believe that the processes in finance are more intuitive; therefore we first describe the Portfolio Selection Theory and describe its application to text retrieval afterwards.
1. The expected win, E[d_j.Win] (“What win is to be expected from company d_j?”).
2. The variance of the win, var[d_j.Win] (“How widely do the possible wins vary?”).
3. The covariance between the win of company d_j and any other company d_i, cov[d_j.Win, d_i.Win] (“How does the win of company d_j influence the win of company d_i?”).
The URR framework uses a similar ranking algorithm to the one proposed in Eq. (3), using the scores’ standard deviation instead of their variance. In the mean-variance analysis, the reason for the uncertainty of a document’s score is unspecified. On the other hand, in the URR framework the scores’ standard deviation originates from the uncertain document representation. Similar to the mean-variance analysis, the URR framework could also take into account correlations between document representations, to influence the standard deviation of the score. For example, videos usually follow a story and the occurrences of concepts in nearby shots are correlated (the fact that an Animal occurs in a shot influences the probability of an Animal in a nearby shot). Yang and Hauptmann (2006) are the first to explore the exploitation of such correlations in videos. As until now only oracle models trained on the test collection were able to achieve significant improvements, we leave the consideration of covariances, although promising, to future work.
3 The uncertain representation ranking framework
This section describes the URR framework, which ranks segments by considering uncertain concept-based representations in a similar way as the mean-variance analysis framework (Wang 2009)^{3}.
3.1 Intuitive example
A search engine in a risk-neutral setting will rank document d_2 above document d_1 because it has a higher expected score. However, similar to the analysts in the previous section, the search engine in a risk-loving setting might prefer document d_1 over document d_2 because of the higher probability that the document has the highest score of 60. In the following section we define the URR framework, which generalizes this intuitive case to arbitrary score functions defined on arbitrary concept representations.
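The intuition can be made concrete with a small calculation. The score distributions below are illustrative (the example table from the original article is not reproduced here); they only mirror its spirit: d_1 has a chance at the top score of 60 but may also score 0, while d_2 has a safe middling score.

```python
# Illustrative score distributions: {score: probability}
d1 = {60: 0.5, 0: 0.5}     # risky: may reach the top score of 60
d2 = {35: 0.9, 30: 0.1}    # safe: always a middling score

def expected(dist):
    return sum(s * p for s, p in dist.items())

def std(dist):
    mu = expected(dist)
    return sum(p * (s - mu) ** 2 for s, p in dist.items()) ** 0.5

def rsv(dist, b):
    """Retrieval status value: expected score plus b times the scores'
    standard deviation (b > 0: risk-loving, b = 0: risk-neutral,
    b < 0: risk-averse)."""
    return expected(dist) + b * std(dist)

# Risk-neutral (b = 0): d2 ranks above d1 (34.5 vs. 30).
# Risk-loving (b = 0.5): d1's chance at the top score wins out (45 vs. 35.25).
```

The risk parameter b thus decides whether the search engine trades the safe expected score of d_2 against the opportunity that d_1 actually achieves the highest score.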
3.2 Definitions
Because the URR ranking framework is not specific to a particular type of feature, let \({\bf F}=(F_1,\ldots,F_n)\) be the considered representation of documents for the current query, consisting of n features (or representations). Formally, each feature F_i is a random variable, a function from documents to feature values. For example, the ranking functions in this paper consider concept occurrences, denoted by C, and concept frequencies, denoted by CF, as features. For the query “Find me tigers”, a search engine might consider the frequencies of the concepts Animal and Jungle, CF = (CF_1, CF_2), as features, where CF_1(d) and CF_2(d) yield the frequency of the concept Animal and the concept Jungle in document d respectively.
Furthermore, let \(score: rng({\bf F}) \rightarrow \mathbb{R}\) be a ranking function which maps known feature values to scores, where \(rng(\cdot)\) denotes the range of a function. For example, the simple ranking function in Eq. (4), \(score({\bf f} \in rng({\bf F}))=\sum\nolimits_i{w_i\, f_i}\), where w_i is the weight of feature value f_i, is such a score function. Note that we adopt the common notation of random variables and denote random variables and functions in the same way as their range, therefore leaving out \(rng(\cdot)\) in the following (Papoulis 1984).
Because the feature values of documents are uncertain, we introduce the random variable d.F for the feature values of document d. Furthermore, let d.S = score(d.F) be the random variable for the score of document d which results from the application of the ranking function score on d’s uncertain feature values d.F. For example, if a segment contains m shots and the considered representation consists of n concept frequencies, the random variable of the uncertain concept frequencies d.CF ranges over (m+1)^n possible frequency combinations, and the random variable d.S ranges over the scores obtained from the application of score on each combination.
It is important to note the difference between the random variables F and the ranking function score on the one hand, and their document-specific counterparts d.F and d.S on the other hand. For example, score(F(d)) is the actual score of document d based on the known features F(d). On the other hand, d.F and d.S are random variables for the possible feature values and their corresponding scores of document d.
We denote the posterior probability of a document d having representation values \(f \in d.{\bf F}\) given the confidence scores o as \(P(d.{\bf F} = f \mid o)\), which we use to calculate the expected score and its standard deviation.
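For a small number of binary concept features, the expected score and the scores’ standard deviation can be computed by brute-force enumeration of all \(2^n\) representations. The sketch below assumes, for illustration only, that concepts are independent given the confidence scores, so \(P(d.{\bf F}=f \mid o)\) factorizes over the per-concept posteriors; the framework itself does not require this assumption.

```python
import itertools
import math

def expected_score_and_std(p_concepts, score):
    """Enumerate all 2^n binary concept representations f, weight each by
    P(d.F = f | o) = prod_i P(C_i = f_i | o_i) (independence assumed for
    this sketch), and return (E[d.S], std[d.S])."""
    n = len(p_concepts)
    mean = mean_sq = 0.0
    for f in itertools.product([0, 1], repeat=n):
        pr = 1.0
        for fi, pi in zip(f, p_concepts):
            pr *= pi if fi == 1 else (1.0 - pi)
        s = score(f)
        mean += pr * s
        mean_sq += pr * s * s
    return mean, math.sqrt(max(mean_sq - mean * mean, 0.0))

# Example with the simple weighted-sum score of Eq. (4):
weights = [2.0, 1.0]
linear = lambda f: sum(w * fi for w, fi in zip(weights, f))
mu, sigma = expected_score_and_std([0.8, 0.3], linear)
```

Enumeration is only feasible for small n; Sect. 5 resorts to Monte Carlo sampling when the number of possible representations grows too large.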
3.3 Ranking framework
4 Shot retrieval
In this section we describe an adaptation of the URR framework to shot retrieval in which the expected score component is equivalent to the Probabilistic Ranking Framework for Unobservable Binary Events (PRFUBE), which was originally proposed by Aly et al. (2008). In addition to the expected score, we define the scores’ standard deviation. For consistency we use the name PRFUBE for our method for shot retrieval, despite the additional consideration of the scores’ standard deviation.
4.1 Representation and ranking function
4.2 Framework integration
5 Segment retrieval
In this section we describe the Uncertain Concept Language Model (UCLM) ranking function for segment retrieval, which was originally presented in Aly et al. (2010). While the original publication already contained the main ideas of the URR framework, it was specific to document representations of concept frequencies and to the concept language model as a ranking function. In this paper, we describe the UCLM as an instance of the URR framework.
5.1 Representation and ranking function
5.2 Framework integration
6 Experiments
In this section we present the experiments which we undertook to evaluate the performance of the URR framework. We investigated two retrieval tasks in connection with the annual TRECVid evaluation workshop (Smeaton et al. 2006): the automatic shot retrieval task, which is a standard task in TRECVid, and the segment retrieval task, which we proposed earlier to accommodate the user’s need to search for longer segments (Aly et al. 2010). Note that because we focus on purely concept-based search, the performance figures presented in this section are not directly comparable with figures reported elsewhere which also use features such as text and visual similarity.
6.1 Experiment setup
Statistics of the collections used in the experiments
Collection  Shots  Domain  Queries  Detector sets  Number of concepts  Training collection for ADCS 

tv05t  45,765  News  24  MM101  101  tv05d 
tv06t  79,484  News  24  Vireo  374  tv05d 
tv07t  18,142  G.TV  24  Vireo  374  tv05d 
tv08t  35,766  G.TV  48  Vireo  374  tv05d 
tv08t  35,766  G.TV  48  MM09  64  tv07d 
tv09t  61,384  G.TV  24  MM09  64  tv07d 
Before executing a query we first needed to select concepts and estimate the corresponding ranking function parameters. We used Annotation-Driven Concept Selection (ADCS), which showed good performance on several collections (Aly et al. 2009). The ADCS method is based on a collection with known concept occurrences and textual shot descriptions. The probability of a concept occurrence given relevance was estimated by executing the textual query on the shot descriptions and using the known concept occurrences for the estimation of the probability (Aly et al. 2009). The shot descriptions consisted of the automatic speech recognition output together with the corresponding Wikipedia articles of the occurring concepts. We used the general-purpose retrieval engine PF/Tijah (Hiemstra et al. 2006) to rank the shot descriptions in the training collection. The parameter m of the ADCS method states the number of top-ranked shot descriptions we assume to be relevant. For each concept, the method estimates the probability of the concept’s occurrence given relevance, P(C|R). To select concepts, we used these estimates together with the concept priors to calculate the mutual information between a concept and relevance, which was identified by Huurnink et al. (2008) as a measure of usefulness. From the resulting ranked list of concepts, we selected the first n concepts.
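The mutual-information criterion can be sketched as follows for binary concept occurrence C and relevance R, given the estimated P(C|R), the concept prior P(C), and an assumed relevance prior P(R). This is the textbook definition of mutual information; the exact estimator and priors used by ADCS may differ.

```python
import math

def mutual_information(p_c_given_r, p_c, p_r):
    """MI(C;R) for binary C and R, computed from P(C|R), the concept
    prior P(C), and an assumed relevance prior P(R). P(C|not-R) follows
    from P(C) = P(C|R)P(R) + P(C|not-R)(1 - P(R))."""
    p_c_given_nr = (p_c - p_c_given_r * p_r) / (1.0 - p_r)
    mi = 0.0
    for r_prob, c_cond in ((p_r, p_c_given_r), (1.0 - p_r, p_c_given_nr)):
        for c_val in (1, 0):
            p_joint = r_prob * (c_cond if c_val else 1.0 - c_cond)
            p_marg_c = p_c if c_val else 1.0 - p_c
            if p_joint > 0.0:
                mi += p_joint * math.log(p_joint / (p_marg_c * r_prob))
    return mi

# A concept much more frequent in relevant shots than overall carries
# more information about relevance than one near its prior (values
# below are hypothetical):
mi_animal = mutual_information(p_c_given_r=0.8, p_c=0.1, p_r=0.01)
mi_other = mutual_information(p_c_given_r=0.12, p_c=0.1, p_r=0.01)
```

Concepts are then ranked by this value and the top n are selected for the query.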
The performance of current concept detectors is still limited, and the resulting search performance is low compared to, for example, performance figures from text retrieval. Therefore we also used our simulation-based approach (Aly et al. 2012) to investigate the search performance of the considered ranking functions with increased detector performance. This is in line with work reported in Toharia et al. (2009), which artificially varied the quality of concept detectors in order to study the impact of improving or degrading detection on retrieval.
In the simulation the confidence scores of the positive and the negative class of known concept occurrences are modeled as Gaussian distributions. Changes in detector performance are simulated by changing the Gaussians’ parameters. For each concept in each shot we generated confidence scores randomly from the Gaussian corresponding to the concept occurrence status. On the resulting collection of confidence scores, we executed the considered ranking functions, resulting in the average precision of each method with these confidence scores. We repeated this procedure 25 times, yielding an estimation of the search performance we would expect for retrieval using detectors with these parameters. To keep our discussion focused, we only investigate the search performance when changing the confidence scores’ mean of the positive class—therefore making the detector on average more confident about the concept occurrences. For a more detailed description of this simulation approach, we refer the interested reader to Aly et al. (2012).
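The simulation loop above can be sketched as follows for a single concept. The collection size, class means, and number of repeats below are illustrative; the text’s setup additionally executes full ranking functions over many concepts, whereas this sketch ranks by the raw simulated score.

```python
import random

def average_precision(ranked_relevance):
    """AP of a ranked list of 0/1 labels (here: true concept occurrence)."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(hits, 1)

def simulate(occurrences, mu_pos, mu_neg, sd, repeats=25, seed=42):
    """Draw a confidence score for every shot from the Gaussian matching
    its true occurrence status, rank shots by score, and average AP over
    repeated simulations."""
    rng = random.Random(seed)
    aps = []
    for _ in range(repeats):
        scored = [(rng.gauss(mu_pos if occ else mu_neg, sd), occ)
                  for occ in occurrences]
        scored.sort(reverse=True)
        aps.append(average_precision([occ for _, occ in scored]))
    return sum(aps) / repeats

# Toy collection of 200 shots, 20% containing the concept.
truth = [1] * 40 + [0] * 160

# Raising the positive-class mean simulates a more confident detector.
weak = simulate(truth, mu_pos=0.2, mu_neg=0.0, sd=1.0)
strong = simulate(truth, mu_pos=3.0, mu_neg=0.0, sd=1.0)
```

Moving the positive-class mean away from the negative-class mean separates the two Gaussians and drives the simulated average precision upward, which is exactly the knob varied in the experiments.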
6.2 Shot retrieval
Considered ranking functions (Rank Func.) for shot retrieval (c′: binary detector output, \(P(C|o)>0.5 \rightarrow c'=1\); \(p=P(C|R)\); \(q=P(C|\bar{R}) \approx P(C)\))
Rank Func.  Description  Definition 

CombMNZ  Multiply non-zero  \(\prod\nolimits_{i}{P(C_i|o_i)}\;\hbox{with}\;P(C_i|o_i)>0\) 
CombSUM  Unweighted sum of scores  \(\sum\nolimits_i{P(C_i|o_i)}\) 
PMIWS  Pointwise mutual information weighting scheme  \(\sum\nolimits_i{\log\left(\frac{P(C_i|R)}{P(C_i)}\right)P(C_i|o_i)}\) 
Borda  Rank based  \(\sum\nolimits_i{rank(P(C_i|o_i))}\) 
BIM  Binary independence model  \(\sum\nolimits_{i}{c'_i \log\left(\frac{p(1-q)}{q(1-p)}\right)}\) 
ELM  Expected concept occurrence language model (λ = 0.1)  \(\prod\nolimits_i{\left[\lambda P(C_i|o_i) + (1-\lambda) P(C_i|\mathcal{D})\right]}\) 
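Several of these baselines are direct transcriptions of their table definitions; the sketch below implements a few of them. Function signatures and argument names are illustrative: `p_conf` holds the probabilistic detector outputs \(P(C_i|o_i)\) for the selected concepts of one shot.

```python
import math

def comb_sum(p_conf):
    """CombSUM: unweighted sum of the probabilistic detector outputs."""
    return sum(p_conf)

def comb_mnz(p_conf):
    """CombMNZ variant from the table: product of the non-zero P(C_i|o_i)."""
    prod = 1.0
    for p in p_conf:
        if p > 0:
            prod *= p
    return prod

def pmiws(p_conf, p_c_given_r, p_c):
    """PMIWS: pointwise-mutual-information weights times P(C_i|o_i)."""
    return sum(math.log(pr / pc) * p
               for p, pr, pc in zip(p_conf, p_c_given_r, p_c))

def bim(c_binary, p_c_given_r, p_c):
    """Binary independence model on thresholded detector outputs c'_i,
    with q = P(C_i|not-R) approximated by the concept prior P(C_i)."""
    return sum(c * math.log((pr * (1 - pc)) / (pc * (1 - pr)))
               for c, pr, pc in zip(c_binary, p_c_given_r, p_c))
```

Note how PMIWS weights each detector output by how much more frequent the concept is in relevant shots than overall, while BIM only rewards concepts whose thresholded output is 1.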
6.2.1 Risk parameter study
6.2.2 Performance comparison
Mean average precision of the ranking functions described in Table 2
Collection  tv05t  tv06t  tv07t  tv08t  tv08t  tv09t  Avg. rank 

Rank Func.  MM101  Vireo  Vireo  Vireo  MM09  MM09  
CombMNZ  0.064  0.033†  0.028  0.024†  0.042†  0.045†  4.7 
10/8  700/30  100/20  10/15  100/30  100/10  
PMIWS  0.054  0.039  0.021  0.041  0.058  0.067  2.7 
100/8  200/30  200/15  50/4  50/4  50/2  
Borda  0.050†  0.012†  0.020†  0.030  0.045†  0.058  5.5 
10/15  100/10  50/20  10/15  10/2  10/8  
BIM  0.044†  0.024†  0.026  0.037  0.050  0.063  4.8 
10/8  100/2  100/8  100/4  50/2  50/2  
ELM  0.071  0.040  0.031  0.040  0.050  0.064  2.3 
10/8  600/30  50/10  100/4  10/2  50/2  
PRFUBE  0.069  0.043  0.039  0.041  0.056  0.068  1.5 
150/10  600/30  100/45  100/4  100/4  50/2 
6.3 Segment retrieval
We now describe the experiments we undertook to evaluate the performance of the UCLM ranking function from Sect. 5 for segment retrieval. Because of the novelty of the segment retrieval task there is no standard set of queries. Therefore we decided to use the official queries for the tv05t and tv06t collections, replacing the common prefix “Find shots of …” with “Find news items about …”. Furthermore, we assumed that a news item is relevant to a given query if it contains at least one relevant shot, which we determined from the relevance judgments of the respective shot retrieval task. We propose that for most queries this is realistic since the user could be searching for the news item as a whole, rather than for shots within the news item.^{5}
To rule out random effects when generating samples for the UCLM method, see Sect. 5, we repeated each run ten times and reported the average.
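The sampling step can be sketched as follows for a single concept in a single segment: each sample draws one occurrence per shot from the probabilistic detector output, the draws are summed into a concept frequency, and each sampled representation is scored. The smoothing weight, collection prior, and scoring form below are illustrative stand-ins for the concept language model, not its exact formula.

```python
import math
import random

def sample_scores(p_occ_per_shot, n_samples=200, lam=0.1, p_coll=0.05, seed=0):
    """Monte Carlo estimate of (expected score, scores' standard deviation)
    for one concept in one segment: sample an occurrence per shot from
    P(C|o), turn the sample into a concept frequency, and score each
    sampled representation with a smoothed LM-style function."""
    rng = random.Random(seed)
    n = len(p_occ_per_shot)
    scores = []
    for _ in range(n_samples):
        cf = sum(1 for p in p_occ_per_shot if rng.random() < p)
        scores.append(math.log(lam * cf / n + (1 - lam) * p_coll))
    mu = sum(scores) / n_samples
    var = sum((s - mu) ** 2 for s in scores) / n_samples
    return mu, math.sqrt(var)

# A segment whose shots likely contain the concept vs. one that does not.
mu_hi, sd_hi = sample_scores([0.9, 0.8, 0.9])
mu_lo, sd_lo = sample_scores([0.1, 0.1, 0.1])
```

With 200 samples the estimate still carries some sampling noise, which is why each run is repeated ten times and averaged in the experiments.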
6.3.1 Risk parameter study
6.3.2 Performance comparison
Results of comparing the proposed UCLM framework against four other methods described in related work
Ranking function  tv05t  tv06t  

Concepts n  MAP  P10  Concepts n  MAP  P10  
CombMNZ  10  0.105  0.045  8  0.034  0.040 
PMIWS  6  0.102  0.080  2  0.050  0.065 
Borda  1  0.090  0.000  2  0.052  0.061 
Best1  5  0.094  0.245  6  0.073  0.083 
ECFLM  10  0.192  0.287  32  0.101  0.143 
UCLM  10  0.214*  0.291  18  0.135*  0.151 
6.4 Simulated concept detectors
Figure 7b shows the simulation results for the segment retrieval task. At low detector performance, the UCLM ranking function performs practically identically to the ECFLM ranking function. With higher detector performance, the UCLM ranking function outperforms it. The Best1 ranking function increases performance only with much higher detector performance.
6.5 Influence of the scores’ standard deviation
7 Discussion
We now discuss the experimental results obtained in the previous section.
7.1 Effectiveness
Both derivations of the URR framework, PRFUBE and UCLM, showed significant improvement over most other retrieval methods from other uncertainty classes, as shown in Tables 3 and 4. Furthermore, according to the simulations presented in Fig. 7, both methods will also continue having a strong performance compared to other methods as concept detector performance improves.
7.2 Robustness
Given the relatively low overall performance numbers, strong performance in some collections could be caused by particular “lucky” detections in relevant shots. Therefore, a robust retrieval method is not only effective (has good performance in many collections) but also stable (performs similarly across collections). Table 3 shows that the PRFUBE is robust over six different collections. Similarly, the UCLM method performed stably on two collections. Furthermore, the detector simulation experiments in Fig. 7 suggest that the performance improvements are robust against changes of detectors.
7.3 Riskattitude
In both instances of the URR framework, a risk-neutral or risk-loving attitude helped performance. For the PRFUBE, the risk-loving attitude did not increase performance. We propose that the almost monotonic relationship between expected score and standard deviation in Fig. 8 is the reason why the standard deviation does not improve the ranking for the PRFUBE. We expect that this practically monotonic relationship originates from the independence assumptions made in Eqs. (11)–(13), which are known not to match the data (Cooper 1995), and propose further investigations for future work. For the UCLM, there was much higher variability in the standard deviation compared to the expected scores, giving the standard deviation the possibility to improve the ranking. Here, a risk-loving attitude improved performance significantly over the strongest baseline.
8 Conclusions
In summary, we proposed the URR framework that meets the challenge to define effective and robust ranking functions in conceptbased video retrieval under detector uncertainty. While the framework is independent of the retrieval task, we adapted it to the tasks of retrieving shots and (long) segments. For shot retrieval, our framework improved over five baselines on six collections, and for segment retrieval, it improved significantly over four baselines on two collections. Furthermore, when simulating improved concept detectors these improvements prevailed. We now discuss our conclusions in more detail.
The URR framework considers basic ranking functions adapted from text retrieval based on representations of known concept occurrences. The uncertainty of detectors is handled separately: the framework takes into account multiple concept-based representations per document. It uses the confidence scores of detectors to assign each representation a probability of being the correct representation. The application of the considered basic ranking function to the multiple representations results in multiple scores for each document. Inspired by the mean-variance analysis framework by Wang (2009), the URR framework ranks documents by the expected score plus a weighted expression of the scores’ standard deviation, which represents the chance that scores are actually higher than the expected score. We demonstrated the ability of the general framework to produce effective and robust ranking functions by applying it to two retrieval tasks: shot retrieval and segment retrieval.
For shot retrieval, the framework used the probability of relevance given concept occurrences as a ranking function, derived from the probability-of-relevance ranking function originally proposed in text retrieval (Robertson et al. 1981). In terms of mean average precision, this ranking function improved over six baselines, representing other approaches to detector uncertainty, on three out of six collections; where it performed worse than a baseline, the differences were not significant. When considering all queries of the six collections together, the improvements over all baselines were significant. For segment retrieval, we proposed that ranking functions should include the within-segment importance when retrieving long segments, and we used the concept frequency to represent this importance. We calculated the expected score and the scores' standard deviation by Monte Carlo sampling with 200 samples, to cope with the prohibitively large number of possible representations. Based on the representation of concept frequencies, we used the concept language model as a ranking function, originally proposed in Aly et al. (2010) and derived from language models in text retrieval, see Hiemstra (2001). We showed through simulation experiments that the search performance improves with improved detectors. Based on these results, we conclude that the application of the URR framework results in effective ranking functions.
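The Monte Carlo estimation step can be sketched as follows. The interface is a hypothetical simplification of ours: per-concept lists of per-shot occurrence probabilities (from the detectors' confidence scores), and a pluggable scoring function standing in for the concept language model:

```python
import random
import statistics

def mc_expected_score_and_std(shot_probs, score_fn, n_samples=200, seed=42):
    """Monte Carlo estimate of a segment's expected score and score std (sketch).

    shot_probs: for each concept, a list of per-shot posterior probabilities
        that the concept occurs in that shot.
    score_fn: ranking function on a sampled representation of concept
        frequencies (hypothetical stand-in for a concept language model).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        # Sample one complete representation: for each concept, draw the
        # occurrence in every shot and count occurrences (concept frequency).
        freqs = [sum(rng.random() < p for p in probs) for probs in shot_probs]
        scores.append(score_fn(freqs))
    return statistics.mean(scores), statistics.pstdev(scores)
```

Instead of enumerating all 2^(concepts × shots) possible representations, the estimate converges on a fixed budget of samples, which is what makes the 200-sample setting above tractable for long segments.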
For ranking functions to be robust, the URR framework explicitly models the risk-neutral choice and the risk of choosing this score by the expected score and the scores' standard deviation, respectively. We found that a risk-averse attitude resulted in poor performance for both retrieval tasks. For shot retrieval, considering the scores' standard deviation did not improve over the condition in which only the expected score was used. ^{9} We found that the scores' standard deviation often increased monotonically with the expected score, which prevents the standard deviation from influencing the ranking. We attributed this behavior to the common independence assumptions made in IR, which are also made in the shot ranking function but often do not match the data (Cooper 1995). For the segment retrieval task, the use of the scores' standard deviation significantly improved the search performance compared to exclusively using the expected score. For both retrieval tasks, the ranking functions derived from the URR framework performed among the best two systems over all considered collections and detectors. Based on these findings, we conclude that the ranking functions derived from the URR framework are also robust.
The URR framework itself makes few assumptions about the uncertain representation; the concrete modeling choices above were made for the specific shot retrieval and segment retrieval tasks. As future work, we therefore aim to apply the URR framework to other uncertain representations, for example the uncertain variants of spoken text generated by probabilistic automatic speech recognition, or uncertain references to known entities in text retrieval. Finally, the URR framework does not consider the overall performance of concept detectors, which recently received research interest (Yang and Hauptmann 2008). Therefore, we propose to extend the URR framework with measures that incorporate the overall detector performance.
Footnotes
 1.
We use the terms document and video shot or a longer video segment interchangeably as both refer to retrievable units of information.
 2.
 We use notation similar to the unusual notation d_j.W throughout this paper to prevent an excessive number of subscripts.
 3.
The URR framework was originally proposed in the PhD thesis of the first author (Aly 2010).
 4.
 Note that the distribution of d.C is discrete, although the score might be real-valued. The reason is that the arguments to score, d.C, are discrete.
 5.
A similar assumption is made during the creation of relevance judgments for the text retrieval workshop TREC, where a document is relevant if a part of it is relevant.
 6.
We also investigated the use of the minimum or maximum confidence score but did not find any improvements.
 7.
 Note that we used the development collection from Snoek et al. (2006) as a test collection since it contained more shots, making the simulation results more realistic.
 8.
 For shot retrieval, we left out the CombMNZ ranking function since its results were similar to those of the PMIWS method. For segment retrieval, we left out PMIWS since it performed similarly to CombMNZ.
 9.
 Note that ranking by the expected score is equivalent to the marginalization approach which we originally proposed in Aly et al. (2008).
Notes
Acknowledgments
We would like to thank the researchers of the MediaMill project and the Vireo project for contributing their concept detector output. We would also like to thank the anonymous reviewers for their constructive feedback. Part of the work reported here was funded by the EU project AXES (FP7-269980). Author DH was funded under grant 639.022.809 of the Netherlands Organization for Scientific Research, NWO. Authors AD and AS were funded by Science Foundation Ireland under grant 07/CE/I1147. Author AD is also supported by the Irish Health Research Board under grant MCPD/2010/12.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
References
 Aji, S. M., & McEliece, R. J. (2000). The generalized distributive law. IEEE Transactions on Information Theory, 46(2), 325–343. doi: 10.1109/18.825794.
 Aly, R. (2010). Modeling representation uncertainty in concept-based multimedia retrieval. PhD thesis, University of Twente, Enschede. http://dx.doi.org/10.3990/1.9789036530538.
 Aly, R., Hiemstra, D., de Vries, A. P., & de Jong, F. (2008). A probabilistic ranking framework using unobservable binary events for video search. In CIVR '08: Proceedings of the international conference on content-based image and video retrieval (pp. 349–358). New York, NY, USA: ACM. doi: 10.1145/1386352.1386398.
 Aly, R., Hiemstra, D., & de Vries, A. P. (2009). Reusing annotation labor for concept selection. In CIVR '09: Proceedings of the international conference on content-based image and video retrieval. New York, NY, USA: ACM.
 Aly, R., Doherty, A., Hiemstra, D., & Smeaton, A. (2010). Beyond shot retrieval: Searching for broadcast news items using language models of concepts. In ECIR '10: Proceedings of the 32nd European conference on IR research on advances in information retrieval (pp. 241–252). Berlin, Heidelberg: Springer. Lecture Notes in Computer Science, Vol. 5993.
 Aly, R., Hiemstra, D., de Jong, F., & Apers, P. (2012). Simulating the future of concept-based video retrieval under improved detector performance. Multimedia Tools and Applications, 60(1), 203–231. doi: 10.1007/s11042-011-0818-x.
 Benjelloun, O., Sarma, A. D., Halevy, A., & Widom, J. (2006). ULDBs: Databases with uncertainty and lineage. In VLDB '06: Proceedings of the 32nd international conference on very large data bases (pp. 953–964). VLDB Endowment.
 Chia, T. K., Sim, K. C., Li, H., & Ng, H. T. (2008). A lattice-based approach to query-by-example spoken document retrieval. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval (pp. 363–370). New York, NY, USA: ACM. doi: 10.1145/1390334.1390397.
 Cooper, W. S. (1995). Some inconsistencies and misidentified modeling assumptions in probabilistic information retrieval. ACM Transactions on Information Systems, 13(1), 100–111. doi: 10.1145/195705.195735.
 Croft, W. B. (1981). Document representations in probabilistic models of information retrieval. Journal of the American Society for Information Science, 32(6), 451–457.
 Ding, Z., & Peng, Y. (2004). A probabilistic extension to ontology language OWL. In Proceedings of the 37th annual Hawaii international conference on system sciences (p. 10). doi: 10.1109/HICSS.2004.1265290.
 Fuhr, N. (1989). Models for retrieval with probabilistic indexing. Information Processing and Management, 25(1), 55–72.
 Hauptmann, A. G., Yan, R., Lin, W. H., Christel, M., & Wactlar, H. (2007). Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. IEEE Transactions on Multimedia, 9(5), 958–966. doi: 10.1109/TMM.2007.900150.
 Hiemstra, D. (2001). Using language models for information retrieval. PhD thesis, University of Twente, Enschede. http://purl.org/utwente/36473.
 Hiemstra, D., Rode, H., van Os, T. R., & Flokstra, J. (2006). PF/Tijah: Text search in an XML database system. In Proceedings of the 2nd international workshop on open source information retrieval (OSIR) (pp. 12–17). Seattle, WA, USA: Ecole Nationale Supérieure des Mines de Saint-Etienne.
 Hsu, W. H., Kennedy, L. S., & Chang, S.-F. (2006). Video search reranking via information bottleneck principle. In MULTIMEDIA '06: Proceedings of the 14th annual ACM international conference on multimedia (pp. 35–44). New York, NY, USA: ACM. doi: 10.1145/1180639.1180654.
 Huurnink, B., Hofmann, K., & de Rijke, M. (2008). Assessing concept selection for video retrieval. In MIR '08: Proceedings of the first ACM international conference on multimedia information retrieval. New York, NY, USA: ACM.
 Jiang, Y. G., Yang, J., Ngo, C. W., & Hauptmann, A. (2010). Representations of keypoint-based semantic concept detection: A comprehensive study. IEEE Transactions on Multimedia, 12(1), 42–53. doi: 10.1109/TMM.2009.2036235.
 Kennedy, L., Chang, S.-F., & Natsev, A. (2008). Query-adaptive fusion for multimodal search. Proceedings of the IEEE, 96(4), 567–588. doi: 10.1109/JPROC.2008.916345.
 Li, X., Wang, D., Li, J., & Zhang, B. (2007). Video search in concept subspace: A text-like paradigm. In CIVR '07: Proceedings of the 6th ACM international conference on image and video retrieval (pp. 603–610). New York, NY, USA: ACM. doi: 10.1145/1282280.1282366.
 Liu, J. S. (2002). Monte Carlo strategies in scientific computing. New York: Springer.
 Markowitz, H. (1952). Portfolio selection. The Journal of Finance, 7(1), 77–91. http://www.jstor.org/stable/2975974.
 McDonald, K., & Smeaton, A. F. (2005). A comparison of score, rank and probability-based fusion methods for video shot retrieval. In Image and video retrieval (Vol. 3568/2005, pp. 61–70). Berlin/Heidelberg: Springer. doi: 10.1007/11526346_10.
 Naphade, M., Smith, J., Tesic, J., Chang, S.-F., Hsu, W., Kennedy, L., Hauptmann, A. G., & Curtis, J. (2006). Large-scale concept ontology for multimedia. IEEE MultiMedia, 13(3), 86–91. doi: 10.1109/MMUL.2006.63.
 Papoulis, A. (1984). Probability, random variables, and stochastic processes. Singapore: McGraw-Hill.
 Platt, J. (2000). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in large margin classifiers (pp. 61–74). Cambridge, MA: MIT Press.
 Robertson, S. E., van Rijsbergen, C. J., & Porter, M. F. (1981). Probabilistic models of indexing and searching. In SIGIR '80: Proceedings of the 3rd annual ACM conference on research and development in information retrieval (pp. 35–56). Kent, UK: Butterworth & Co.
 Smeaton, A. F., Over, P., & Kraaij, W. (2006). Evaluation campaigns and TRECVid. In MIR '06: Proceedings of the 8th ACM international workshop on multimedia information retrieval (pp. 321–330). New York, NY, USA: ACM Press. doi: 10.1145/1178677.1178722.
 Snoek, C. G. M., & Worring, M. (2009). Concept-based video retrieval. Foundations and Trends in Information Retrieval, 4(2), 215–322.
 Snoek, C. G. M., Worring, M., van Gemert, J. C., Geusebroek, J. M., & Smeulders, A. W. M. (2006). The challenge problem for automated detection of 101 semantic concepts in multimedia. In MULTIMEDIA '06: Proceedings of the 14th annual ACM international conference on multimedia (pp. 421–430). New York, NY, USA: ACM Press. doi: 10.1145/1180639.1180727.
 Snoek, C. G. M., Huurnink, B., Hollink, L., de Rijke, M., Schreiber, G., & Worring, M. (2007). Adding semantics to detectors for video retrieval. IEEE Transactions on Multimedia, 9(5), 975–986.
 Snoek, C. G. M., van de Sande, K., de Rooij, O., Huurnink, B., van Gemert, J., Uijlings, J., He, J., Li, X., Everts, I., Nedovic, V., van Liempt, M., van Balen, R., de Rijke, M., Geusebroek, J., Gevers, T., Worring, M., Smeulders, A., Koelma, D., Yan, F., Tahir, M., Mikolajczyk, K., & Kittler, J. (2008). The MediaMill TRECVid 2008 semantic video search engine. In Proceedings of the 8th TRECVid workshop, Gaithersburg, USA.
 Toharia, P., Robles, O. D., Smeaton, A. F., & Rodríguez, A. (2009). Measuring the influence of concept detection on video retrieval. In CAIP 2009: 13th international conference on computer analysis of images and patterns. Berlin: Springer.
 Voorhees, E. M., & Harman, D. (2000). Overview of the ninth text retrieval conference (TREC-9). In Proceedings of the ninth text retrieval conference (TREC-9) (pp. 1–14).
 de Vries, A. P., Kazai, G., & Lalmas, M. (2004). Tolerance to irrelevance: A user-effort oriented evaluation of retrieval systems without predefined retrieval unit. In RIAO 2004 conference proceedings (pp. 463–473). Avignon, France.
 Wang, J. (2009). Mean-variance analysis: A new document ranking theory in information retrieval. In ECIR '09: Proceedings of the 31st European conference on IR research on advances in information retrieval (pp. 4–16). Berlin, Heidelberg: Springer. doi: 10.1007/978-3-642-00958-7_4.
 Wang, J., & Zhu, J. (2009). Portfolio theory of information retrieval. In SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval (pp. 115–122). New York, NY, USA: ACM. doi: 10.1145/1571941.1571963.
 Yan, R. (2006). Probabilistic models for combining diverse knowledge sources in multimedia retrieval. PhD thesis, Carnegie Mellon University. http://yanrong.info/publications.htm.
 Yang, J., & Hauptmann, A. G. (2006). Exploring temporal consistency for video analysis and retrieval. In MIR '06: Proceedings of the 8th ACM international workshop on multimedia information retrieval (pp. 33–42). New York, NY, USA: ACM. doi: 10.1145/1178677.1178685.
 Yang, J., & Hauptmann, A. G. (2008). (Un)reliability of video concept detection. In CIVR '08: Proceedings of the 2008 international conference on content-based image and video retrieval (pp. 85–94). New York, NY, USA: ACM. doi: 10.1145/1386352.1386367.
 Zhai, C., & Lafferty, J. (2004). A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2), 179–214. doi: 10.1145/984321.984322.
 Zheng, W., Li, J., Si, Z., Lin, F., & Zhang, B. (2006). Using high-level semantic features in video retrieval. In Image and video retrieval (Vol. 4071/2006, pp. 370–379). Berlin/Heidelberg: Springer. doi: 10.1007/11788034_38.