Abstract
We present some theoretical results related to the problem of actively searching a 3D scene to determine the positions of one or more prespecified objects. We investigate the effects that input noise, occlusion, and the VC-dimensions of the related representation classes have in terms of localizing all objects present in the search region, under finite computational resources and a search cost constraint. We present a number of bounds relating the noise-rate of low-level feature detection to the VC-dimension of an object representable by an architecture satisfying the given computational constraints. We prove that under certain conditions, the corresponding classes of object localization and recognition problems are efficiently learnable in the presence of noise and under a purposive learning strategy, as there exists a polynomial upper bound on the minimum number of examples necessary to correctly localize the targets under the given models of uncertainty. We also use these arguments to show that passive approaches to the same problem do not necessarily guarantee that the problem is efficiently learnable. Under this formulation, we prove the existence of a number of emergent relations between the object detection noise-rate, the scene representation length, the object class complexity, and the representation class complexity, which demonstrate that selective attention is not only necessary due to computational complexity constraints, but it is also necessary as a noise-suppression mechanism and as a mechanism for efficient object class learning. These results concretely demonstrate the advantages of active, purposive and attentive approaches for solving complex vision problems.
References
Aloimonos, J., Bandopadhay, A., & Weiss, I. (1988). Active vision. International Journal of Computer Vision, 1, 333–356.
Andreopoulos, A., & Tsotsos, J. K. (2008). Active vision for door localization and door opening using playbot: A computer controlled wheelchair for people with mobility impairments. In Proc. 5th Canadian conference on computer and robot vision.
Andreopoulos, A., & Tsotsos, J. K. (2009). A theory of active object localization. In Proc. int. conf. on computer vision.
Andreopoulos, A., & Tsotsos, J. K. (2012). On sensor bias in experimental methods for comparing interest point, saliency and recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1), 110–126.
Andreopoulos, A., Hasler, S., Wersing, H., Janssen, H., Tsotsos, J. K., & Körner, E. (2011). Active 3D object localization using a humanoid robot. IEEE Transactions on Robotics, 27(1), 47–64.
Angluin, D., & Laird, P. (1988). Learning from noisy examples. Machine Learning, 2(4), 343–370.
Aristotle (350 B.C.) \(\varPi\epsilon\rho\acute{\iota}\) \(\varPsi\upsilon\chi\acute{\eta}\varsigma\) (On the Soul).
Bajcsy, R. (1985). Active perception vs. passive perception. In IEEE workshop on computer vision representation and control, Bellaire, Michigan.
Ballard, D. (1991). Animate vision. Artificial Intelligence, 48, 57–86.
Barrow, H., & Popplestone, R. (1971). Relational descriptions in picture processing. Machine Intelligence, 6, 377–396.
Bartlett, P. L., & Mendelson, S. (2002). Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.
Bartlett, P. L., Long, P. M., & Williamson, R. C. (1996). Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52, 434–452.
Baum, E., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1(1), 151–160.
Ben-David, S., & Lindenbaum, M. (1998). Localization vs. identification of semi-algebraic sets. Machine Learning, 32, 207–224.
Biederman, I. (1987). Recognition-by-components: a theory of human image understanding. Psychological Review, 94, 115–147.
Boshra, M., & Bhanu, B. (2000). Predicting performance of object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9), 956–969.
Brentano, F. (1874). Psychologie vom empirischen Standpunkt. Leipzig: Meiner.
Broadbent, D. (1958). Perception and communication. Elmsford: Pergamon Press.
Brooks, R., Greiner, R., & Binford, T. (1979). The ACRONYM model-based vision system. In Proc. of 6th int. joint conf. on artificial intelligence.
Bruce, N. D., & Tsotsos, J. K. (2009). Saliency, attention and visual search: an information theoretic approach. Journal of Vision, 9(3), 1–24.
Callari, F., & Ferrie, F. (2001). Active recognition: looking for differences. International Journal of Computer Vision, 43(3), 189–204.
de Berg, M., van Kreveld, M., Overmars, M., & Schwarzkopf, O. (2000). Computational geometry: algorithms and applications. Berlin: Springer.
Dickinson, S., Christensen, H., Tsotsos, J., & Olofsson, G. (1997). Active object recognition integrating attention and viewpoint control. Computer Vision and Image Understanding, 67(3), 239–260.
Dickinson, S., Wilkes, D., & Tsotsos, J. (1999). A computational model of view degeneracy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8), 673–689.
Ekvall, S., Jensfelt, P., & Kragic, D. (2006). Integrating active mobile robot object recognition and SLAM in natural environments. In Proc. Intelligent robots and systems.
Findlay, J. M., & Gilchrist, I. D. (2003). Active vision: the psychology of looking and seeing. London: Oxford University Press.
Fukushima, K. (1980). Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202.
Garvey, T. (1976). Perceptual strategies for purposive vision (Tech. rep., Nr. 117). SRI Int’l.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models: single neurons, populations, plasticity. Cambridge: Cambridge University Press.
Gibson, J. (1979). The ecological approach to visual perception. Boston: Houghton Mifflin.
Giefing, G., Janssen, H., & Mallot, H. (1992). Saccadic object recognition with an active vision system. In International conference on pattern recognition.
Grimson, W. E. L. (1991). The combinatorics of heuristic search termination for object recognition in cluttered environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 920–935.
Grossberg, S. (1973). Contour enhancement, short-term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, 52, 213–257.
Hinton, G. (1978). Relaxation and its role in vision. PhD thesis, University of Edinburgh.
Ikeuchi, K., & Kanade, T. (1988). Automatic generation of object recognition programs. Proceedings of the IEEE, 76(8), 1016–1035.
Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254–1259.
Kearns, M. (1993). Efficient noise-tolerant learning from statistical queries. In Proc. of the 25th ACM symposium on the theory of computing.
Kearns, M. J., & Vazirani, U. V. (1994). An introduction to computational learning theory. Cambridge: MIT Press.
Laporte, C., & Arbel, T. (2006). Efficient discriminant viewpoint selection for active Bayesian recognition. International Journal of Computer Vision, 68(3), 267–287.
Lindenbaum, M. (1997). An integrated model for evaluating the amount of data required for reliable recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(11), 1251–1264.
Marr, D. (1982). Vision: a computational investigation into the human representation and processing of visual information. New York: Freeman.
Maver, J., & Bajcsy, R. (1993). Occlusions as a guide for planning the next view. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(5), 417–433.
McAllester, D. A. (2003). PAC-Bayesian stochastic model selection. Machine Learning, 51, 5–21.
Meger, D., Forssen, P., Lai, K., Helmer, S., McCann, S., Southey, T., Baumann, M., Little, J., & Lowe, D. (2008). Curious George: an attentive semantic robot. Robotics and Autonomous Systems, 56(6), 503–511.
Minsky, M., & Papert, S. (1969). Perceptrons. Cambridge: MIT Press.
Najemnik, J., & Geisler, W. S. (2005). Optimal eye movement strategies in visual search. Nature, 434, 387–391.
Navalpakkam, V., & Itti, L. (2005). Modeling the influence of task on attention. Vision Research, 45(2), 205–231.
Nevatia, R., & Binford, T. (1977). Description and recognition of curved objects. Artificial Intelligence, 8, 77–98.
Rimey, R. D., & Brown, C. M. (1994). Control of selective perception using Bayes nets and decision theory. International Journal of Computer Vision, 12(2/3), 173–207.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408.
Roy, S. D., Chaudhury, S., & Banerjee, S. (2000). Isolated 3D object recognition through next view planning. IEEE Transactions on Systems, Man and Cybernetics. Part A. Systems and Humans, 30(1), 67–76.
Saidi, F., Stasse, O., Yokoi, K., & Kanehiro, F. (2007). Online object search with a humanoid robot. In Proc. Intelligent robots and systems.
Schiele, B., & Crowley, J. (1998). Transinformation for active object recognition. In Proc. int. conf. on computer vision.
Seeger, M. (2002). The proof of McAllester’s PAC-Bayesian theorem. In Advances in neural information processing systems.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381(6582), 520–522.
Tsotsos, J. K. (1990). Analyzing vision at the complexity level. Behavioral and Brain Sciences, 13(3), 423–445.
Tsotsos, J. K. (1992). On the relative complexity of active vs. passive visual search. International Journal of Computer Vision, 7(2), 127–141.
Tsotsos, J. K. (2011). A computational perspective on visual attention. Cambridge: MIT Press.
Tsotsos, J. K., Culhane, S. M., Wai, W. Y. K., Lai, Y., Davis, N., & Nuflo, F. (1995). Modeling visual attention via selective tuning. Artificial Intelligence, 78, 507–545.
Tsotsos, J., Liu, Y., Martinez-Trujillo, J., Pomplun, M., Simine, E., & Zhou, K. (2005). Attending to visual motion. Computer Vision and Image Understanding, 100(1–2), 3–40.
Valiant, L. (1984a). Deductive learning. Philosophical Transactions of the Royal Society of London, 312, 441–446.
Valiant, L. (1984b). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.
Valiant, L. (1985). Learning disjunctions of conjunctions. In Proc. 9th international joint conference on artificial intelligence.
Verghese, P., & Pelli, D. (1992). The information capacity of visual attention. Vision Research, 32(5), 983–995.
Wixson, L. E., & Ballard, D. H. (1994). Using intermediate objects to improve the efficiency of visual search. International Journal of Computer Vision, 12(2/3), 209–230.
Ye, Y., & Tsotsos, J. (1999). Sensor planning for 3D object search. Computer Vision and Image Understanding, 73(2), 145–168.
Ye, Y., & Tsotsos, J. (2001). A complexity level analysis of the sensor planning task for object search. Computational Intelligence, 17(4), 605–620.
Additional information
Alexander Andreopoulos produced this work while at York University.
Appendices
Appendix A: Proof of Theorem 3 Using a Purposive Sampling Strategy
We provide a proof of Theorem 3 that uses Angluin and Laird’s model for learning conjunctions from noisy examples (Angluin and Laird 1988; Kearns and Vazirani 1994), with some notable differences and modifications due to the constraints imposed by the use of partial oracles for the object localization problem under a purposive sampling strategy. The reader may skip the Appendices without a significant loss of continuity in understanding the paper’s main points.
A.1 Overview
In order to be consistent with our previously introduced notation, we assume a literal \(l_{i}\) can represent either \(\neg b_{i}\) or \(b_{i}\) (\(l_{i}\in\{b_{i},\neg b_{i}\}\)), for some boolean variable \(b_{i}\).
Definition 40
(Significant Literal)
We say that a literal \(l_{i}\) is significant with respect to a target map sampling distribution \(\mathcal{\overline{D}}(T)\) of the elements in \(\mathbf{X}^{\{0,1,\alpha\}}_{T}\) and an \(0<\epsilon<\frac{1}{2}\), if \(p_{0}(l_{i})>\frac{\epsilon}{8T}\), where for a random sample \(\mathbf{x}=(x_{1},\ldots,x_{T})\) sampled from \(\mathcal{\overline{D}}(T)\), \(p_{0}(b_{i})\) denotes the probability that \(x_{i}=0\) and \(p_{0}(\neg b_{i})\) denotes the probability that \(x_{i}=1\).
Notice in the above definition that, for any target map cell \(i\), the higher the probability that \(\mathcal{\overline{D}}(T)\) assigns a label of \(\alpha\) to cell \(i\), the less likely it is that a literal \(l_{i}\) is significant.
Definition 41
(Harmful Literal)
We say that a literal \(l_{i}\) is harmful with respect to a target map sampling distribution \(\mathcal{\overline{D}}(T)\) of the elements in \(\mathbf{X}^{\{0,1,\alpha\}}_{T}\), an \(0<\epsilon<\frac{1}{2}\) and a partial concept \(\overline{c}:\mathbf{X}^{\{0,1,\alpha\}}_{T}\rightarrow \{0,1\}\), if \(p_{\bar{c}}(l_{i})>\frac{\epsilon}{8T}\), where for a random sample \(\mathbf{x}=(x_{1},\ldots,x_{T})\) sampled from \(\mathcal{\overline{D}}(T)\), \(p_{\bar{c}}(b_{i})\) denotes the probability that \(x_{i}=0\) and \(\overline{c}(\mathbf{x})=1\), while \(p_{\bar{c}}(\neg b_{i})\) denotes the probability that \(x_{i}=1\) and \(\overline{c}(\mathbf{x})=1\).
We assume that every hypothesis \(h(\cdot)\) belongs to the concept class \(\bigcup^{T}_{i=0}\overline{\mathcal{C}}_{i}\), where \(\overline{\mathcal{C}}_{i}\) denotes the concept class of all \(i\)-CNF formulae, as per Definition 3. We will show that for all \(0<\epsilon<\frac{1}{2}\), if we represent \(h(\cdot)\) by the conjunction of all significant literals that are not harmful with respect to \(\mathcal{\overline{D}}(T)\), \(\epsilon\) and some \(c(\cdot)\in \mathcal{C}_{T}\), then we have \(\mathit{error}(\overline{\mathcal{D}}(T),\overline{h},\overline{c})\leq \epsilon\), where \(\overline{h}\), \(\overline{c}\) are the partial concepts of \(h\), \(c\) respectively. Moreover, for any \(0\leq \eta<\frac{1}{2}\) and for any \(0<\delta<\frac{1}{2}\), we can use a corrupt target map oracle of the form \(\mathit{EX}(\mathcal{\overline{D}}(T),c,\eta)\) to approximate the significant and non-harmful literals, so that with confidence at least \(1-\delta\), we have \(\mathit{error}(\overline{\mathcal{D}}(T),\overline{h},\overline{c})\leq \epsilon\). Notice that by defining \(h(\cdot)\), we have also implicitly defined the corresponding partial concept \(\overline{h}(\cdot)\). Once this is demonstrated, we discuss the problem of simulating the corrupt target map oracle \(\mathit{EX}(\mathcal{\overline{D}}(T),c,\eta)\).
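As an illustration of the construction just described, the following minimal Python sketch forms \(h(\cdot)\) from a finite list of labelled target maps, with empirical frequencies standing in for the true probabilities \(p_{0}\) and \(p_{\bar{c}}\). It is a sketch under stated assumptions rather than the procedure analysed in this appendix: the labels are taken to be noise free, and the names `ALPHA`, `literal_frequencies` and `build_hypothesis`, as well as the encoding of a literal as a pair `(i, negated)`, are hypothetical choices made only for this example.

```python
from fractions import Fraction

# Hypothetical stand-in for the label "alpha" of a target map cell.
ALPHA = "alpha"


def falsifying_value(negated):
    """Value of x_i that sets the literal to false:
    b_i is falsified by x_i = 0, and (not b_i) is falsified by x_i = 1."""
    return 1 if negated else 0


def literal_frequencies(samples, labels):
    """Empirical stand-ins for p_0 and p_cbar over all 2T literals.

    samples: list of length-T tuples with entries in {0, 1, ALPHA}
    labels:  list of 0/1 labels, one per sample (assumed noise free here)
    Returns two dicts keyed by (i, negated).
    """
    m = len(samples)
    T = len(samples[0])
    p0, pc = {}, {}
    for i in range(T):
        for negated in (False, True):
            v = falsifying_value(negated)
            hits0 = sum(1 for x in samples if x[i] == v)
            hitsc = sum(1 for x, a in zip(samples, labels)
                        if x[i] == v and a == 1)
            p0[(i, negated)] = Fraction(hits0, m)
            pc[(i, negated)] = Fraction(hitsc, m)
    return p0, pc


def build_hypothesis(samples, labels, eps):
    """Literals that are significant (p0 > eps/(8T)) and
    not harmful (pc <= eps/(8T)); their conjunction is h."""
    T = len(samples[0])
    threshold = Fraction(eps) / (8 * T)
    p0, pc = literal_frequencies(samples, labels)
    return sorted(lit for lit in p0
                  if p0[lit] > threshold and pc[lit] <= threshold)


if __name__ == "__main__":
    # Toy data: the label is 1 exactly when cell 0 holds the value 1.
    samples = [(1, 0), (1, 1), (0, 1), (0, 0), (1, ALPHA), (0, ALPHA)]
    labels = [1, 1, 0, 0, 1, 0]
    print(build_hypothesis(samples, labels, Fraction(1, 4)))
```

Cells labelled `ALPHA` match neither 0 nor 1, so they contribute to neither frequency. This mirrors the observation after Definition 40 that a high probability of the label \(\alpha\) makes a literal less likely to be significant.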
As one would expect, the overall noise level \(\eta\) of the modules used to simulate the corrupt target map oracle cannot be too high, and this will be rigorously proven later in this section. However, a surprising implication is that if we require an efficient simulation of the target map oracle, \(\eta\) cannot be arbitrarily close to zero either. This is surprising since in typical computational learning problems, the lower the examples’ noise rate, the faster the learning can take place. As we will see, there exists a push-pull relationship in that a lower noise rate \(\eta\) requires fewer samples from the corrupt target map oracle, but makes each invocation of the oracle’s simulation less efficient. Conversely, a higher noise rate \(\eta\) makes the oracle’s simulation more efficient, but also increases the number of samples that the corrupt target map oracle needs to return. In practice we can circumvent this cost, because the training-related processing can be done once and offline before the search starts. In many computational learning problems, this issue is not apparent as it is assumed that each call to the oracle is completed within unit time. This implies that there exist inherent upper and lower bounds on the permissible sources of errors (such as the error rates of low-level feature detectors) for efficient localization to take place. While the inability to have an arbitrarily low error bound might not be so important in our problem (since we can assume that the object detector’s training is done offline), this can have important implications in the Simultaneous Localization and Mapping (SLAM) problem. In the SLAM problem, for example, we might wish to obtain high-level representations of scene objects online (i.e., efficiently) as a means of mapping the environment and reducing the errors in the agent’s position/localization inside this map. 
In such a case, the better the object representations that we wish to learn, the more computationally demanding the problem will become, due to the increased computational demands on the object detector’s training, contradicting the intuitive belief that a lower error rate η leads to a more efficient algorithm. We will see that these results lead to a proof of Theorem 3.
Given an example \(\mathbf{x}=(x_{1},\ldots,x_{T})\in \mathbf{X}^{\{0,1,\alpha\}}_{T}\) sampled from \(\mathcal{\overline{D}}(T)\), an error occurs with respect to the error function of Theorem 3 if \(\overline{h}(\mathbf{x})=1\wedge \overline{c}(\mathbf{x})=0\) or \(\overline{h}(\mathbf{x})=0\wedge \overline{c}(\mathbf{x})=1\) holds. For \(\overline{h}(\mathbf{x})=1\wedge \overline{c}(\mathbf{x})=0\) to hold, two events \(A(l_{i})\), \(B(l_{i},\mathbf{x})\) must occur for at least one of the \(2T\) possible literals \(l_{i}\):

\(A(l_{i})\): \(l_{i}\in c(\cdot)\) and \(l_{i}\not\in h(\cdot)\).

\(B(l_{i},\mathbf{x})\): If \(l_{i}=b_{i}\) then \(x_{i}=0\), so that the assignment \(b_{i}\leftarrow x_{i}\) sets \(l_{i}\) to false; if \(l_{i}=\neg b_{i}\) then \(x_{i}=1\), so that the assignment \(b_{i}\leftarrow x_{i}\) sets \(l_{i}\) to false.
By the construction of \(h(\cdot)\) and since \(c(\cdot)\) contains no harmful literals, if \(l_{i}\not\in h(\cdot)\), then \(l_{i}\) must not be significant. Thus, by the union bound, the probability of \(\overline{h}(\mathbf{x})=1\wedge \overline{c}(\mathbf{x})=0\) occurring for a sample \(\mathbf{x}\in \mathcal{\overline{D}}(T)\), is bounded by
\[P_{\mathbf{x}\in \overline{\mathcal{D}}}\bigl[\overline{h}(\mathbf{x})=1\wedge \overline{c}(\mathbf{x})=0\bigr]\leq \sum_{i=1}^{T}\ \sum_{l_{i}\in \mathbf{b}:\,A(l_{i})}p_{0}(l_{i})\leq 2T\cdot\frac{\epsilon}{8T}=\frac{\epsilon}{4},\]
where \(\overline{\mathcal{D}}\) is shorthand for \(\overline{\mathcal{D}}(T)\) and \(\mathbf{b}\triangleq\{b_{i},\neg b_{i}\}\). Similarly, for \(\overline{h}(\mathbf{x})=0\wedge \overline{c}(\mathbf{x})=1\) to hold for a sample \(\mathbf{x}\in \mathcal{\overline{D}}(T)\), two events \(A'(l_{i})\), \(B'(l_{i},\mathbf{x})\) must occur for at least one of the \(2T\) possible literals \(l_{i}\):

\(A'(l_{i})\): \(l_{i}\not\in c(\cdot)\) and \(l_{i}\in h(\cdot)\).

\(B'(l_{i},\mathbf{x})\): If \(l_{i}=b_{i}\) then \(x_{i}=0\) and \(\overline{c}(\mathbf{x})=1\). If \(l_{i}=\neg b_{i}\) then \(x_{i}=1\) and \(\overline{c}(\mathbf{x})=1\).
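By the construction of \(h(\cdot)\), if \(l_{i}\in h(\cdot)\), then \(l_{i}\) must not be harmful. Thus, by the union bound, the probability of \(\overline{h}(\mathbf{x})=0\wedge \overline{c}(\mathbf{x})=1\) occurring for a sample \(\mathbf{x}\in \mathcal{\overline{D}}(T)\), is bounded by
\[P_{\mathbf{x}\in \overline{\mathcal{D}}}\bigl[\overline{h}(\mathbf{x})=0\wedge \overline{c}(\mathbf{x})=1\bigr]\leq \sum_{i=1}^{T}\ \sum_{l_{i}\in \mathbf{b}:\,A'(l_{i})}p_{\bar{c}}(l_{i})\leq 2T\cdot\frac{\epsilon}{8T}=\frac{\epsilon}{4},\]
where as before \(\mathbf{b}\triangleq\{b_{i},\neg b_{i}\}\).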
Thus, the probability of error is upper bounded by \(\frac{\epsilon}{4}+\frac{\epsilon}{4}=\frac{\epsilon}{2}\leq \epsilon\) as wanted. Notice, however, that \(p_{0}(l_{i})\), \(p_{\bar{c}}(l_{i})\) are the true probabilities, which we rarely have access to. Thus, assuming we have at our disposal a corrupt target map oracle \(\mathit{EX}(\mathcal{\overline{D}}(T),c,\eta)\), we are confronted by the question of whether there exists an upper bound on the cardinality of an example set that is acquired with such an oracle and is used to construct \(h(\cdot)\), so that for all \(0<\epsilon,\delta<\frac{1}{2}\) and for all \(0\leq \eta<\frac{1}{2}\), with probability at least \(1-\delta\), \(\mathit{error}(\mathcal{\overline{D}},\overline{h},\overline{c})\leq \epsilon\). Let \(\hat{p}_{0}(l_{i})\), \(\hat{p}_{\bar{c}}(l_{i})\) denote the estimates of \(p_{0}(l_{i})\) and \(p_{\bar{c}}(l_{i})\) respectively, assuming these estimates are based on the proportion of \(m\) examples acquired using the corrupt target map oracle \(\mathit{EX}(\mathcal{\overline{D}}(T),c,\eta)\) that satisfy the respective events. Notice that the estimate \(\hat{p}_{0}(l_{i})\) is independent of the noise rate \(\eta\) in the oracle and of the binary label of each example. Lemma 2 below gives strong hints that if we sample the oracle a sufficient number of times, we can say with a minimum degree of confidence that there will be a bound on the errors of the estimates \(\hat{p}_{0}(l_{i})\) for any literal \(l_{i}\). Notice, however, that we have not specified bounds on the error and confidence levels of \(\hat{p}_{0}(l_{i})\) that would make Theorem 3 hold. We postpone this discussion to Sect. A.2, since we first need to discuss how we estimate \(\hat{p}_{\bar{c}}(l_{i})\) with sufficient error and confidence. We now present a well known result from the literature. 
We use this result extensively, as a means of obtaining a sufficient number of feature samples to reliably localize the desired object instances that are present in the search space.
Lemma 2
Assume \(X_{1},\ldots,X_{m}\) is a sample of \(m\) independent Bernoulli random variables, where for all \(1\leq i\leq m\), we have \(E(X_{i})=p\). If \(\hat{p}=\frac{X_{1}+\cdots+X_{m}}{m}\), \(0<\delta,\epsilon<\frac{1}{2}\) and \(m\geq \frac{1}{\epsilon^{2}}\log(\frac{2}{\delta})\), then with confidence at least \(1-\delta\), the event \(p-\epsilon\leq \hat{p}\leq p+\epsilon\) occurs.
Proof
This is readily derived from Chernoff’s bounds, which guarantee that for \(0\leq \epsilon\leq 1\), \(P[|p-\hat{p}|>\epsilon]\leq 2\exp(-2m\epsilon^{2})\). If \(\delta\) denotes an upper bound on \(P[|p-\hat{p}|>\epsilon]\), then \(2\exp(-2m\epsilon^{2})<\delta\) is a sufficient condition on the range of valid values for \(\delta\). But this inequality holds iff \(m>\frac{1}{2\epsilon^{2}}\log(\frac{2}{\delta})\). This implies that, if \(m\geq \frac{1}{\epsilon^{2}}\log(\frac{2}{\delta})\), then with confidence at least \(1-\delta\), \(|p-\hat{p}|\leq \epsilon\) will hold. □
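To make the sample-size requirement of Lemma 2 concrete, the short Python sketch below computes the smallest integer \(m\) satisfying \(m\geq \frac{1}{\epsilon^{2}}\log(\frac{2}{\delta})\) and evaluates the tail bound \(2\exp(-2m\epsilon^{2})\) for that \(m\). It assumes that \(\log\) denotes the natural logarithm; the function names are hypothetical and chosen only for this illustration.

```python
import math
import random


def lemma2_sample_size(eps, delta):
    """Smallest integer m with m >= (1/eps^2) * ln(2/delta).

    Assumes the logarithm in Lemma 2 is the natural logarithm.
    """
    if not (0.0 < eps < 0.5 and 0.0 < delta < 0.5):
        raise ValueError("Lemma 2 is stated for 0 < eps, delta < 1/2")
    return math.ceil(math.log(2.0 / delta) / eps ** 2)


def chernoff_failure_bound(m, eps):
    """Upper bound 2*exp(-2*m*eps^2) on P[|p - p_hat| > eps]."""
    return 2.0 * math.exp(-2.0 * m * eps ** 2)


def estimate_bernoulli(p, m, rng):
    """Empirical mean p_hat of m independent Bernoulli(p) draws."""
    return sum(1 for _ in range(m) if rng.random() < p) / m


if __name__ == "__main__":
    eps, delta = 0.1, 0.05
    m = lemma2_sample_size(eps, delta)
    print("samples needed:", m)
    print("tail bound at that m:", chernoff_failure_bound(m, eps))
    print("one estimate of p = 0.3:",
          estimate_bernoulli(0.3, m, random.Random(0)))
```

Because the lemma asks for \(\frac{1}{\epsilon^{2}}\log(\frac{2}{\delta})\) samples while the proof only needs half as many, the tail bound evaluated at the returned \(m\) is smaller than \(\delta\).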
We now discuss how we could estimate \(\hat{p}_{\bar{c}}(l_{i})\) with arbitrarily good error and confidence bounds, assuming we have at our disposal a corrupt target map oracle \(\mathit{EX}(\mathcal{\overline{D}}(T),c,\eta)\). As previously indicated, the scene complexity is parameterized by the length \(\bar{l}\) defining the set of scene representations \(\mathcal{M}(\bar{l},\overline{\varGamma},\mathcal{R}_{1},\mathcal{R}_{2})\). We first state in Lemma 3 below some well-known results from the literature that outline how the problem could be dealt with if we had at our disposal an oracle with a nonzero noise rate. These results will be used in Sect. A.3 when proving Theorem 3 with a simulated corrupt target map oracle \(\mathit{EX}(\mathcal{\overline{D}}(T),c,\eta)\). Again, we have not discussed the selection of the proper error and confidence bounds for \(\hat{p}_{\bar{c}}(l_{i})\) that would guarantee that Theorem 3 holds, as we postpone the relevant discussion to Sects. A.2 and A.3.
Lemma 3
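Consider some corrupt target map oracle \(\mathit{EX}(\mathcal{\overline{D}}(T),c,\eta)\) defined for a \(0\leq \eta<\frac{1}{2}\), a concept \(c:\mathbf{X}^{\{0,1\}}_{T}\rightarrow\{0,1\}\) and a target map distribution \(\mathcal{\overline{D}}(T)\). Let \(\mathit{EX}(\mathcal{\overline{D}}(T),c)\) denote the corresponding noise free oracle (\(\eta=0\)). For \(\mathbf{x}\in \mathbf{X}^{\{0,1,\alpha\}}_{T}\), let \(\chi_{z}(\mathbf{x},\overline{c}(\mathbf{x}))\) be a deterministic function depending on \(\mathbf{x},\overline{c}(\mathbf{x})\) and whose range is \(\{0,1\}\). Let \(\mathbf{X}_{1}\) consist of all \(\mathbf{x}\in \mathbf{X}^{\{0,1,\alpha\}}_{T}\) for which \(\chi_{z}(\mathbf{x},0)\neq\chi_{z}(\mathbf{x},1)\). Let \(\mathbf{X}_{2}\) consist of all \(\mathbf{x}\in \mathbf{X}^{\{0,1,\alpha\}}_{T}\) for which \(\chi_{z}(\mathbf{x},0)=\chi_{z}(\mathbf{x},1)\). Also let \(p_{1}=P_{\mathbf{x}\in \mathcal{\overline{D}}}[\mathbf{x}\in \mathbf{X}_{1}]\), where \(\mathcal{\overline{D}}\) is shorthand notation for \(\mathcal{\overline{D}}(T)\). Finally, let \(\mathcal{D}_{1}\) correspond to the conditional distribution of \(\mathcal{\overline{D}}(T)\) restricted to samples from \(\mathbf{X}_{1}\) (in other words, \(P_{\mathbf{x}\in \mathcal{D}_{1}}[\mathbf{x}\in\mathbf{S}]=\frac{P_{\mathbf{x}\in \mathcal{\overline{D}}}[\mathbf{x}\in\mathbf{S}]}{p_{1}}\) for any \(\mathbf{S}\subseteq\mathbf{X}_{1}\)). If \(P_{\chi_{z}}\triangleq P_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c)}[\chi_{z}(\mathbf{x},a)=1]\), then
\[\begin{aligned}P_{\chi_{z}}&=p_{1}P_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{D}_{1},c)}[\chi_{z}(\mathbf{x},a)=1]+P_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{2})]\\&=\frac{1}{1-2\eta}P_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{1})]-\frac{\eta}{1-2\eta}p_{1}+P_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{2})].\end{aligned}\tag{7}\]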
Proof
A proof can be found in Kearns and Vazirani (1994) and is derived using elementary algebra and the basic Kolmogorov axioms of probability. □
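The following Python sketch checks, on a small finite distribution and with exact rational arithmetic, the standard noise-correction identity for statistical queries described by Kearns and Vazirani (1994), which we take to be the content of Lemma 3: \(P_{\chi_{z}}=\frac{1}{1-2\eta}P^{\eta}[(\chi_{z}=1)\wedge(\mathbf{x}\in\mathbf{X}_{1})]-\frac{\eta}{1-2\eta}p_{1}+P^{\eta}[(\chi_{z}=1)\wedge(\mathbf{x}\in\mathbf{X}_{2})]\), where \(P^{\eta}\) denotes probability under the noisy oracle. The toy distribution, the labelling rule and all function names are hypothetical and serve only as an illustration; the noisy oracle is modelled by flipping each label independently with probability \(\eta\).

```python
from fractions import Fraction

# Hypothetical stand-in for the label "alpha" of a target map cell.
ALPHA = "alpha"


def chi_for_literal(i, negated):
    """chi_z for z = b_i (negated=False) or z = not b_i (negated=True):
    equals 1 iff x_i falsifies the literal and the label a is 1."""
    v = 1 if negated else 0

    def chi(x, a):
        return 1 if (x[i] == v and a == 1) else 0

    return chi


def noise_free_probability(dist, concept, chi):
    """P_chi under the noise-free oracle: labels are concept(x)."""
    return sum(w * chi(x, concept(x)) for x, w in dist.items())


def noisy_statistics(dist, concept, chi, eta):
    """Exact p_1 and the two noisy-oracle probabilities.

    Each label is flipped independently with probability eta.
    Returns (p1, P[chi=1 and x in X1], P[chi=1 and x in X2]).
    """
    p1 = q1 = q2 = Fraction(0)
    for x, w in dist.items():
        a = concept(x)
        if chi(x, 0) != chi(x, 1):      # x belongs to X1
            p1 += w
            q1 += w * ((1 - eta) * chi(x, a) + eta * chi(x, 1 - a))
        else:                           # x belongs to X2: label irrelevant
            q2 += w * chi(x, a)
    return p1, q1, q2


def corrected_probability(p1, q1, q2, eta):
    """Recover the noise-free P_chi from noisy statistics and eta."""
    return (q1 - eta * p1) / (1 - 2 * eta) + q2


if __name__ == "__main__":
    dist = {
        (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 4),
        (1, 0): Fraction(1, 8), (1, 1): Fraction(1, 4),
        (ALPHA, 1): Fraction(1, 8), (0, ALPHA): Fraction(1, 8),
    }
    concept = lambda x: 1 if x[1] == 1 else 0   # toy labelling rule
    eta = Fraction(1, 5)
    for i in range(2):
        for negated in (False, True):
            chi = chi_for_literal(i, negated)
            stats = noisy_statistics(dist, concept, chi, eta)
            print((i, negated),
                  noise_free_probability(dist, concept, chi),
                  corrected_probability(*stats, eta))
```

In this sketch the true \(\eta\) is passed to `corrected_probability`; in the setting of this appendix \(\eta\) is unknown, which is the difficulty addressed next.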
Assume \(z=l_{i}\) for some literal \(l_{i}\in\{b_{i},\neg b_{i}\}\) of boolean variable \(b_{i}\). Also, assume \(\mathbf{x}=(x_{1},\ldots,x_{T})\in \mathbf{X}^{\{0,1,\alpha\}}_{T}\) and \(a\in\{0,1\}\). For \(l_{i}=b_{i}\), we define \(\chi_{z}(\mathbf{x},a)\) as assuming a value of one iff \(x_{i}=0\) and \(a=1\). For \(l_{i}=\neg b_{i}\), we define \(\chi_{z}(\mathbf{x},a)\) as assuming a value of one iff \(x_{i}=1\) and \(a=1\). By Lemmas 2 and 3 we see that in order to obtain a sufficiently accurate estimate \(\hat{P}_{\chi_{z}}\) for \(P_{\chi_{z}}\) we will need to obtain sufficiently good estimates \(\hat{p}_{1}\), \(\hat{P}_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{1})]\), \(\hat{P}_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{2})]\), \(\hat{\eta}\) for the probabilities \(p_{1}\), \(P_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{1})]\), \(P_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{2})]\) and \(\eta\) respectively. While we see that by Lemma 2 we could obtain arbitrarily good estimates for the probabilities \(p_{1}\), \(P_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{1})]\) and \(P_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{2})]\), it is not clear how to obtain sufficiently good estimates for \(\eta\) so that we can obtain arbitrarily good estimates for \(P_{\chi_{z}}\) (recall that the exact value of \(\eta\) is unknown). The lemma below provides a solution.
Lemma 4
Assume that we know an upper bound \(0\leq \eta_{0}<\frac{1}{2}\) on the otherwise unknown noise rate \(\eta\) of the corrupt target map oracle \(\mathit{EX}(\mathcal{\overline{D}}(T),c,\eta)\). Assume that we have access to the probabilities \(p_{0}(z)\) for all \(2T\) literals \(z\). Also, assume that for any \(0<\epsilon', \delta'<\frac{1}{2}\), and any of the \(2T\) literals \(z\), we can efficiently find estimates \(\hat{p}_{1}\), \(\hat{P}_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{1})]\), \(\hat{P}_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{2})]\) so that with confidence at least \(1-\delta'\), \(|p_{1}-\hat{p}_{1}|\leq \epsilon'\), with confidence at least \(1-\delta'\), \(|P_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{1})]-\hat{P}_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{1})]|\leq \epsilon'\) and with confidence at least \(1-\delta'\), we have \(|P_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{2})]-\hat{P}_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{2})]|\leq \epsilon'\). Then, there exists an algorithm \(\Lambda\), such that for any \(0<\epsilon\leq \frac{1}{8T}\), and any \(0<\delta<\frac{1}{2}\), \(\Lambda\) outputs for each of the \(2T\) literals \(z\), an estimate \(\hat{P}_{\chi_{z}}\) for \(P_{\chi_{z}}\), which satisfies \(|P_{\chi_{z}}-\hat{P}_{\chi_{z}}|\leq \epsilon\) with confidence at least \(1-\delta\). 
If each invocation of the oracle takes unit time to complete, the algorithm has a running time that lies in \(\Theta(\zeta_{1}(\epsilon,T,\eta_{0})(6T\zeta_{2}(\epsilon',\delta')+\zeta_{2}(\epsilon'',\delta'')))\), where \(\epsilon'=\frac{\epsilon}{27}\), \(\epsilon''=\frac{\epsilon (1-2\eta_{0})}{2}\), \(\delta'=\delta''=\frac{\delta}{(6T+1)\zeta_{1}(\epsilon,T,\eta_{0})}\), and \(\zeta_{1},\zeta_{2}\) are defined so that \(\zeta_{1}(\hat{\epsilon},T,\eta_{0})=\frac{c_{0}T}{\hat{\epsilon}(1-2\eta_{0})^{2}}\), and \(\zeta_{2}(\hat{\epsilon},\hat{\delta})=\frac{1}{\hat{\epsilon}^{2}}\log(\frac{2}{\hat{\delta}})\) for some constant \(c_{0}>0\) and any \(0<\hat{\epsilon},\hat{\delta}<\frac{1}{2}\). Also the algorithm calls the oracle \(\zeta_{1}(\epsilon,T,\eta_{0})(6T\zeta_{2}(\epsilon',\delta')+\zeta_{2}(\epsilon'',\delta''))\) times, which is a polynomial with respect to \(T\), \(\frac{1}{\epsilon}\), \(\frac{1}{\delta}\), and \(\frac{1}{1-2\eta_{0}}\).
Proof
This is proven by Kearns (1993), where an iterative and efficient hypothesize-and-test algorithm for learning in the presence of noisy statistical queries is described. The only notable difference is our use of partial concepts, but it is straightforward to see that the results still hold in our case. We briefly overview the algorithm’s behaviour. The algorithm iterates \(\zeta_{1}(\epsilon,T,\eta_{0})\) times. For each of the \(\zeta_{1}(\epsilon,T,\eta_{0})\) iterations of the loop, the algorithm hypothesizes an estimated value \(\hat{\eta}\) for \(\eta\), and based on this hypothesis, it estimates the probability values described next. For each iteration it uses the target map oracle to acquire \(\zeta_{2}(\epsilon',\delta')\) examples for each of the probabilities that we are approximating, as indicated in the lemma (for a total of \(6T\zeta_{2}(\epsilon',\delta')\) examples per iteration), making it possible to recalculate for each iteration the estimates \(\hat{p}_{1}\), \(\hat{P}_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{1})]\), \(\hat{P}_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[(\chi_{z}(\mathbf{x},a)=1)\wedge (\mathbf{x}\in \mathbf{X}_{2})]\), with error at most \(\frac{\epsilon}{27}\) and confidence at least \(1-\frac{\delta}{(6T+1)\zeta_{1}(\epsilon,T,\eta_{0})}\) for each probability. Each iteration’s resulting probability estimates are stored in an array. The estimated probabilities that were just acquired in the current iteration are used to define a CNF formula \(h\) for the current iteration, as the conjunction of all literals \(z\) that satisfy \(p_{0}(z)>\epsilon\) and \(\hat{P}_{\chi_{z}}\leq \epsilon\). 
Given the \(h\) defined in the current iteration, by calling the oracle another \(\zeta_{2}(\epsilon'',\delta'')\) times we obtain an estimate for another probability \(P_{\langle\mathbf{x},a\rangle\in \mathit{EX}(\mathcal{\overline{D}},c,\eta)}[\overline{h}(\mathbf{x})=a]\), so that the estimate has an error of at most \(\epsilon''\) and confidence at least \(1-\delta''\). This last result is also stored in an array. A procedure that has polynomial running time with respect to \(T\), \(\frac{1}{\epsilon}\), \(\frac{1}{\delta}\), \(\frac{1}{1-2\eta_{0}}\) is applied to the stored results of these \(\zeta_{1}(\epsilon,T,\eta_{0})\) iterations to obtain the desired approximation to \(P_{\chi_{z}}\) for all \(2T\) literals \(z\). This last procedure outputs the desired approximation by choosing the stored result that corresponds to one of the \(\zeta_{1}(\epsilon,T,\eta_{0})\) hypothesized values for \(\hat{\eta}\), which minimizes a certain error metric, thus allowing us to also output the optimal \(\hat{\eta}\) that is sufficiently close to \(\eta\). □
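To give a feel for how the oracle-call count of Lemma 4 scales, the Python sketch below evaluates \(\zeta_{1}(\epsilon,T,\eta_{0})(6T\zeta_{2}(\epsilon',\delta')+\zeta_{2}(\epsilon'',\delta''))\). It is only a numerical illustration under assumptions: the factors involving \(\eta_{0}\) are taken to be \((1-2\eta_{0})\), \(\log\) is taken to be the natural logarithm, and the unspecified constant \(c_{0}\) is exposed as a parameter whose default value of 1 is an arbitrary placeholder. The function names are hypothetical.

```python
import math


def zeta1(eps_hat, T, eta0, c0=1.0):
    """zeta_1 = c0 * T / (eps_hat * (1 - 2*eta0)^2).

    c0 is the unspecified positive constant of Lemma 4; the default
    value 1.0 is a placeholder, not a value given in the text.
    """
    return c0 * T / (eps_hat * (1.0 - 2.0 * eta0) ** 2)


def zeta2(eps_hat, delta_hat):
    """zeta_2 = (1/eps_hat^2) * ln(2/delta_hat), as in Lemma 2."""
    return math.log(2.0 / delta_hat) / eps_hat ** 2


def oracle_calls(eps, delta, T, eta0, c0=1.0):
    """Number of oracle calls zeta1 * (6T*zeta2(eps', d') + zeta2(eps'', d''))."""
    if not (0.0 < eps <= 1.0 / (8 * T)):
        raise ValueError("Lemma 4 is stated for 0 < eps <= 1/(8T)")
    if not (0.0 < delta < 0.5 and 0.0 <= eta0 < 0.5):
        raise ValueError("need 0 < delta < 1/2 and 0 <= eta0 < 1/2")
    iterations = zeta1(eps, T, eta0, c0)
    eps1 = eps / 27.0                        # epsilon'
    eps2 = eps * (1.0 - 2.0 * eta0) / 2.0    # epsilon''
    d = delta / ((6 * T + 1) * iterations)   # delta' = delta''
    return iterations * (6 * T * zeta2(eps1, d) + zeta2(eps2, d))


if __name__ == "__main__":
    T, eps, delta = 2, 0.05, 0.1
    for eta0 in (0.0, 0.1, 0.25, 0.4, 0.45):
        print(f"eta0={eta0:4.2f}  calls ~ {oracle_calls(eps, delta, T, eta0):.3e}")
```

Under these assumptions the count grows as \(\eta_{0}\) approaches \(\frac{1}{2}\), since \(\zeta_{1}\) increases while \(\epsilon''\) and \(\delta'\) shrink. This matches the statement that the number of calls is polynomial in \(\frac{1}{1-2\eta_{0}}\).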
The above lemma allows us to circumvent the problem of not knowing the noise rate \(\eta\) in Eq. (7), as long as we estimate the above described probabilities with sufficient error and confidence, by using a sufficient number of examples (Lemma 2). An advantage of the above algorithm is that there is no need to know a priori the upper bound \(\eta_{0}\), as it can be efficiently discovered (through a binary search, for example), since by assumption \(\frac{1}{1-2\eta}\) is bounded by a polynomial function of \(T\), \(\frac{1}{\epsilon}\), \(\frac{1}{\delta}\), \(n_{4}\) and the hypothesized \(\hat{\eta}\) minimizes a certain error metric. For simplicity, and without any loss of generality, we assume that such an upper bound \(\eta_{0}\) is known a priori.
A.2 Estimating the Significant and Harmful Literals’ Probabilities with a Corrupt Target Map Oracle
As discussed in Sect. A.1, for each literal l _{ i } we determine with a minimum degree of confidence, whether the literal is significant and harmful. We need to show that for all \(0<\epsilon,\delta<\frac{1}{2}\) and for all \(0\leq \eta<\frac{1}{2}\), if the above minimum confidence levels are chosen to be sufficiently high, a candidate hypothesis h for the object localization problem (Theorem 3) that is formed as a conjunction of all significant literals that are not harmful, can satisfy with confidence at least 1−δ, \(\mathit{error}(\mathcal{\overline{D}}(T),\overline{h},\overline{c})\leq \epsilon\), where \(\overline{c}\) and \(\overline{h}\) are the partial oracles of the target concept \(c\in \mathcal{C}_{T}\) and a concept \(h\in\bigcup^{T}_{i=0}\overline{\mathcal{C}}_{i}\) respectively.
In Sect. A.1 we defined \(\hat{p}_{0}(\cdot)\) and \(\hat{p}_{\bar{c}}(\cdot)\) and outlined some of the approaches for calculating them, assuming we had at our disposal a corrupt target map oracle \(\mathit{EX}(\mathcal{\overline{D}}(T),\allowbreak c,\eta)\). Notice, however, that Theorem 3 does not assume that such an oracle is at our disposal. So we also need to discuss a methodology for simulating a corrupt target map oracle.
As indicated in Sect. A.1, the confidence in using \(\hat{p}_{0}(l_{i})\) to approximate p _{0}(l _{ i }) (for any i∈{1,…,T} and any l _{ i }∈{b _{ i },¬b _{ i }}) is independent of the noise rate η. Thus, by Lemma 2 we know that for all \(0<\delta_{z},\epsilon_{z}<\frac{1}{2}\), (where for notational convenience z=l _{ i }) if we use the corrupt target map oracle \(\mathit{EX}(\mathcal{\overline{D}}(T),c,\eta)\) to acquire \(\frac{1}{\epsilon^{2}_{z}}\log(\frac{2}{\delta_{z}})\) examples, and use these examples to estimate p _{0}(l _{ i }), then with confidence at least 1−δ _{ z } the error in using \(\hat{p}_{0}(l_{i})\) to estimate p _{0}(l _{ i }) will be less than ϵ _{ z }. By Lemma 2, if for a literal l _{ i } and some \(0<\hat{\epsilon}_{z}<\frac{1}{2}\) our estimates \(\hat{p}_{0}(\cdot)\) indicate that \(\hat{p}_{0}(l_{i})\leq \hat{\epsilon}_{z}\), then \(p_{0}(l_{i})\leq \hat{\epsilon}_{z}+\epsilon_{z}\) with probability at least 1−δ _{ z }. Equivalently, \(p_{0}(l_{i})>\hat{\epsilon}_{z}+\epsilon_{z}\) with probability at most δ _{ z }.
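The sample count quoted from Lemma 2 can be illustrated with a minimal sketch. It assumes that `log` denotes the natural logarithm (as in a Hoeffding-type bound) and that p _{0}(l _{ i }) is the probability that the literal evaluates to 0 on a random example; neither assumption is stated in this section. The function names `lemma2_sample_size` and `estimate_p0` and the callable `draw_literal_value` are hypothetical and serve only the illustration.

```python
import math


def lemma2_sample_size(eps_z: float, delta_z: float) -> int:
    """Return ceil((1 / eps_z^2) * ln(2 / delta_z)), the count quoted in the text.

    Assumption: 'log' in the text is the natural logarithm.
    """
    if not (0.0 < eps_z < 0.5 and 0.0 < delta_z < 0.5):
        raise ValueError("eps_z and delta_z must lie in (0, 1/2)")
    return math.ceil(math.log(2.0 / delta_z) / eps_z ** 2)


def estimate_p0(draw_literal_value, m: int) -> float:
    """Empirical frequency with which a literal evaluates to 0 over m examples.

    draw_literal_value is a hypothetical callable returning the literal's
    value (0 or 1) on one freshly drawn example.
    """
    zeros = sum(1 for _ in range(m) if draw_literal_value() == 0)
    return zeros / m
```

With ϵ _{ z }=δ _{ z }=0.1, for instance, `lemma2_sample_size` asks for 300 examples per literal; note that this count does not involve the noise rate η.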
However, the confidence in using \(\hat{p}_{\bar{c}}(l_{i})\) to approximate \(p_{\bar{c}}(l_{i})\) is dependent on the oracle’s largely unknown noise rate η, and thus its known upper bound η _{0} has to be taken into account when estimating a sufficient number of examples that must be returned by the corrupt target map oracle \(\mathit{EX}(\mathcal{\overline{D}}(T),c,\eta)\) so that Theorem 3 is satisfied. To this end, by Lemmas 3 and 4, we know that for all \(0<\delta_{z},\epsilon_{z}<\frac{1}{2}\), we can use the corrupt target map oracle \(\mathit{EX}(\mathcal{\overline{D}}(T),c,\eta)\) to acquire ζ _{1}(ϵ _{ z },T,η _{0})(6Tζ _{2}(ϵ′,δ′)+ζ _{2}(ϵ″,δ″)) examples, where \(\epsilon'=\frac{\epsilon_{z}}{27}\), \(\epsilon''=\frac{\epsilon_{z}(1-2\eta_{0})}{2}\) and \(\delta'=\delta''=\frac{\delta_{z}}{(6T+1)\zeta_{1}(\epsilon_{z},T,\eta_{0})}\) as per Lemma 4, so that with confidence at least 1−δ _{ z }, the error in using \(\hat{p}_{\bar{c}}(l_{i})\) to estimate \(p_{\bar{c}}(l_{i})\) is at most ϵ _{ z }. Thus, if for a literal l _{ i } and some \(0<\hat{\epsilon}_{z}<\frac{1}{2}\) our estimates \(\hat{p}_{\bar{c}}(\cdot)\) indicate that \(\hat{p}_{\bar{c}}(l_{i})\leq \hat{\epsilon}_{z}\), then \(p_{\bar{c}}(l_{i})\leq \hat{\epsilon}_{z}+\epsilon_{z}\) with probability at least 1−δ _{ z }. Equivalently, \(p_{\bar{c}}(l_{i})>\hat{\epsilon}_{z}+\epsilon_{z}\) with probability at most δ _{ z }. In the remainder of this section and in Sect. A.3, we will show that \(\hat{\epsilon}_{z}=\frac{\epsilon}{8T}\), \(\epsilon_{z}=\frac{\epsilon}{8T}\), \(\delta_{z}=\frac{\delta}{4T}\) suffice when using a simulated oracle \(\mathit{EX}(\mathcal{\overline{D}}(T),c,\eta)\).
We saw in Sect. A.1 that within the context of Theorem 3, if \(c\in \mathcal{C}_{T}\) and \(h\in \cup^{T}_{i=0}\overline{\mathcal{C}}_{i}\), then an error under partial concept \(\overline{h}\) could occur if and only if \(\overline{h}(\mathbf{x})=1\wedge \overline{c}(\mathbf{x})=0\) or \(\overline{h}(\mathbf{x})=0\wedge \overline{c}(\mathbf{x})=1\) occurs. As it was previously indicated, \(\overline{h}(\mathbf{x})=1\wedge \overline{c}(\mathbf{x})=0\) can only occur by a literal in c(⋅) that evaluates to zero under x and does not exist in h(⋅). Similarly, the occurrence of event \(\overline{h}(\mathbf{x})=0\wedge \overline{c}(\mathbf{x})=1\) implies the existence in h(⋅) of a literal which is absent in c(⋅) which also evaluates to zero under x. Notice that for an arbitrary hypothesis h, the occurrence of event \(\mathit{error}(\mathcal{\overline{D}},\overline{h},\overline{c})>\epsilon\) implies the occurrence of event \(\{P_{\mathbf{x}\in\overline{\mathcal{D}}}[A(l_{i}),B(l_{i},\mathbf{x})]>\frac{\epsilon}{4T}\}\cup\{P_{\mathbf{x}\in\overline{\mathcal{D}}}[A'(l_{i}),B'(l_{i},\mathbf{x})]>\frac{\epsilon}{4T}\}\) for at least one of the 2T literals z=l _{ i } that could be formed over the T cells constituting the target map. It is easy to see that this is the case, because, otherwise, for all literals the events \(\{P_{\mathbf{x}\in\overline{\mathcal{D}}}[A(l_{i}),B(l_{i},\mathbf{x})]\leq \frac{\epsilon}{4T}\}\) and \(\{P_{\mathbf{x}\in\overline{\mathcal{D}}}[A'(l_{i}),B'(l_{i},\mathbf{x})]\leq \frac{\epsilon}{4T}\}\) occur, implying that \(\mathit{error}(\mathcal{\overline{D}},\overline{h},\overline{c})\leq \epsilon\) as we saw in Sect. A.1. Assume P _{ h }[⋅] denotes the probability of an event that depends on a hypothesis h, with h constructed using the methodologies previously described which output the approximation to a conjunction of significant literals that are not harmful. 
We want to show that for \(0<\epsilon,\delta<\frac{1}{2}\), we can use a corrupt partial oracle \(\mathit{EX}(\mathcal{\overline{D}},c,\eta)\) so that \(P_{h}[\mathit{error}(\mathcal{\overline{D}},\overline{h},\overline{c})\leq \epsilon]\geq 1-\delta\). Equivalently, we want to show that \(P_{h}[\mathit{error}(\mathcal{\overline{D}},\overline{h}, \overline{c})>\epsilon]\leq \delta\). By the previous argument, \(P_{h}[\mathit{error}(\mathcal{\overline{D}}, \overline{h},\overline{c})>\epsilon]\leq \sum_{l_{i}}P_{h}[\{P_{\mathbf{x}\in\overline{\mathcal{D}}}[A(l_{i}), B(l_{i},\mathbf{x})]>\frac{\epsilon}{4T}\}]+ \sum_{l_{i}}P_{h}[\{P_{\mathbf{x}\in\overline{\mathcal{D}}}[A'(l_{i}),B'(l_{i},\mathbf{x})]>\frac{\epsilon}{4T}\}]\), where the summations are taken over the 2T possible literals l _{ i }. But also notice that \(P_{\mathbf{x}\in\overline{\mathcal{D}}}[A(l_{i}), B(l_{i},\mathbf{x})]\leq P_{\mathbf{x}\in\overline{\mathcal{D}}}[B(l_{i},\mathbf{x})\mid A(l_{i})]\). Furthermore, we have \(P_{\mathbf{x}\in\overline{\mathcal{D}}}[A'(l_{i}),B'(l_{i},\mathbf{x})]\leq P_{\mathbf{x}\in\overline{\mathcal{D}}}[B'(l_{i},\mathbf{x})\mid A'(l_{i})]\). Thus, to prove Theorem 3 it suffices to show that for all literals l _{ i }, it is possible to use a corrupt partial oracle \(\mathit{EX}(\mathcal{\overline{D}},c,\eta)\) so that \(P_{h}[\{P_{\mathbf{x}\in\overline{\mathcal{D}}}[B(l_{i},\mathbf{x})\mid A(l_{i})]>\frac{\epsilon}{4T}\}]\leq \frac{\delta}{4T}\) and \(P_{h}[\{P_{\mathbf{x}\in\overline{\mathcal{D}}}[B'(l_{i},\mathbf{x})\mid A'(l_{i})]>\frac{\epsilon}{4T}\}]\leq \frac{\delta}{4T}\), since then \(P_{h}[\mathit{error}(\mathcal{\overline{D}},\overline{h},\overline{c})>\epsilon]\leq \delta\). As we will see next, the constraints set upon the construction of h implicitly define the distribution out of which h is sampled. We will see in Sect.
A.3 that because we are simulating the oracle (implying that for some invocations, the oracle might return invalid examples which do not satisfy the oracle’s properties), we must enforce stricter bounds on the confidence levels of \(\hat{p}_{0}(l_{i})\), \(\hat{p}_{\bar{c}}(l_{i})\), as compared to when all the examples used to estimate these probabilities are acquired from a non-simulated oracle.
Assuming we have at our disposal a non-simulated corrupt target map oracle \(\mathit{EX}(\mathcal{\overline{D}},c,\eta)\), where c corresponds to the target map cells containing the centroids of the single-template-objects, we can construct approximation functions \(\hat{p}_{0}(\cdot)\), \(\hat{p}_{\bar{c}}(\cdot)\) as previously described. By Lemma 2, if we acquire at least \(\frac{(8T)^{2}}{\epsilon^{2}}\log(\frac{16T}{\delta})\) examples to estimate p _{0}(l _{ i }) for each l _{ i }, and include in h any literal l _{ i } satisfying \(\hat{p}_{0}(l_{i})\leq\frac{\epsilon}{8T}\), then with probability at most \(\frac{\delta}{8T}\), \(p_{0}(l_{i})>\frac{\epsilon}{8T}+\frac{\epsilon}{8T}=\frac{\epsilon}{4T}\) for any literal l _{ i } that is absent from hypothesis h(⋅) but present in c(⋅), which implies \(P_{h}[\{P_{\mathbf{x}\in\mathcal{\overline{D}}}[B(l_{i},\mathbf{x})\mid A(l_{i})]>\frac{\epsilon}{4T}\}]\leq \frac{\delta}{8T}\). As pointed out earlier, it is possible to use Lemma 2 because the problem of estimating the significant literals does not depend on the noise rate η. So we see that \(m_{1}=2T\frac{(8T)^{2}}{\epsilon^{2}}\log(\frac{16T}{\delta})\) examples suffice to estimate p _{0}(l _{ i }) for all the 2T literals.
Similarly, from Lemmas 3 and 4 we see that if \(\epsilon'=\frac{\epsilon}{27}\cdot \frac{1}{8T}\), \(\epsilon''=\frac{\epsilon(1-2\eta_{0})}{2}\cdot \frac{1}{8T}\), \(\delta'=\delta''=\frac{\delta/8T}{(6T+1)\zeta_{1}(\epsilon/8T,T,\eta_{0})}\), and we acquire m _{2}=ζ _{1}(ϵ/8T,T,η _{0})(6T+1)max(ζ _{2}(ϵ′,δ′),ζ _{2}(ϵ″,δ″)) examples from \(\mathit{EX}(\mathcal{\overline{D}},c,\eta)\) to produce the estimates \(\hat{p}_{\bar{c}}(l_{i})\) (for all 2T literals l _{ i }, by using the algorithm overviewed in Lemma 4), then, if any literal l _{ i } in h satisfies \(\hat{p}_{\bar{c}}(l_{i})\leq \frac{\epsilon}{8T}\), we know that with probability at most \(\frac{\delta}{8T}\) we will have \(p_{\bar{c}}(l_{i})>\frac{\epsilon}{8T}+\frac{\epsilon}{8T}=\frac{\epsilon}{4T}\) for any literal l _{ i } present in h(⋅) but absent from c(⋅). In other words, \(P_{h}[\{P_{\mathbf{x}\in\mathcal{\overline{D}}}[B'(l_{i},\mathbf{x})\mid A'(l_{i})]>\frac{\epsilon}{4T}\}]\leq \frac{\delta}{8T}\), as desired. These stricter bounds imply that \(P_{h}[\mathit{error}(\mathcal{\overline{D}},\overline{h},\overline{c})\leq \epsilon]\geq 1-\frac{\delta}{2}\). As we show next, these stricter bounds on the confidence are needed because we are simulating the oracle \(\mathit{EX}(\mathcal{\overline{D}},c,\eta)\). It is straightforward to verify that m _{2} is upper bounded by a polynomial function of T, \(\frac{1}{\epsilon}\), \(\frac{1}{\delta}\) and \(\frac{1}{1-2\eta_{0}}\).
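As a check of the bookkeeping in this section (a worked restatement of the union bound used above, not an additional result), summing the per-literal failure probabilities over the 2T literals and over the two types of error gives

\[
\underbrace{2T\cdot\frac{\delta}{8T}}_{\text{literals in } c \text{ but not in } h}+\underbrace{2T\cdot\frac{\delta}{8T}}_{\text{literals in } h \text{ but not in } c}=\frac{\delta}{2},
\qquad
2T\cdot\frac{\epsilon}{4T}+2T\cdot\frac{\epsilon}{4T}=\epsilon .
\]

The first identity accounts for the confidence \(1-\frac{\delta}{2}\), and the second for the total error ϵ that results when every per-literal probability is at most \(\frac{\epsilon}{4T}\).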
A.3 Simulating a Corrupt Target Map Oracle with a Purposive Sampling Strategy
In the previous section we showed that as long as we know an upper bound \(0\leq \eta_{0}<\frac{1}{2}\) on the true noise rate η of \(\mathit{EX}(\mathcal{\overline{D}},c,\eta)\), it suffices to acquire m _{1}+m _{2} examples from a corrupt target map oracle \(\mathit{EX}(\mathcal{\overline{D}}(T),c,\eta)\) so that for any \(c\in\mathcal{C}_{T}\), within time Θ(m _{1}+m _{2}) we can find a hypothesis \(h\in \bigcup^{T}_{i=0}\overline{\mathcal{C}}_{i}\) that satisfies \(\mathit{error}(\mathcal{\overline{D}},\overline{h},\overline{c})\leq \epsilon\) with confidence at least \(1-\frac{\delta}{2}\). Notice that m _{1} and m _{2}, as defined in the previous section, are polynomial functions of T, \(\frac{1}{\epsilon}\), \(\frac{1}{\delta}\) and \(\frac{1}{1-2\eta_{0}}\). We now wish to investigate the conditions under which the tools that Theorem 3 states we have at our disposal allow us to simulate a corrupt target map oracle for each one of its m _{1}+m _{2} invocations.
For notational simplicity, let p correspond to the true value of any one of the 6T+1 probabilities that need to be approximated during each of the ζ _{1}(ϵ,T,η _{0}) iterations of Lemma 4. Similarly, let \(\hat{p}(\eta;m_{3})\) denote a random variable of the approximation to the probability p, built using m _{3} examples acquired from the oracle, as per Lemma 2. Let \(\epsilon_{p}=\min(\frac{\epsilon}{27}\frac{1}{8T}, \frac{\epsilon(1-2\eta_{0})}{2}\frac{1}{8T})\) and \(\delta_{p}=\frac{\delta/4T}{(6T+1)\zeta_{1}(\epsilon/8T,T,\eta_{0})}\). Notice that this ϵ _{ p } is sufficiently small to guarantee that for any of the 6T+1 probabilities that are estimated during each iteration of Lemma 4, the desired error and confidence bounds will be satisfied. As discussed in the previous section, we want to simulate the corrupt target map oracle so that \(P[|\hat{p}(\eta;m_{3})-p|>\epsilon_{p}]\leq\delta_{p}\) where \(P[|\hat{p}(\eta;m_{3})-p|>\epsilon_{p}]\) denotes the probability that the random variable \(\hat{p}(\eta;m_{3})\) satisfies the event \(|\hat{p}(\eta;m_{3})-p|>\epsilon_{p}\). Define the event \(A''_{z}(m_{3})\) as follows:

\(A''_{z}(m_{3})\): All m _{3} examples used to calculate \(\hat{p}(\eta;m_{3})\) are sampled from \(\mathit{EX}(\overline{\mathcal{D}}(T),c,\eta)\).
Thus, we see that by Bayes’ theorem and by conditioning on \(A''_{z}(m_{3})\) and its negation, \(P[|\hat{p}(\eta;m_{3})-p|>\epsilon_{p}]\leq P[\{|\hat{p}(\eta;m_{3})-p|>\epsilon_{p}\}\mid A''_{z}(m_{3})]+P[\neg A''_{z}(m_{3})]\).
Thus, by Lemma 2, and as previously derived, we see that if \(m_{3}=\lceil\frac{1}{\epsilon_{p}^{2}}\log(\frac{4}{\delta_{p}})\rceil\) then \(P[\{|\hat{p}(\eta;m_{3})-p|>\epsilon_{p}\}\mid{}A''_{z}(m_{3})]\leq\frac{\delta_{p}}{2}\). So it suffices to describe an algorithm so that \(P[\neg A''_{z}(m_{3})]\leq\frac{\delta_{p}}{2}\) holds, since then \(P[|\hat{p}(\eta;m_{3})-p|>\epsilon_{p}]\leq\delta_{p}\). This will show that we can successfully simulate the corrupt target map oracle given the tools that we can assume to have at our disposal by Theorem 3. Define event E _{ z }(i,m _{3}) as follows:

E _{ z }(i,m _{3}): The ith of the m _{3} examples used to calculate \(\hat{p}(\eta;m_{3})\) is not sampled from \(\mathit{EX}(\mathcal{\overline{D}}(T),c,\eta)\).
We can see that \(P[\neg A''_{z}(m_{3})]=P[\bigcup^{m_{3}}_{i=1}E_{z}(i,m_{3})]\leq \sum^{m_{3}}_{i=1}P[E_{z}(i,m_{3})]\). We will show that with a purposive sampling strategy we can efficiently sample the search space so that \(P[E_{z}(i,m_{3})]\leq\frac{\delta_{p}}{2m_{3}}\) for i=1,…,m _{3}, and for all the 2T possible literals z.
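To make the accounting explicit, the requirements above can be chained as follows (a restatement of the argument, writing \(|\hat{p}(\eta;m_{3})-p|\) for the absolute error of the estimate):

\[
P\big[|\hat{p}(\eta;m_{3})-p|>\epsilon_{p}\big]
\leq P\big[|\hat{p}(\eta;m_{3})-p|>\epsilon_{p}\mid A''_{z}(m_{3})\big]+\sum^{m_{3}}_{i=1}P[E_{z}(i,m_{3})]
\leq \frac{\delta_{p}}{2}+m_{3}\cdot\frac{\delta_{p}}{2m_{3}}=\delta_{p}.
\]

Half of the failure budget δ _{ p } is thus spent on the estimation error under a valid oracle, and the other half is divided equally among the m _{3} simulated oracle invocations.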
Let \(\eta'_{0}=\frac{\eta}{T}\). Notice that \(\frac{1}{1-2\eta'_{0}}=\frac{T}{T-2\eta}\) and that \(\frac{1}{T-2\eta}\leq \frac{1}{1-2\eta}\leq \frac{1}{1-2\eta_{0}}\) if η _{0} is an upper bound on the noise rate η of the corrupt target map oracle. Thus, we see that if we present an algorithm that with respect to \(\frac{1}{1-2\eta'_{0}}\) has a polynomial running time, the contribution of \(\frac{1}{1-2\eta'_{0}}\) to the running time will also be polynomial in terms of T and \(\frac{1}{1-2\eta_{0}}\). We will provide an algorithm that allows us to sample the search region of each of the T target map cells, in order to determine with an error of at most \(\eta'_{0}\) and with a confidence of at least \(1-\frac{\delta_{p}}{2m_{3}T}\) whether or not the cell contains the centroid of any instance of the k single-template-objects O(t _{1}),…,O(t _{ k }) that constitute the object we are searching for. Equivalently, for each cell j, the algorithm will output a literal l _{ j }=b _{ j } if the algorithm has deduced that cell j contains an instance of the object, and will output literal l _{ j }=¬b _{ j } if the algorithm has deduced that cell j does not contain an instance of the object. For each literal l _{ j } that is output, with confidence at least \(1-\frac{\delta_{p}}{2m_{3}T}\), and with an independent probability of error of at most \(\eta'_{0}\), literal l _{ j } is incorrect, and the correct literal is ¬l _{ j }. Then, we see that with confidence at least \(1-\frac{\delta_{p}}{2m_{3}}\) concept \(\hat{c}=l_{1}\wedge l_{2}\wedge\cdots\wedge l_{T}\) denotes a T−CNF formula which is incorrect in at least one of its literals with a probability of at most \(\eta'_{0}T=\eta<\frac{1}{2}\).
We can thus use \(\hat{c}\) to simulate \(\mathit{EX}(\overline{\mathcal{D}}(T),c,\eta)\): For every sample x output by \(\overline{\mathcal{D}}(T)\), and assuming \(\overline{\hat{c}}\) is the partial concept of \(\hat{c}\), we output \(\langle\mathbf{x},\overline{\hat{c}}(\mathbf{x})\rangle\), which is equivalent to acquiring an independent sample from oracle \(\mathit{EX}(\overline{\mathcal{D}}(T),c,\eta)\), since as we will see, the label of each example \(\langle\mathbf{x},\overline{\hat{c}}(\mathbf{x})\rangle\) is independently corrupted with probability at most η, as desired. Notice that we cannot output \(\hat{c}\) as our final answer, since we have no guarantees that it satisfies the given error and confidence constraints of Theorem 3 for an arbitrary target map distribution \(\overline{D}(T)\): On a randomly created \(\hat{c}\) and for any \(l_{i}\in \hat{c}\), the probability that l _{ i } is the correct literal for cell i, does not necessarily lie in \((\frac{1}{2},1]\), due to the effect on the total probability of error that is induced by the confidence bound \(\frac{\delta_{p}}{2m_{3}T}\) and the error rate \(\eta'_{0}\) defined above. The unknown noise rate of the corrupt feature detection oracle, whose value does not depend on ϵ, further complicates the problem.
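The step from a per-cell error of at most \(\eta'_{0}=\frac{\eta}{T}\) to an overall error of at most η can be illustrated with a small sketch. It assumes the worst case in which every cell's literal is flipped independently with probability exactly η/T; the function names and the boolean-list encoding of the target map are hypothetical choices made only for this illustration.

```python
import random


def prob_some_literal_wrong(eta: float, T: int) -> float:
    """Exact probability that at least one of T literals is wrong when each
    literal is flipped independently with probability eta / T."""
    per_cell = eta / T
    return 1.0 - (1.0 - per_cell) ** T


def simulate_noisy_target_map(true_cells, eta, rng):
    """Return a copy of the target map in which each cell's literal is
    flipped independently with probability eta / T (worst-case assumption)."""
    T = len(true_cells)
    per_cell = eta / T
    return [(not b) if rng.random() < per_cell else b for b in true_cells]


if __name__ == "__main__":
    rng = random.Random(0)
    cells = [rng.random() < 0.2 for _ in range(50)]
    noisy = simulate_noisy_target_map(cells, eta=0.3, rng=rng)
    flipped = sum(a != b for a, b in zip(cells, noisy))
    print("flipped literals:", flipped)
    print("P[some literal wrong]:", prob_some_literal_wrong(0.3, 50))
```

Since \(1-(1-\frac{\eta}{T})^{T}\leq \eta\) for every T≥1, the value returned by `prob_some_literal_wrong` never exceeds η, which agrees with the bound \(\eta'_{0}T=\eta\) obtained above by a union bound.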
As indicated in Theorem 3, we assume that the object detection concept class contains objects that are composed of single-template-objects of the form \(\mathbf{O}(t)=(\mathcal{F}(t),\mathcal{D}(t),\allowbreak\epsilon(t),\mathcal{G}_{o_{1}(t)},\mathcal{P}_{o_{2}(t)},\mathcal{S}_{o_{3}(t)},L_{t},\theta_{t})\), where o _{1}(t), o _{2}(t), o _{3}(t) are upper-bounded for any single-template-object t we are dealing with. As per Definition 31 and Theorem 3, we are provided with a sensor \(\overline{\varGamma}=(\mathcal{G}_{n_{1}},\mathcal{P}_{n_{2}},\mathcal{S}_{n_{3}},\overline{\mathcal{V}}_{n_{4}}, \phi_{n_{2},L_{t}},\allowbreak \gamma(|\mathcal{P}_{n_{2}}|^{\frac{1}{3}}L_{t}))\) (where \(n_{2}=\lceil 3\lg(2\lfloor\frac{L_{c}+\overline{L}_{t}}{2L_{t}}\rfloor+1)\rceil\), \(\overline{L}_{t}=(|\mathcal{P}_{o_{2}(t)}|^{\frac{1}{3}}-1)L_{t}\), n _{1}=o _{1}(t) and n _{3}=o _{3}(t)) that defines the object detection concept class \(\mathcal{C}(\overline{\varGamma},\mathcal{M}(\bar{l},\overline{\varGamma}, \mathcal{R}_{1},\mathcal{R}_{2}))\) containing the object detection concept whose translated instances we wish to localize in the search space. Notice that the encoding length of any feature position lying in the search space SSP is \(n'_{2}\in\mathcal{O}(n_{2}+\lg(T))\).
As previously discussed, we want to demonstrate that \(P[E_{z}(i,m_{3})]\leq\frac{\delta_{p}}{2m_{3}}\), assuming z=l _{ q } for q∈{1,…,T}, l _{ q }∈{b _{ q },¬b _{ q }}. We previously indicated that it suffices to provide an algorithm which discovers for each of the T cells, with confidence at least \(1-\frac{\delta_{p}}{2m_{3}T}\) and with error at most \(\eta'_{0}\), whether the cell has any instance of the k single-template-objects lying inside it. Let \(\delta_{s}=\frac{\delta_{p}}{2m_{3}T}\). Within the context of Theorem 3, and by using a purposive sampling strategy, consider the use of active sensors \(\varGamma'(\varGamma,\overline{\varGamma},\mathcal{D}(\mathcal{V}_{n}),j)\) for each j∈{1,…,T}, where \(\mathcal{D}(\mathcal{V}_{n})\) assigns a uniform distribution to each sensor state in \(\mathcal{V}_{n}\), to obtain a certain number of samples. Let \(c_{1}=|\mathcal{G}_{n_{1}}|\cdot |\mathcal{P}_{n_{2}}|\cdot |\mathcal{S}_{n_{3}}|\) denote the number of features in each cell’s search space. Furthermore, let \(\mathcal{B}\) denote the elements in \(\mathcal{P}_{n_{2}}\) which map into the cell volume γ(L _{ c }) under \(\phi_{n_{2},L_{t}}\). Also let \(c_{2}=|\mathcal{B}|\approx\lceil\frac{L_{c}}{L_{t}}+1\rceil^{3}\) denote the number of elements in \(\mathcal{B}\). Notice that c _{2}≤c _{1}. By Theorem 1, we see that if we acquire \(m_{4}=\varTheta(\frac{c_{1}}{\epsilon_{0}}\log(\frac{3}{\delta_{s}})+\frac{c_{1}d_{1}}{\epsilon_{0}}\log(\frac{c_{1}}{\epsilon_{0}}))\) samples using this active sensor, where \(d_{1}=\mathit{VCD}[\mathcal{M}_{1}(\bar{l},\overline{\varGamma},\mathcal{R}_{1},\mathcal{R}'_{1})]\), and since we are sampling with a uniform probability distribution, then at least \((1-\frac{\delta_{s}}{3})\times 100\) percent of the time we will have acquired enough samples so that our visibility algorithm can create a “sufficiently good” invertible visibility representation σ _{1}∈{0,1}^{l}.
This representation subsumes the bounding boxes of any of the single-template-objects that are centred anywhere in cell j, so that with confidence at least \(1-\frac{\delta_{s}}{3}\), concept \(h_{\alpha}\triangleq \mathcal{R}_{1}(\sigma_{1})\) is incorrect on at most \(\frac{\epsilon_{0}}{c_{1}}\times 100\) percent of all the sensor states which map into features lying in the search space of cell j. This in turn implies that for any one of the c _{1} features, at most ϵ _{0}×100 percent of the sensor states from which the feature can be sensed map to an incorrect visibility truth label under concept h _{ α }. Notice that c _{1} grows at an exponential rate if n _{1}, n _{2} and n _{3} are allowed to grow without a bound, which is why in Theorem 3 we assume the polynomial upper bound for c _{1}. In practice this means that the value of c _{1} will have to be fairly “small” in that it does not grow very fast as the overall search space size increases. In other words, the region or cell search space that we attend to in order to construct a representation has to be fairly small in terms of the encoding length of the features it contains. As we will see later, if all features are always visible, the problem is simplified and we do not need to place this upper bound on n _{1}, n _{2}, n _{3}.
As previously discussed, for the k single-template-objects O(t _{1}),…,O(t _{ k }) defining the object we are searching for, we can associate single-template-object detection concepts \(c_{t_{1}},\ldots,c_{t_{k}}\) which are equal to 1 iff any instance of the corresponding single-template-object is centred somewhere in volume γ(L _{ c }). The disjunction of these concepts (\(c_{\mathtt{union}}=c_{t_{1}}\vee\cdots\vee c_{t_{k}}\)) defines the object whose translated instances we are searching for. For each single-template-object t _{ y } (where y∈{1,…,k}), we are provided with a distribution \(\mathcal{D}(t_{y})\) defining the distribution with respect to which the features in \(\mathcal{F}(t_{y})\) are sampled when determining the presence of the single-template-object. Let \(\overline{v}_{c}\in\overline{\mathcal{V}}_{n_{4}}\) be some arbitrary constant element. Furthermore, let \(p'\in\mathcal{P}_{n_{2}}\) denote any feature position which satisfies \(\phi_{n_{2},L_{t}}(p')\in\gamma(L_{c})\). We define a sensor state distribution \(\mathcal{D}'(t_{y},\mathcal{V}_{n},p')\) (where y∈{1,…,k} denotes a single-template-object), so that the probability of sampling \((g,p,s,\overline{v}_{c})\in\mathcal{V}_{n}\) from distribution \(\mathcal{D}'(t_{y},\mathcal{V}_{n},p')\) equals the probability of sampling (g,p″,s) from \(\mathcal{D}(t_{y})\), where \(p''=\phi^{-1}_{o_{2}(t),L_{t}}(\phi_{n_{2},L_{t}}(p)-\phi_{n_{2},L_{t}}(p'))\). Of course, if \(\phi_{n_{2},L_{t}}(p)-\phi_{n_{2},L_{t}}(p')\) does not belong to the domain of \(\phi^{-1}_{o_{2}(t),L_{t}}\), then the probability of sampling \((g,p,s,\overline{v}_{c})\) is zero. In other words, any sample vector returned by \(\mathcal{D}'(t_{y},\mathcal{V}_{n},p')\) is guaranteed to contain \(\overline{v}_{c}\) as its fourth vector entry. Notice that \(\lambda(\mathcal{D}'(t_{y},\mathcal{V}_{n},p'))\) denotes the distribution which returns the feature corresponding to any sensor state returned by \(\mathcal{D}'(t_{y},\mathcal{V}_{n},p')\).
Furthermore, let P′ assign a uniform probability of \(\frac{1}{c_{2}}\) to every element in \(\mathcal{B}\). We can use the invertibility of the representation σ _{1} that was just derived, to define a function inv(f,σ _{1}) which returns a sensor state in \(\mathcal{V}_{n}\) from which feature \(f\in\mathcal{G}_{n_{1}}\times\mathcal{P}_{n_{2}}\times\mathcal{S}_{n_{3}}\) is visible according to σ _{1} (if one exists). Otherwise, inv(f,σ _{1}) returns an arbitrary sensor state in \(\mathcal{V}_{n}\) which maps to feature f according to λ(⋅). Then, we see that \(\mathit{inv}(\lambda(\mathcal{D}'(t_{y},\mathcal{V}_{n},P')),\sigma_{1})\) is a distribution/random variable returning sensor states which map to features positioned inside object bounding boxes centred in γ(L _{ c }).
Let Y be a uniform random variable over {1,…,k} and assume Y is independent of P′ such that ∀y∈{1,…,k} and \(\forall p'\in\mathcal{B}\), \(P[Y=y,P'=p']=\frac{1}{kc_{2}}\). Assume that for any target map cell j, we have at our disposal the active sensor \(\varGamma'(\varGamma,\overline{\varGamma},\mathit{inv}(\lambda(\mathcal{D}'(t_{Y},\mathcal{V}_{n},P')),\sigma_{1}),j)\) and the corresponding feature detection oracle \(\mathit{EX}(\varGamma', \mu'_{\alpha},\mu',\frac{\eta'}{s})\) where \((\mu'_{\alpha},\mu')\) is the unknown target scene defined in Theorem 3, and \(\frac{\eta'}{s}\) is the unknown noise-rate of Theorem 3. Furthermore, assume that whenever the oracle returns an example labelled with an α (i.e., a ‘not visible’ label), we change the label to a zero (a ‘feature not present’ label). By Lemma 5, since P′, Y are uniform and independent, and since we are using an ϵ _{0}-occluded scene representation, we see that if we acquire at least \(m_{5}=\varTheta(\frac{c_{2}k}{c_{\mathtt{cam}}\epsilon_{m}}\log(\frac{3}{\delta_{s}})+\frac{c_{2}kd_{2}}{c_{\mathtt{cam}}\epsilon_{m}}\log(\frac{c_{2}k}{c_{\mathtt{cam}}\epsilon_{m}}))\) examples (where ϵ _{ m }=min{ϵ(t _{1}),…,ϵ(t _{ k })}), we will have built a representation σ _{2} using Λ _{ fb }, so that with confidence at least \(1-\frac{\delta_{s}}{3}\), and ∀y∈{1,…,k}, \(\forall p'\in\mathcal{B}\), the feature binding concept \(h\triangleq \mathcal{R}'_{2}(\sigma_{2})\) is incorrect with respect to the target scene μ with a probability of at most c _{ cam } ϵ(t _{ y }) on a random sample from \(\lambda(\mathcal{D}'(t_{y},\mathcal{V}_{n},p'))\), assuming a noise-free feature detection oracle and that \(d_{2}=\mathit{VCD}[\mathcal{M}_{2}(\bar{l},\overline{\varGamma},\mathcal{R}_{2},\mathcal{R}'_{2})]\). Thus, the representation σ _{2} is of sufficiently high quality to detect any instance of a single-template-object that is centred anywhere in cell j (see Definition 37).
Notice that since c _{2} is constant for constant cell sizes and constant sampling distances \(L_{t_{1}},\ldots,L_{t_{k}}\), then m _{5} is a polynomial function of k, \(\frac{1}{\epsilon}\), \(\frac{1}{\delta}\), \(\frac{1}{1-2\eta_{0}}\), T, \(\frac{1}{\epsilon_{0}}\), \(\frac{1}{c_{\mathtt{cam}}}\), \(\frac{1}{\epsilon(t_{1})},\ldots,\frac{1}{\epsilon(t_{k})}\). Notice that the smaller d _{2} is with respect to \(n_{1},n_{2},n_{3},n_{4},\overline{l}\), the smaller m _{5} becomes. While n _{4} does not affect the input lengths for feature binding concepts, it affects feature visibility and consequently it affects the presence of features in the scene, so it is important that d _{2} grows slowly with n _{4}. Ideally there is very little occlusion in the scene and n _{4} does not affect d _{2} significantly.
Assume that for target map cell j we have built feature binding representation σ _{2}∈{0,1}^{l}. Thus, we see, as described in Sect. A.1, with probability at most \(\frac{\delta_{s}}{3}\) the event \(\bigcup_{y,p'}\{\mathit{error}(\mathit{inv}(\lambda(\mathcal{D}'(t_{y},\mathcal{V}_{n},p')),\sigma_{1}),\mathcal{R}(\sigma_{2}),\mu)>c_{\mathtt{cam}}\epsilon(t_{y})\}\) will hold, where μ is the true feature binding concept that generated the features in cell j’s search space, y∈{1,…,k} and \(p'\in\mathcal{B}\). That is, with probability at least \(1-(\frac{\delta_{s}}{3}+\frac{\delta_{s}}{3})=1-\frac{2\delta_{s}}{3}\), we will have built a representation σ _{1} and a representation σ _{2} that includes a sufficient number of features so that any instance of an object centred somewhere in cell j is detectable. For any \(0<\eta_{1}<\frac{1}{2}\), we see that if the corrupt feature detection oracle’s noise rate is bounded by \(\frac{\eta_{1}}{(m_{4}+m_{5})}\), then with probability at most η _{1}, there will be an error in at least one of the training set’s labels. To put it differently, if s=m _{4}+m _{5} in Theorem 3, then with probability at least 1−η _{1}, the feature representation built will not have used an incorrectly labelled training example. A final observation is that if \(\mathit{VCD}[\mathcal{M}_{1}(\bar{l},\varGamma,\mathcal{R}_{1},\mathcal{R}'_{1})]\), \(\mathit{VCD}[\mathcal{M}_{2}(\bar{l},\varGamma,\mathcal{R}_{2},\mathcal{R}'_{2})]\) were exponential functions of their input parameters \(\bar{l}\), n _{1}, n _{2}, n _{3}, n _{4}, the above described error bounds \(\frac{\eta_{1}}{(m_{4}+m_{5})}\) would be significantly smaller (since m _{4}+m _{5} would be larger), thus making related vision algorithms significantly more fragile. This demonstrates the need for “easy” scene representations, in order to mitigate the inevitable errors of low-level feature detectors in challenging scenes (e.g., ‘camouflaged’ objects).
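The role of the bound \(\frac{\eta_{1}}{m_{4}+m_{5}}\) can be seen from a union bound over the training examples (a sketch, assuming that each of the s=m _{4}+m _{5} labels is corrupted with probability at most \(\frac{\eta_{1}}{s}\)):

\[
P[\text{at least one of the } s \text{ labels is corrupted}]\leq \sum^{s}_{i=1}\frac{\eta_{1}}{s}=\eta_{1}.
\]

For a fixed η _{1}, the admissible per-example noise rate \(\frac{\eta_{1}}{s}\) therefore shrinks in inverse proportion to the number of examples s that the representation classes require.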
Assume that before the search starts, we have at our disposal a sequence of c _{3}=T(m _{1}+m _{2}) object detection concept representations \(\sigma'_{1},\ldots,\sigma'_{c_{3}}\in\mathcal{H}_{g(l;\mathbf {p})}\) with independent errors (see Theorems 2 and 3), that can detect any instance of the c _{ cam }-camouflaged single-template-objects O(t _{1}),…,O(t _{ k }) (i.e., can approximate c _{ union }), assuming the algorithms are given as input the appropriate scene representation. Assume the c _{3} object representations were trained on random scene samples from distribution \(\mathcal{D}(\mathcal{M}(\bar{l},\overline{\varGamma},\mathcal{R}_{1}, \mathcal{R}_{2}))\) (see Definition 39 and Theorem 3) so that for all i∈{1,…,c _{3}}, and with a confidence of at least \(1-\frac{\delta_{s}}{3}\), the resulting classification error of object detection concept \(\mathcal{R}_{\mathtt{classifier}}(\sigma'_{i})\) is less than or equal to \(\frac{\eta_{2}}{r}\) for some \(0<\eta_{2}<\frac{1}{2}\), where function r>1 was defined in Definition 39 and Theorem 3. Notice that since r is a polynomial function of T, \(\frac{1}{\epsilon}\), \(\frac{1}{\delta}\), n _{1}, n _{2}, n _{3} and n _{4}, the training is slowed down at most by a factor of Θ(rlog(r)) due to r (see Theorem 1), which is the price we pay for making the detectors capable of generalizing on target scene distributions that are not identical to the training distributions. It is easy to see that on a randomly generated target scene from \(\mathcal{D}(\mathbf{F},r)\) (see Theorem 3), the detection error on the correct/ground-truth representation corresponding to any target map cell’s search space is at most η _{2} with confidence at least \(1-\frac{\delta_{s}}{3}\).
If for each one of the T(m _{1}+m _{2}) times we need to build a representation of the features in the search space of one of the T cells, we use a distinct object detection concept representation (out of the set of c _{3} detectors) to detect whether the cell contains at least one of the k possible single-template-objects, the classification errors will remain independent. Notice however, that the above argument does not take into consideration the fact that we do not have access to the ground truth representation of the generated target scene, and we instead have to approximate the target scene representation by acquiring random samples (acquired with an active sensor) and applying a feature binding algorithm on these samples to build a good enough scene representation to detect any of the k single-template-objects, as previously described. This can change the distribution of scene representations we have at our disposal. This problem is circumvented as follows. During the training of the c _{3} detectors, for each random sample acquired from \(\mathcal{D}(\mathcal{M}(\bar{l},\overline{\varGamma},\mathcal{R}_{1},\mathcal{R}_{2}))\), we build a representation for the random search region using the corresponding active sensor \(\varGamma'(\varGamma,\overline{\varGamma},\mathit{inv}(\lambda(\mathcal{D}'(t_{Y},\mathcal{V}_{n},P')),\sigma_{1}),j)\), as previously described: by acquiring with the active sensor the same number of m _{5} samples as would be acquired during online search, and by retaining the representations which indeed have an error of at most \(\frac{c_{\mathtt{cam}}\epsilon_{m}}{c_{2}k}\), we can provide the object detection learning algorithm with a sufficient number of labelled scene representations so that with confidence at least \(1-\frac{\delta_{s}}{3}\) the detector has a classification error of at most \(\frac{\eta_{2}}{r}\) on such a random representation.
For a target scene sampled from \(\mathcal{D}(\mathbf{F},r)\), consider the approximated representation for a cell search region in this target scene. Given that the constructed representation from the target scene also satisfies the same error constraints, we see that if this constructed representation is provided as input to one of the \(c_{3}\) constructed detectors, then with confidence at least \(1-\frac{\delta_{s}}{3}\), the classification error for the existence of one of the targets \(t_{1},\ldots,t_{k}\) in the cell is at most \((\frac{\eta_{2}}{r})(r)=\eta_{2}\). This error bound holds since the object detector is trained by uniformly choosing a \(y\in\{1,\ldots,k\}\) which defines the sampling distribution with respect to which the search space is sampled (see Lemma 5). In other words, since we use one object detector for each of the \(T(m_{1}+m_{2})\) times that we need to invoke corrupt target map oracle \(\mathit{EX}(\mathcal{\overline{D}}(T),c,\eta)\), \(\overline{m}=c_{3}\) in Theorem 3 suffices. This shows that with a purposive sampling strategy we can achieve \(P[E_{z}(i,m_{3})]\leq \frac{\delta_{p}}{2m_{3}}\) for all \(i\in\{1,\ldots,m_{3}\}\), as wanted.
Thus, we see that for each of the \(T(m_{1}+m_{2})\) times we call an object detection oracle, with confidence at least \(1-\frac{\delta_{p}}{2m_{3}T}\) the worst case classification error is upper bounded by \(\eta_{1}+\eta_{2}\) (see Theorem 3), and by Lemma 1 we see that the classification error is also independent for each invocation of the corrupt target map oracle simulation. That is, as long as the noise-rates satisfy \(\eta_{1}+\eta_{2}\leq \eta'_{0}=\frac{\eta}{T}\leq \frac{\eta_{0}}{T}\), we can successfully simulate the corrupt target map oracle with a noise-rate that is upper bounded by \(\eta_{0}\), as desired. For example, \(\eta_{1}=\eta_{2}=\frac{\eta}{2T}\) gives sufficiently good bounds. We thus notice that the smaller T is, the larger \(\frac{\eta_{0}}{T}\) is, posing fewer restrictions on how good feature detection and each object detector must be. Notice that while a decrease in \(\eta_{2}\) in general speeds up online recognition, it also leads to an increase in the number of samples needed to train the corresponding object detectors, since then the error bound is \(\eta_{2}\) and the training speed scales at a rate of \(\frac{1}{\eta_{2}}\) (Theorem 1) rather than a rate of \(\frac{1}{1-2\eta_{2}}\). In conclusion, we have proven Theorem 3, where the total number of times that the active sensor needs to be called (due to an invocation of the corrupt feature detection oracle) is given by \(T(m_{1}+m_{2})(m_{4}+m_{5})\).
Lemma 5
Let \(0\leq\epsilon\leq 1\) and assume \(E(i)\) denotes an event that depends on the value of \(i\in\{1,\ldots,N\}\). If R is a random variable with a uniform distribution over \(\{1,\ldots,N\}\), then \(P[E(R)]\leq \frac{\epsilon}{N}\) implies that for all \(i\in\{1,\ldots,N\}\), \(P[E(R)\mid R=i]\leq\epsilon\).
Proof
By the law of total probability, \(P[E(R)]=\sum^{N}_{i=1}P[E(R)\mid R=i]P[R=i]\). Since R has a uniform distribution, \(P[R=i]=\frac{1}{N}\) for all i. Thus if \(\sum^{N}_{i=1}P[E(R)\mid R=i]P[R=i]\leq\frac{\epsilon}{N}\) then \(\sum^{N}_{i=1}P[E(R)\mid R=i]\leq\epsilon\), which, since every term of the sum is nonnegative, implies that for all \(i\in\{1,\ldots,N\}\), \(P[E(R)\mid R=i]\leq\epsilon\). □
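As an illustration only (it is not part of the original proof), the following Python sketch checks the implication of Lemma 5 on a small example with exact rational arithmetic. The function names and the example probabilities are hypothetical and chosen purely for illustration.

```python
from fractions import Fraction

def marginal_probability(conditionals):
    """P[E(R)] for R uniform over {1..N}, given P[E(R) | R = i] for each i."""
    n = len(conditionals)
    return sum(conditionals, Fraction(0)) / n

def lemma5_holds(conditionals, eps):
    """True when 'P[E(R)] <= eps/N implies every P[E(R) | R = i] <= eps' holds here."""
    n = len(conditionals)
    premise = marginal_probability(conditionals) <= eps / n
    conclusion = all(p <= eps for p in conditionals)
    return (not premise) or conclusion

# Hypothetical conditional probabilities for N = 4 values of R.
conditionals = [Fraction(1, 100), Fraction(0), Fraction(3, 100), Fraction(1, 100)]
eps = Fraction(1, 20)

if __name__ == "__main__":
    print("P[E(R)] =", marginal_probability(conditionals))
    print("Lemma 5 implication holds:", lemma5_holds(conditionals, eps))
```

In this example the premise holds with equality, and the largest conditional probability is still below \(\epsilon\), as the lemma predicts.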
We are now in a position to state the object recognition problem and prove that it is efficiently learnable, using our proof of Theorem 3.
Theorem 6
(The Object Recognition Problem is Efficiently Learnable Under a Purposive Sampling Strategy)
Consider N sets \(\mathbf{O}_{k_{1}},\ldots,\mathbf{O}_{k_{N}}\), where for every \(i\in\{1,\ldots,N\}\), set \(\mathbf{O}_{k_{i}}\) contains \(k_{i}\) single-template-objects that define an object detection concept \(c^{i}_{\mathtt{union}}\in\mathcal{C}(\overline{\varGamma},\mathcal{M}(\bar{l},\overline{\varGamma},\mathcal{R}_{1},\mathcal{R}_{2}))\), as per the formulation of \(\mathcal{C}(\overline{\varGamma}, \mathcal{M}(\bar{l},\overline{\varGamma},\mathcal{R}_{1},\mathcal{R}_{2}))\) in Theorem 3. If we expand Theorem 3 by adding the task of determining, for all \(i\in\{1,\ldots,N\}\) and under target map distribution \(\mathcal{\overline{D}}_{i}(T)\), the target map cells where at least one instance of a single-template-object contained in \(\mathbf{O}_{k_{i}}\) is centred, then under a purposive sampling strategy, the problem is efficiently learnable. We assume that the usual polynomial upper bounds on \(N,k_{1},\ldots,k_{N}\) apply.
Proof
The problem is solvable by applying N times the localization algorithm described in Appendix A. In more detail, for each cell \(j\in\{1,\ldots,T\}\) in the target map, let us define N boolean variables \(\{b_{j},b_{j+T},\ldots,b_{j+(N-1)T}\}\). Let us define distribution \(\mathcal{\overline{D}}(NT)\) over set \(\mathbf{X}^{\{0,1,\alpha\}}_{NT}\), where the probability of sampling \(\mathbf{x}=(x_{1},\ldots,x_{NT})\in \mathcal{\overline{D}}(NT)\) equals the probability of independently sampling \(\mathbf{x}^{1}=(x^{1}_{1},\ldots,x^{1}_{T})\in\mathcal{\overline{D}}_{1}(T),\ldots, \mathbf{x}^{N}=(x^{N}_{1},\ldots, x^{N}_{T})\in\mathcal{\overline{D}}_{N}(T)\) such that for all \(j\in\{1,\ldots,T\}\) we have that \(x_{j}=x^{1}_{j}\), \(x_{j+T}=x^{2}_{j}\), \(x_{j+2T}=x^{3}_{j},\ldots, x_{j+(N-1)T}=x^{N}_{j}\). It is straightforward to see that by using the algorithms discussed in Appendix A, for each object set \(\mathbf{O}_{k_{i}}\) we can find a TCNF formula \(f_{i}\), consisting of conjunctions of literals corresponding to the boolean variables \(\{b_{1+(i-1)T},b_{2+(i-1)T},\ldots,b_{iT}\}\), so that the probability that the error of \(f_{i}\) (the error in localizing \(\mathbf{O}_{k_{i}}\)) under distribution \(\mathcal{\overline{D}}(NT)\) is greater than \(\frac{\epsilon}{N}\), is at most \(\frac{\delta}{N}\). But if we let \(h=f_{1}\wedge f_{2}\wedge\cdots\wedge f_{N}\), then by the union bound over the N formulas, with confidence at least \(1-\delta\) the error of h under \(\mathcal{\overline{D}}(NT)\) is at most ϵ, as wanted. □
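The index bookkeeping used in the proof of Theorem 6 can be sketched as follows. This is a minimal illustration, assuming 0-based Python indices (so the entry \(x^{i}_{j}\) lands at position j + iT rather than \(j+(i-1)T\)); the helper names and the string stand-in for the label α are hypothetical.

```python
ALPHA = "alpha"  # hypothetical stand-in for the third label of {0, 1, alpha}

def combine_target_maps(maps):
    """Concatenate N target maps of T cells each into one vector of length N*T.

    With 0-based indices, maps[i][j] is stored at position j + i*T, which
    mirrors x_{j+(i-1)T} = x^i_j in the 1-based notation of the proof.
    """
    T = len(maps[0])
    if any(len(m) != T for m in maps):
        raise ValueError("all target maps must have the same number of cells")
    combined = []
    for m in maps:
        combined.extend(m)
    return combined

def split_index(index, T):
    """Map a 0-based position in the combined vector to (object set i, cell j)."""
    return divmod(index, T)

if __name__ == "__main__":
    maps = [[0, 1, ALPHA], [1, 1, 0]]  # N = 2 object sets, T = 3 cells
    combined = combine_target_maps(maps)
    print(combined)
    print(split_index(4, 3))
```

Each formula \(f_{i}\) then only reads the T consecutive positions belonging to object set i, which is why the N localization problems can be solved independently and combined by a conjunction.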
Appendix B: The Effects of a Passive Sampling Strategy on the Problem Complexity
We describe a passive sampling strategy for the object localization and recognition problem, which allows us to highlight the difficulties that arise if we do not follow a purposive approach to the sampling strategy. We consider a worst case scenario, where none of the cell search spaces intersect each other, thus maximizing \(|\mathcal{P}_{n'_{2}}|\) in extended sensor Γ (i.e., \(\lg(|\mathcal{P}_{n'_{2}}|)=\varTheta(n_{2}+\lg(T))\)).
As per Theorem 4, under a passive sampling strategy for the object localization problem, we have access to a single active sensor given by \(\varGamma'(\varGamma,\overline{\varGamma},\mathcal{D}(\mathcal{V}_{n}),P')\). Under a passive sampling strategy we have little purposive control over the features sampled by the sensor, since any feature and any sensor state in the search space has an equal chance of being sampled for each invocation of the active sensor, due to the uniform sensor state distribution \(\mathcal{D}(\mathcal{V}_{n})\), due to the uniform distribution of P′, and because none of the cell search spaces intersect. The question arises as to how this affects the efficiency of the corrupt target map oracle simulation. We demonstrate that the inability to control the sampling strategy by the use of prior knowledge, can add a significant layer of complexity to simulating the oracle with the desired error and confidence bounds described in previous sections. This provides good evidence that the addition of at least a certain degree of purposiveness in a search strategy can lead to significant improvements in search efficiency. Notice that empirical evidence from previous work (Rimey and Brown 1994; Wixson and Ballard 1994; Andreopoulos et al. 2011) demonstrates significant differences in the search efficiency between purposive and passive sampling strategies.
Assume we want to simulate an invocation of a corrupt target map oracle \(\mathit{EX}(\overline{\mathcal{D}}(T),c,\eta)\) so that for all 2T literals \(z=l_{q}\in\{b_{q},\neg b_{q}\}\), \(q\in\{1,\ldots,T\}\), we have \(P[E_{z}(i,m_{3})]<\frac{\delta_{p}}{2m_{3}}\), as defined in Sect. A.3. We can approach the problem of determining a good visibility representation in two ways. We could sample the entire generated scene until we are guaranteed, with confidence at least \(1-\frac{\delta_{p}}{6m_{3}}\), that the visibility error is at most \(\frac{\epsilon_{0}}{Tc_{1}}\), where \(Tc_{1}\) is an upper bound on the total number of features lying in our search space. Similarly to the purposive sampling strategy, this would guarantee that for any given feature, at most \(\epsilon_{0}\times 100\) percent of the sensor states which map to that feature have an incorrect visibility in the constructed representation. By Lemma 6 below, we see that the VC-dimension of such a scene has a tight upper bound determined by a function that grows linearly with T. By Theorem 1 we see that the necessary number of samples for a scene with VC-dimension d and error \(0<\epsilon<\frac{1}{2}\) is at least \(\varTheta(\frac{d}{\epsilon})\), implying that we need a number of samples that grows at least as fast as the VC-dimension of the search space. Section 4 shows, however, that constructing a single representation (whose size grows as T grows) for an arbitrarily large scene can lead to a fragile object detector.
Alternatively, and as we did with the purposive sampling strategy, we could invoke the active sensor a sufficient number of times so that with confidence at least \(1-\frac{\delta_{s}}{3}\) (recall \(\delta_{s}=\frac{\delta_{p}}{2m_{3}T}\)) we have independently and uniformly sampled inside any given cell \(j\in\{1,\ldots,T\}\) a sufficient number of times to obtain a good local representation of each cell search space. By Lemma 7 below we see that \(\log(\frac{1}{\delta})T+1\) samples suffice to obtain, with a confidence of at least \(1-\delta\) (for any \(0<\delta<\frac{1}{2}\)), a sample from inside a given target map cell’s search space. By Theorem 1, we see that if we sample \(m_{6}=\varTheta(\frac{c_{1}}{\epsilon_{0}}\log(\frac{3}{\delta_{s}})+\frac{c_{1}d_{1}}{\epsilon_{0}}\log(\frac{c_{1}}{\epsilon_{0}}))\) times inside a given cell search space, with confidence at least \(1-\frac{\delta_{s}}{3}\) we will have constructed a representation that allows us to determine the visible features of each viewpoint sufficiently accurately. Thus, by Lemma 7, if under the passive sampling strategy we call the active sensor \(m_{7}=\log(\frac{3m_{6}}{\delta_{s}})T+1\) times, with confidence at least \(1-\frac{\delta_{s}}{3m_{6}}\) we have sampled inside a given cell search space once. Thus, \(m_{6}m_{7}=m_{6}\log(\frac{3m_{6}}{\delta_{s}})T+m_{6}\) calls of the active sensor suffice so that with confidence at least \(1-\frac{\delta_{s}}{3}\) we have a visibility representation of the desired cell, similar to the algorithm described in Sect. A.3.
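To give a feel for how the passive sample count \(m_{6}m_{7}\) scales with T, the following Python sketch evaluates the two expressions above. It is a rough illustration only: the hidden constant in \(\varTheta(\cdot)\) is assumed to be 1, natural logarithms are assumed, and the parameter values in the usage loop are hypothetical rather than taken from the paper.

```python
import math

def passive_sample_counts(T, c1, d1, eps0, delta_s):
    """Return (m6, m7, m6 * m7), assuming the Theta(.) constant equals 1."""
    # m6: samples needed inside one cell to learn its visibility representation.
    m6 = math.ceil((c1 / eps0) * math.log(3 / delta_s)
                   + (c1 * d1 / eps0) * math.log(c1 / eps0))
    # m7: passive sensor calls needed to land inside the given cell once (Lemma 7).
    m7 = math.ceil(math.log(3 * m6 / delta_s) * T + 1)
    return m6, m7, m6 * m7

if __name__ == "__main__":
    # Hypothetical parameter values, used only to show the trend in T.
    for T in (10, 100, 1000):
        m6, m7, total = passive_sample_counts(T, c1=50, d1=8, eps0=0.1, delta_s=0.01)
        print(f"T={T}: m6={m6}, m7={m7}, m6*m7={total}")
```

Under these assumptions \(m_{6}\) does not depend on T, while \(m_{7}\), and hence the total number of sensor calls, grows roughly linearly with the number of cells.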
Notice that the above use of a passive sampling strategy increases the best known lower bound on the sufficient number of samples for constructing the representation of a cell search space, by a factor of at least \(\log(\frac{3m_{6}}{\delta_{s}})T+1\), since m _{6}>m _{4}, which is a nontrivial value. Unfortunately, in contrast to the purposive sampling strategy, we cannot use the constructed visibility representation to guide the sensor, since by definition, a passive sampling strategy uniformly samples every sensor state. Furthermore, construction of the feature binding representation requires binary labelled examples as to the presence or absence of the respective feature—recall that the feature detection oracle returns a ternary label from set {α,0,1}, and we do not have access to a distinct oracle for each of the visibility and feature binding concept classes.
The greatest difficulty, from a complexity perspective, arises in determining the feature binding representation (see Sect. A.3) with a confidence of at least \(1-\frac{\delta_{s}}{3k}\), for each of the k single-template-objects whose translated instances we are searching for. The difficulty lies in that we associate a sampling distribution \(\mathcal{D}(t_{y})\) with the features \(\mathcal{F}(t_{y})\) of each single-template-object \(t_{y}\). Our active sensor does not allow us to adjust the sampling distribution since, by definition, it samples each sensor state in \(\mathcal{V}_{n}\) with a uniform distribution and it samples inside each target map cell with a uniform distribution \(P'\). This gives rise to the question of how the upper bound on the necessary number of samples is affected.
Notice that the purposive sampling strategy allows us to observe a feature from its visible viewpoints when constructing the feature binding representation. The use of a passive sampling strategy, and the inability to inhibit the occluded/non-visible viewpoints during sampling, brings up the question of what happens if we attempt to directly approximate the function \(\pi(\cdot,\cdot)\) (Definitions 14 and 15) corresponding to any pair of concepts \((\mu_{\alpha},\mu)\in\mathcal{M}(\bar{l},\overline{\varGamma},\mathcal{R}_{1},\mathcal{R}_{2})\), in order to obtain an approximation for μ that can be used by the object detection algorithm. In other words, assume that \((h_{\alpha},h)\) is the approximating pair of concepts sampled from set \(\mathcal{M}_{12}=\mathcal{M}_{1}(\bar{l},\overline{\varGamma},\mathcal{R}_{1},\mathcal{R}'_{1})\times \mathcal{M}_{2}(\bar{l},\overline{\varGamma},\mathcal{R}_{2},\mathcal{R}'_{2})\), where \(\mathcal{M}(\bar{l},\overline{\varGamma},\mathcal{R}_{1},\mathcal{R}_{2})\subseteq \mathcal{M}_{12}\). Then, we want to find the number of samples that may be returned by the feature detection oracle so that we can approximate, sufficiently well, function \(\pi(\mu_{\alpha}(v),\mu(\lambda(v)))\) (Definitions 14 and 15) for any sensor state v returned by some sensor state distribution, so that the approximation h for μ is sufficiently good to detect the desired single-template-object. Notice that to approximate this function, it suffices to determine a pair of concepts \((h_{\alpha},h)\in\mathcal{M}_{12}\) that is sufficiently good.
Notice that by Lemma 8 below, we have \(\mathit{VCD}[\mathcal{M}_{12}]\leq \mathit{VCD}[\mathcal{M}_{1}(\bar{l},\overline{\varGamma},\mathcal{R}_{1},\mathcal{R}'_{1})]\times \mathit{VCD}[\mathcal{M}_{2}(\bar{l}, \overline{\varGamma}, \mathcal{R}_{2},\mathcal{R}'_{2})]\), where we define \(\mathit{VCD}[\mathcal{M}_{12}]\) as the cardinality of the largest set \(\mathbf{S}=\{v_{1},\ldots,v_{|\mathbf{S}|}\}\subseteq\mathcal{V}_{n}\) of sensor states such that for every vector \(\mathbf{b}\in\{\alpha,0,1\}^{|\mathbf{S}|}\) there exists \((h_{\alpha},h)\in\mathcal{M}_{12}\) such that \(\mathbf{b}=(\pi(h_{\alpha}(v_{1}),h(\lambda(v_{1}))),\ldots,\pi(h_{\alpha}(v_{|\mathbf{S}|}),h(\lambda(v_{|\mathbf{S}|}))))\).
By Lemma 5 we see that for any distribution \(\mathcal{D}(t_{y})\) and any feature binding concept μ of the scene, it suffices to construct an h that satisfies \(\mathit{error}(\lambda(\mathcal{D}'(t_{y},\mathcal{V}_{n}, P')),h,\mu)\leq \frac{c_{\mathtt{cam}}\epsilon(t_{y})}{c_{2}}\) (where \(P'\) assigns a uniform distribution to all the positions in \(\mathcal{P}_{n_{2}}\) which also lie in volume \(\gamma(L_{c})\), and \(c_{2}\) denotes the number of such distinct feature positions lying inside \(\gamma(L_{c})\), as defined in Sect. A.3), since this h would be sufficiently accurate to detect any instance of the target single-template-object \(t_{y}\) that is centred anywhere in cell j. The catch is that we have to use the uniform sensor state distribution provided by the passive sampling strategy in order to construct h. Notice that Theorem 1 only applies when the sampling distribution is the same as the distribution with respect to which the error is measured.
We overview a worst case scenario, that demonstrates the difficulties that arise by the use of the passive sampling strategy. Assume each feature is visible or nonoccluded from ϵ _{0}×100 percent of the sensor states that map to that feature, where \(0<\epsilon_{0}<\frac{1}{2}\). Also, assume \(\mathcal{D}'\) denotes a uniform distribution for the sensor states \(\mathcal{V}_{n}\). Also, assume \(\mathcal{D}(t_{y})\) assigns a uniform distribution to all the features in \(\mathcal{F}(t_{y})=\mathcal{G}_{o_{1}(t_{y})}\times\mathcal{P}_{o_{2}(t_{y})}\times\mathcal{S}_{o_{3}(t_{y})}\). Thus, if we acquire a sufficient number of examples 〈v,π(μ _{ α }(v),μ(λ(v)))〉, (where \(v\in\mathcal{D}'\)), we can use the visibility algorithm and the feature binding algorithm to construct a pair of hypotheses (h _{ α },h) satisfying \(\mathit{error}(\mathcal{D}',\pi(h_{\alpha},h(\lambda)),\pi(\mu_{\alpha}, \mu(\lambda)))\leq \frac{c_{\mathtt{cam}}\epsilon(t_{y})\epsilon_{0}}{c_{2}}\). Assume \(\mathcal{D}''\) denotes the distribution \(\mathcal{D}'\), but conditioned on the sensor states v from which the feature λ(v) is visible (i.e., for all \(v\in\mathcal{D}''\), π(μ _{ α }(v),μ(λ(v)))≠α). Since \(\mathcal{D}'\) is a uniform distribution, the probability of \(\mathcal{D}''\) returning a given sensor state v for which π(μ _{ α }(v),μ(λ(v)))≠π(h _{ α }(v),h(λ(v))), is scaled by a factor of as much as \(\frac{1}{\epsilon_{0}}\), as compared to the corresponding probability when v is sampled from distribution \(\mathcal{D}'\). Thus we see that \(\mathit{error}(\mathcal{D}',\pi(h_{\alpha},h(\lambda)), \pi(\mu_{\alpha}, \mu(\lambda)))\leq \frac{c_{\mathtt{cam}}\epsilon(t_{y})\epsilon_{0}}{c_{2}}\) implies that \(\mathit{error}(\mathcal{D}'',h(\lambda), \mu(\lambda))\leq \frac{c_{\mathtt{cam}}\epsilon(t_{y})}{c_{2}}\). 
But this in turn implies that for any of the \(c_{2}\) feature positions \(p'\in\mathcal{P}_{n}\) that also lie in target map cell volume \(\gamma(L_{c})\), \(\mathit{error}(\lambda(\mathcal{D}'(t_{y}, \mathcal{V}_{n},p')),h,\mu)\leq c_{\mathtt{cam}}\epsilon(t_{y})\), as wanted. Recall that \(d_{1}=\mathit{VCD}[\mathcal{M}_{1}(\bar{l},\overline{\varGamma}, \mathcal{R}_{1},\mathcal{R}'_{1})]\) and \(d_{2}=\mathit{VCD}[\mathcal{M}_{2}(\bar{l},\overline{\varGamma},\mathcal{R}_{2}, \mathcal{R}'_{2})]\). By Lemma 7, if \(m_{8}=\frac{c_{2}}{c_{\mathtt{cam}}\epsilon(t_{y})\epsilon_{0}}\log(\frac{3k}{\delta_{s}})+\frac{c_{2}d_{1}d_{2}}{c_{\mathtt{cam}}\epsilon(t_{y})\epsilon_{0}}\log(\frac{c_{2}}{c_{\mathtt{cam}}\epsilon(t_{y})\epsilon_{0}})\) and \(m_{9}=\log(\frac{3km_{8}}{\delta_{s}})T+1\), then \(m_{8}\cdot m_{9}\) is a sufficient number of examples that need to be acquired with the feature detection oracle under a passive sampling strategy in order to build, with confidence at least \(1-\frac{\delta_{s}}{3k}\), the concept pair \((h_{\alpha},h)\), and thus build the desired representation h for single-template-object \(t_{y}\) when its bounding cube is centred inside cell j.
However, if \(\mathcal{F}(t_{y})\neq \mathcal{G}_{o_{1}(t_{y})}\times\mathcal{P}_{o_{2}(t_{y})}\times\mathcal{S}_{o_{3}(t_{y})}\) or \(\mathcal{D}(t_{y})\) does not have a uniform distribution, an even greater, and unbounded, number of features may need to be sampled as \(o_{2}(t_{y})\) increases. For example, the use of polynomially related target map distributions (Definition 39) guarantees that we can train our object detector on a different scene representation distribution than the one generating our scene, thus only slowing down our algorithm by a polynomial function of r (see Definition 39). If function r were instead an exponential function of \(\frac{1}{\epsilon}\), \(\frac{1}{\delta}\), T, \(n_{1}\), \(n_{2}\), \(n_{3}\), \(n_{4}\), this would lead to an intractable slowdown in our algorithm. Similarly, under a passive sampling strategy, if distribution \(\mathcal{D}(t_{y})\) is arbitrary (not uniform), an intractable increase in the number of samples could result. Alternatively, if \(n_{4}\) is upper bounded by a polynomial function of \(\frac{1}{\epsilon}\), \(\frac{1}{\delta}\) and T (as \(n_{1}\), \(n_{2}\), \(n_{3}\) are), the function h could be approximated efficiently by a brute force approach. However, a purposive sampling strategy can find a sufficiently good h without making such an assumption. Thus we see that the constraints imposed by a passive sampling strategy give rise to a number of problems, making the detection, localization and recognition problems significantly more difficult to solve as compared to a purposive strategy.
Lemma 6
(VC-Dimension of a Target Scene Concept Class)
Assume Γ is the extended sensor of sensor \(\overline{\varGamma}\), as defined in Theorem 3. Assume set F contains all target scenes of sensor Γ that are generated by a scene representation \(\mathcal{M}(\bar{l},\overline{\varGamma},\mathcal{R}_{1},\mathcal{R}_{2})\). Let F _{1} be a set of concepts such that \(\mu'_{\alpha}\in\mathbf{F}_{1}\) iff \((\mu'_{\alpha},\mu')\in\mathbf{F}\) for some concept μ′. Similarly, let F _{2} be a set of concepts such that μ′∈F _{2} iff \((\mu'_{\alpha},\mu')\in\mathbf{F}\) for some concept \(\mu'_{\alpha}\). Let \(\mathcal{M}_{1}\) be a set of concepts such that \(\mu_{\alpha}\in\mathcal{M}_{1}\) iff \((\mu_{\alpha},\mu)\in\mathcal{M}(\bar{l},\overline{\varGamma},\mathcal{R}_{1},\mathcal{R}_{2})\) for some concept μ. Similarly, let \(\mathcal{M}_{2}\) be a set of concepts such that \(\mu\in\mathcal{M}_{2}\) iff \((\mu_{\alpha},\mu)\in\mathcal{M}(\bar{l},\overline{\varGamma},\mathcal{R}_{1},\mathcal{R}_{2})\) for some concept μ _{ α }. Then \(\mathit{VCD}[\mathbf{F}_{1}]\leq \varTheta(T\cdot \mathit{VCD}[\mathcal{M}_{1}])\) and \(\mathit{VCD}[\mathbf{F}_{2}]\leq \varTheta(T\cdot \mathit{VCD}[\mathcal{M}_{2}])\). Notice that these upper bounds are tight, in that there are target maps and scene representations under which these bounds are reached.
Proof
This follows directly from Definition 25. To see this, consider T concept classes \(\mathcal{C}'_{1},\ldots,\mathcal{C}'_{T}\) which satisfy \(d=\mathit{VCD}[\mathcal{C}'_{1}]=\cdots=\mathit{VCD}[\mathcal{C}'_{T}]\), where d denotes the VC-dimension of each concept class. Assume each concept in concept class \(\mathcal{C}'_{i}\), \(1\leq i\leq T\), has a domain \(\mathbf{X}_{i}\) that is defined such that \(\mathbf{X}_{i}\cap\mathbf{X}_{j}=\emptyset\) and \(|\mathbf{X}_{i}|=|\mathbf{X}_{j}|\) for \(1\leq i<j\leq T\). Assuming that every set \(\mathbf{X}_{i}\) has a constant encoding length, the bitwise encoding length of \(\bigcup^{T}_{j=1}\mathbf{X}_{j}\) is \(n\in\varTheta(\lg(T))\). For every \(1\leq i\leq T\), define concept class \(\mathcal{C}''_{i}\) as consisting of concepts with domain \(\bigcup^{T}_{j=1}\mathbf{X}_{j}\), such that \(|\mathcal{C}''_{i}|=|\mathcal{C}'_{i}|\) and for every \(c''\in\mathcal{C}''_{i}\) there exists a unique concept \(c'\in\mathcal{C}'_{i}\) such that \(c''(\mathbf{x})=c'(\mathbf{x})\) if \(\mathbf{x}\in\mathbf{X}_{i}\) and otherwise, if \(\mathbf{x}\not\in\mathbf{X}_{i}\), \(c''(\mathbf{x})=1\). It is straightforward to see that \(\mathit{VCD}[\mathcal{C}''_{i}]=\mathit{VCD}[\mathcal{C}'_{i}]\). We also see that if we generate a concept class \(\mathcal{C}\) which contains the logical conjunction of the concepts contained in each vector of set \(\mathcal{C}''_{1}\times\mathcal{C}''_{2}\times\cdots\times \mathcal{C}''_{T}\) (i.e., \((c''_{1},\ldots,c''_{T})\in\mathcal{C}''_{1}\times\mathcal{C}''_{2}\times\cdots\times \mathcal{C}''_{T}\) iff \(c''_{1}\wedge c''_{2}\wedge\cdots\wedge c''_{T}\in\mathcal{C}\)), where the domain of each concept in \(\mathcal{C}\) is \(\bigcup^{T}_{j=1}\mathbf{X}_{j}\), then \(\mathit{VCD}[\mathcal{C}]=\varTheta(T\cdot d)\). This is easy to notice using an inductive argument. When T=1, trivially \(\mathit{VCD}[\mathcal{C}]=d\). When we double the number of cells (T=2), we see that for any set of d concept inputs which \(\mathcal{C}\) shattered when T=1, there correspond another d distinct inputs which are shattered by \(\mathcal{C}\).
This implies that when T=2, \(\mathit{VCD}[\mathcal{C}]=2d\). By recursively repeating this argument as T increases, we see that \(\mathit{VCD}[\mathcal{C}]=\varTheta(T\cdot d)\). Assuming that \(d=\mathit{VCD}[\mathcal{M}_{1}]\) and \(\mathcal{C}=\mathbf{F}_{1}\), or \(d=\mathit{VCD}[\mathcal{M}_{2}]\) and \(\mathcal{C}=\mathbf{F}_{2}\), this proves the lemma. Notice that if \(\mathbf{X}_{i}\cap\mathbf{X}_{j}\neq\emptyset\), potentially a smaller number of concept inputs are shattered by \(\mathcal{C}\) as T increases, indicating that \(\varTheta(T\cdot d)\) is an upper bound on \(\mathit{VCD}[\mathcal{C}]\). □
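The construction in the proof of Lemma 6 can be checked by brute force on a toy instance. The sketch below is an illustration under stated assumptions, not the authors' procedure: each local class \(\mathcal{C}'_{i}\) is a hypothetical three-concept class of VC-dimension d=1 over a two-point domain, and all function names are invented for the example.

```python
from itertools import combinations, product

def vc_dimension(domain, concepts):
    """Size of the largest subset of `domain` shattered by `concepts`.

    Each concept is a dict mapping every domain point to 0 or 1.
    """
    best = 0
    for size in range(1, len(domain) + 1):
        shattered = any(
            len({tuple(c[x] for x in subset) for c in concepts}) == 2 ** size
            for subset in combinations(domain, size)
        )
        if not shattered:
            break  # no larger set can be shattered either
        best = size
    return best

def local_class(points):
    """Toy C'_i: the all-ones concept plus the indicator of each single point."""
    concepts = [{x: 1 for x in points}]
    for a in points:
        concepts.append({x: int(x == a) for x in points})
    return concepts

def conjunction_class(blocks):
    """Build C: conjunctions of one extended concept C''_i per disjoint block X_i."""
    domain = [x for block in blocks for x in block]
    # C''_i agrees with C'_i on X_i and outputs 1 outside X_i.
    extended = [[{x: c.get(x, 1) for x in domain} for c in local_class(block)]
                for block in blocks]
    concepts = [{x: min(c[x] for c in combo) for x in domain}
                for combo in product(*extended)]
    return domain, concepts

if __name__ == "__main__":
    for T in (1, 2, 3):
        blocks = [[(i, 0), (i, 1)] for i in range(T)]  # disjoint two-point domains
        domain, concepts = conjunction_class(blocks)
        print(f"T={T}: VCD = {vc_dimension(domain, concepts)}")
```

For this toy class the brute-force VC-dimension equals \(T\cdot d\) for each tested T, in line with the tightness claim of the lemma.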
The lemma above shows that in general, for any scene generated by a particular scene representation, the number of samples we need grows linearly with T in order to guarantee that we have PAC-learned a target scene. These arguments assume that any scene we encounter in practice is “translation invariant”, in that the complexity of any local scene region is on average equal to the complexity of any other local scene region present in the scene representation class. This brings up the question of how an increasing representation length of the scene might affect our detection algorithm, since the representation length increases with T. As we discuss in Sect. 4, when trying to satisfy a computational cost constraint, there are certain advantages in using compact representations as input to our object detector. This will underscore the importance of a divide-and-conquer approach to the localization problem, where we localize objects by building representations for smaller subsets of the scene, rather than constructing a single representation of the entire scene and analyzing it in “one-shot” to localize or detect the object. We discuss this more in Sects. 4 and 5.
Lemma 7
Let \(0<\delta<\frac{1}{2}\). Assume we are given a passive sampling strategy, defined over a target map with non-overlapping cell search spaces. If we call the active sensor at least \(n=\log(\frac{1}{\delta})T+1\) times, we are guaranteed with probability at least \(1-\delta\) that we will have acquired a sample from inside a single given target map cell at least once.
Proof
By the geometric distribution, and since the active sensor samples the sensor states uniformly, the probability that after n iterations we sample from inside a given target map cell for the first time is \((\frac{T-1}{T})^{n-1}\frac{1}{T}\), which decreases monotonically as n increases. But notice that \((\frac{T-1}{T})^{n-1}\frac{1}{T}=\delta\) iff \(n-1=\frac{\log(\frac{1}{\delta}\frac{1}{T})}{\log(\frac{T}{T-1})}\). Since \(\log(\frac{T}{T-1})\geq\frac{1}{T}\) for the natural logarithm and \(\log(\frac{1}{\delta}\frac{1}{T})\leq\log(\frac{1}{\delta})\), n is bounded by \(\log(\frac{1}{\delta})T+1\), which means that a number of samples polynomial in \(\frac{1}{\delta}\) and T suffices to satisfy the lemma. □
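The bound of Lemma 7 can also be sanity-checked numerically. The sketch below does not reproduce the derivation in the proof; it uses the complementary event (all n uniform draws miss the given cell, which has probability \((\frac{T-1}{T})^{n}\)) and assumes natural logarithms. The function names are hypothetical.

```python
import math

def passive_calls_needed(T, delta):
    """n = log(1/delta) * T + 1 from Lemma 7, rounded up (natural log assumed)."""
    return math.ceil(math.log(1 / delta) * T + 1)

def miss_probability(T, n):
    """Probability that n uniform draws over T disjoint cells all miss one fixed cell."""
    return ((T - 1) / T) ** n

if __name__ == "__main__":
    for T in (2, 10, 1000):
        for delta in (0.4, 0.1, 1e-3):
            n = passive_calls_needed(T, delta)
            print(f"T={T}, delta={delta}: n={n}, miss={miss_probability(T, n):.3g}")
```

In every tested case the miss probability stays below δ, since \((1-\frac{1}{T})^{n}\leq e^{-n/T}\leq\delta\) whenever \(n\geq\log(\frac{1}{\delta})T\).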
Lemma 8
Let \(\mathcal{F}=\mathcal{G}_{n_{1}}\times\mathcal{P}_{n_{2}}\times\mathcal{S}_{n_{3}}\) and \(\mathcal{V}_{n}=\mathcal{G}_{n_{1}}\times\mathcal{P}_{n_{2}}\times\mathcal{S}_{n_{3}}\times\overline{\mathcal{V}}_{n_{4}}\). Given a concept class \(\mathcal{C}\) where each concept \(c\in\mathcal{C}\) is of the form \(c:\mathcal{F}\rightarrow\{0,1\}\), if we define a new concept class \(\mathcal{C}_{\lambda}\) such that \(c\in\mathcal{C}\) iff \(c(\lambda)\in\mathcal{C}_{\lambda}\), where \(\lambda:\mathcal{V}_{n}\rightarrow\mathcal{F}\) is a surjective function, then the VC-dimensions of \(\mathcal{C}\) and \(\mathcal{C}_{\lambda}\) are identical.
Proof
By Definition 25, it is straightforward to see that the VC-dimension of \(\mathcal{C}_{\lambda}\) is at least as large as the VC-dimension of \(\mathcal{C}\). We need to prove that the VC-dimension of \(\mathcal{C}_{\lambda}\) is less than or equal to that of \(\mathcal{C}\). Assume the opposite. That is, the VC-dimension of \(\mathcal{C}_{\lambda}\) is greater than that of \(\mathcal{C}\). Assume \(S=\{v_{1},v_{2},\ldots,v_{|S|}\}\subseteq\mathcal{V}_{n}\) is the largest set of distinct sensor states that are shattered by \(\mathcal{C}_{\lambda}\), or in other words \(|S|=\mathit{VCD}[\mathcal{C}_{\lambda}]\). But this is equivalent to saying that there exist \(2^{|S|}\) distinct concepts in \(\mathcal{C}\) which shatter \(\{\lambda(v_{1}),\lambda(v_{2}),\ldots,\lambda(v_{|S|})\}\). This implies that for all \(i,j\in\{1,\ldots,|S|\}\) (where \(i\neq j\)), \(\lambda(v_{i})\neq\lambda(v_{j})\). But by the definition of the VC-dimension, this implies that \(\mathit{VCD}[\mathcal{C}]\geq |S|=\mathit{VCD}[\mathcal{C}_{\lambda}]\), a contradiction. □
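Lemma 8 can be illustrated on small finite classes by computing both VC-dimensions by brute force. The sketch below is a toy check only: the feature set, the sensor states (a feature paired with a viewpoint index), the map λ, and the two concept classes are all hypothetical choices made for the example.

```python
from itertools import combinations, product

def vc_dimension(points, concepts):
    """Size of the largest subset of `points` shattered by `concepts` (callables to {0, 1})."""
    best = 0
    for size in range(1, len(points) + 1):
        if any(len({tuple(c(x) for x in subset) for c in concepts}) == 2 ** size
               for subset in combinations(points, size)):
            best = size
        else:
            break  # no larger set can be shattered
    return best

def lift(concepts, lam):
    """Build C_lambda = {c(lambda(.)) : c in C}, defined over sensor states."""
    return [lambda v, c=c: c(lam(v)) for c in concepts]

# Hypothetical toy setting: 3 features, each observable from 2 viewpoints.
features = [0, 1, 2]
sensor_states = [(f, view) for f in features for view in range(2)]

def lam(v):
    """Surjective map from a sensor state to the feature it observes."""
    return v[0]

# Two toy concept classes over the features.
thresholds = [lambda f, t=t: int(f >= t) for t in range(4)]
all_labelings = [lambda f, bits=bits: bits[f] for bits in product((0, 1), repeat=3)]

if __name__ == "__main__":
    for name, cls in (("thresholds", thresholds), ("all labelings", all_labelings)):
        print(name, vc_dimension(features, cls),
              vc_dimension(sensor_states, lift(cls, lam)))
```

For both toy classes the two printed values coincide, because two sensor states that map to the same feature always receive the same label and so can never both belong to a shattered set.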
Cite this article
Andreopoulos, A., Tsotsos, J.K. A Computational Learning Theory of Active Object Recognition Under Uncertainty. Int J Comput Vis 101, 95–142 (2013). https://doi.org/10.1007/s11263-012-0551-6
Keywords
 Object recognition
 Visual search
 Active vision
 Attention
 Computational complexity of vision