Active Rare Class Discovery and Classification Using Dirichlet Processes


Classification is used to solve countless problems. Many real-world computer vision problems, such as visual surveillance, contain uninteresting but common classes alongside interesting but rare classes. The rare classes are often unknown, and need to be discovered whilst training a classifier. Given a data set, active learning selects the members within it to be labelled for the purpose of constructing a classifier, optimising the choice to get the best classifier for the least amount of effort. We propose an active learning method for scenarios with unknown, rare classes, where the problems of classification and rare class discovery need to be tackled jointly. By assuming a non-parametric prior on the data, the goals of new class discovery and classification refinement are automatically balanced, without any tunable parameters. The ability to work with any specific classifier is maintained, so it may be used with the technique most appropriate for the problem at hand. Results are provided for a large variety of problems, demonstrating superior performance.


  1. An implementation is available from

  2. Sometimes referred to as passive learning.

  3. It is not really solvable after this, as the classes have a lot of overlap, but it is sufficient to illustrate the inner workings of the presented approach, whilst reducing it to \(1D\) allows for a clean visualisation.

  4. We set this classifier parameter to 32.

  5. A density estimate that we hallucinate is the prior for the classification algorithm.

  6. Note that concentration cannot be calculated until at least two classes have been found, hence the jump in the graph at that time.

  7. 32 in all cases except for kdd99 and faces, where it is 24 and 16, respectively, due to their size.

  8. Balanced inlier rate is calculated as the average inlier rate for each class in the training set. Inlier rate is the number of correct classifications divided by the number of exemplars being classified. This can be interpreted as recall generalised to 3+ classes.

  9. Note that culling is for the entire data set, whilst separation into training and testing was purely random, so classes can have \(<\)10 entries in the pool.

  10. Whilst this strategy can always beat the presented approach, it does so by introducing a scale parameter, which has to be selected for each problem. This is inappropriate, as doing multiple runs to find the best parameter defeats the entire purpose of active learning.

  11. There is even an interesting human interface issue of presenting such priors to a non-expert, such that they can communicate what they already know about a specific problem.

  12. With KDE this is obtained by training several classifiers on bootstrap samples from the training set. This achieves the goal of measuring model variance, but damages performance, so a fully trained version is kept to do actual classification.
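
The balanced inlier rate of footnote 8 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `balanced_inlier_rate` and its arguments are hypothetical, and skipping training classes that have no exemplars among the items being classified is an assumption made here to avoid division by zero.

```python
def balanced_inlier_rate(true_labels, predicted_labels, training_classes):
    """Average, over the classes in the training set, of the per-class inlier rate.

    The inlier rate of a class is the number of its exemplars that were
    classified correctly divided by the number of its exemplars classified.
    """
    rates = []
    for cls in training_classes:
        indices = [i for i, t in enumerate(true_labels) if t == cls]
        if not indices:
            # Assumption: a class with no exemplars to classify is skipped.
            continue
        correct = sum(1 for i in indices if predicted_labels[i] == cls)
        rates.append(correct / len(indices))
    return sum(rates) / len(rates) if rates else 0.0


# Class "a": 3 of 4 correct; class "b": 1 of 2 correct.
truth = ["a", "a", "a", "a", "b", "b"]
guess = ["a", "a", "a", "b", "b", "a"]
print(balanced_inlier_rate(truth, guess, ["a", "b"]))
```

Because each class contributes equally to the average, a rare class that is always misclassified lowers the score as much as a common one would, which plain accuracy does not capture.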


  • Abe, N., & Mamitsuka, H. (1998). Query learning strategies using boosting and bagging. In Proceedings of the ICML (pp. 1–9).

  • Angluin, D. (1988). Queries and concept learning. Machine Learning, 2(4), 319–342.

  • Blackwell, D., & MacQueen, J. B. (1973). Ferguson distributions via Polya urn schemes. Annals of Statistics, 1(2), 353–355.

  • Boser, B. E., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. Computational Learning Theory, 5, 144–152.

  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

  • Cohn, D., Atlas, L., & Ladner, R. (1994). Improving generalization with active learning. Machine Learning, 15(2), 201–221.

  • Culotta, A., & McCallum, A. (2005). Reducing labeling effort for structured prediction tasks. In Proceedings of the National Conference on Artificial Intelligence (pp. 746–751).

  • Dagan, I., & Engelson, S. (1995). Committee-based sampling for training probabilistic classifiers. In Proceedings of the ICML (pp. 150–157).

  • Dupuit, J. (1952). On the measurement of the utility of public works. International Economic Papers, 2, 83–110.

  • Escobar, M. D., & West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430), 577–588.

  • Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2), 209–230.

  • Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188.

  • Frank, A., & Asuncion, A. (2010). UCI machine learning repository.

  • Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. PAMI, 6(6), 721–741.

  • Guillaumin, M., Verbeek, J., & Schmid, C. (2009). Is that you? Metric learning approaches for face identification. In Proceedings of the ICCV.

  • Haines, T. S. F., & Xiang, T. (2011). Active learning using Dirichlet processes for rare class discovery and classification. In Proceedings of the BMVC.

  • Han, J., & Bhanu, B. (2006). Individual recognition using gait energy image. Pattern Analysis and Machine Intelligence, 28(2), 316–322.

  • He, J., & Carbonell, J. G. (2007). Nearest-neighbor-based active learning for rare category detection. Neural Information Processing Systems, 21.

  • Ho, T. K. (1995). Random decision forests. Proceedings of the Document Analysis and Recognition, 1, 278–282.

  • Hodge, V., & Austin, J. (2004). A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2), 85–126.

  • Hospedales, T. M., Gong, S., & Xiang, T. (2011). Finding rare classes: Adapting generative and discriminative models in active learning. Pacific-Asia Conference on Knowledge Discovery and Data Mining, 15.

  • Huang, G. B., Ramesh, M., Berg, T., & Learned-Miller, E. (2007). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07–49, University of Massachusetts, Amherst.

  • Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.

  • Ladický, L., Russell, C., Kohli, P., & Torr, P. H. S. (2009). Associative hierarchical CRFs for object class image segmentation. ICCV, 12, 739–746.

  • Lee, Y. J., & Grauman, K. (2010). Object-graphs for context aware category discovery. In Proceedings of the CVPR.

  • Lewis, D. D., & Gale, W. A. (1994). A sequential algorithm for training text classifiers. Proceedings of the Conference on Research and Development in Information Retrieval, 17, 3–12.

  • Loy, C. C., Hospedales, T. M., Xiang, T., & Gong, S. (2012). Stream-based joint exploration-exploitation active learning. In Proceedings of the CVPR.

  • MacKay, D. J. C. (1992). Information-based objective functions for active data selection. Neural Computation, 4(4), 590–604.

  • Maloof, M. A., Langley, P., Binford, T. O., Nevatia, R., & Sage, S. (2003). Improved rooftop detection in aerial images with machine learning. Machine Learning, 53, 157–191.

  • McCallum, A., & Nigam, K. (1998). Employing EM in pool-based active learning for text classification. In Proceedings of the ICML (pp. 359–367).

  • Mitchell, T. M. (1982). Generalization as search. Artificial Intelligence, 18, 203–226.

  • Olsson, F. (2009). A literature survey of active machine learning in the context of natural language processing. Technical Report T2009:06, Swedish Institute of Computer Science.

  • Pelleg, D., & Moore, A. (2004). Active learning for anomaly and rare-category detection. Advances in Neural Information Processing Systems, 17, 1073–1080.

  • Picard, R. W., & Minka, T. P. (1995). Vision texture for annotation. Multimedia Systems, 3(1), 3–14.

  • Platt, J. C. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in large margin classifiers (pp. 61–74).

  • Rother, C., Kolmogorov, V., & Blake, A. (2004). GrabCut: Interactive foreground extraction using iterated graph cuts. SIGGRAPH, 23(3), 309–314.

  • Roy, N., & McCallum, A. (2001). Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the ICML (pp. 441–448).

  • Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639–650.

  • Settles, B. (2009). Active learning literature survey. Technical Report 1648, University of Wisconsin-Madison.

  • Settles, B., Craven, M., & Ray, S. (2008). Multiple-instance active learning. NIPS, 20, 1289–1296.

  • Seung, H. S., Opper, M., & Sompolinsky, H. (1992). Query by committee. Proceedings of the Workshop on Computational Learning Theory, 5, 287–294.

  • Sillito, R. R., & Fisher, R. B. (2007). Incremental one-class learning with bounded computational complexity. International Conference on Artificial Neural Networks, 17, 58–67.

  • Stokes, J. W., Platt, J. C., Kravis, J., & Shilman, M. (2008). ALADIN: Active learning of anomalies to detect intrusion. Technical Report 2008–2024, Microsoft Research.

  • Teh, Y. W., & Jordan, M. I. (2010). Hierarchical Bayesian nonparametric models with applications. In Bayesian nonparametrics. Cambridge: Cambridge University Press.

  • Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581.

  • Tong, S., & Koller, D. (2000). Support vector machine active learning with applications to text classification. ICML, 2, 45–66.

  • Vatturi, P., & Wong, W.-K. (2009). Category detection using hierarchical mean shift. Knowledge Discovery and Data Mining, 15, 847–856.

  • Vlachos, A., Ghahramani, Z., & Briscoe, T. (2010). Active learning for constrained Dirichlet process mixture models. In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics (pp. 57–61).

Author information

Corresponding author

Correspondence to Tom S. F. Haines.

Appendix: Alternative Choices

We now discuss some of the alternatives to the presented approach that were considered. Firstly, as discussed in Sect. 3.3, one variant led to an improvement: soft selection over hard selection. Soft selection can be taken further by raising the \(P(\mathrm{wrong})\) value to a power, a parameter that emphasises or de-emphasises large values. This can be tuned to get better results, but as a problem-specific parameter it is of no value to active learning, because parameter tuning is incompatible with a single set of queries. The KDE variant in Fig. 6 is similar, except that its parameter is fatally sensitive.
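
The difference between hard selection, soft selection and the powered variant can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes soft selection means drawing the next query with probability proportional to \(P(\mathrm{wrong})\), that a value of \(P(\mathrm{wrong})\) is already available for every pool item, and that the exponent is positive. The names `selection_probabilities`, `hard_select`, `soft_select` and `power` are hypothetical.

```python
import random


def selection_probabilities(p_wrong, power=1.0):
    """Normalise P(wrong)**power into a distribution over the pool.

    power > 1 emphasises large values, 0 < power < 1 de-emphasises them,
    and power == 1 is plain soft selection.
    """
    weights = [p ** power for p in p_wrong]
    total = sum(weights)
    if total == 0.0:
        # Degenerate case: no item is informative, so fall back to uniform.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def hard_select(p_wrong):
    """Hard selection: always query the item with the largest P(wrong)."""
    return max(range(len(p_wrong)), key=lambda i: p_wrong[i])


def soft_select(p_wrong, power=1.0, rng=random):
    """Soft selection: draw one pool index according to the distribution."""
    probs = selection_probabilities(p_wrong, power)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


pool_p_wrong = [0.1, 0.2, 0.7]
print(hard_select(pool_p_wrong))
print(selection_probabilities(pool_p_wrong, power=1.0))
print(selection_probabilities(pool_p_wrong, power=2.0))
print(soft_select(pool_p_wrong, power=2.0, rng=random.Random(0)))
```

Raising the exponent moves the sampler towards hard selection and lowering it moves towards uniform sampling, which is why the best value depends on the problem and cannot be chosen without extra labelled runs.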

The probability of being wrong can be interpreted as an expectation over zero-one loss, so alternative loss functions can be considered. Hinge loss for a multinomial distribution can be defined as the difference between the highest probability in the distribution and the probability of the correct answer, which is 0 if the correct answer has the greatest probability. However, it often undermined performance.
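
The two losses can be compared on a small example. The sketch below rests on assumptions that this appendix does not spell out: the expectation is taken over two independent distributions, one over the true class (including a "new" class) and one over the label the classifier would assign, and a label the classifier cannot assign is given probability 0. The function names and the dictionaries `p_true` and `p_assign` are hypothetical.

```python
def expected_zero_one_loss(p_true, p_assign):
    """Probability that the assigned label differs from the true label.

    Equals 1 - sum_c p_true[c] * p_assign[c] when the two draws are independent.
    """
    agree = sum(p * p_assign.get(label, 0.0) for label, p in p_true.items())
    return 1.0 - agree


def expected_hinge_loss(p_true, p_assign):
    """Expected gap between the top assignment probability and the true label's.

    The gap is 0 whenever the true label is the classifier's most probable one.
    """
    top = max(p_assign.values())
    return sum(p * (top - p_assign.get(label, 0.0)) for label, p in p_true.items())


# Illustrative numbers only; "new" stands for a class not yet discovered.
p_true = {"a": 0.5, "b": 0.3, "new": 0.2}
p_assign = {"a": 0.6, "b": 0.4}

print(expected_zero_one_loss(p_true, p_assign))
print(expected_hinge_loss(p_true, p_assign))
```

In this example the hinge variant contributes nothing for class "a", because the classifier already ranks it first, whereas the zero-one expectation still penalises the residual chance of assigning "b".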

Query by committee (QBC) was explored by Loy et al. (2012); however, their formulation really served as a probabilistic selection threshold function. Using multiple models, it can instead be formulated to measure variance, so that \(P(\mathrm{wrong})\) also focuses on areas with high model uncertainty (see footnote 12). Note that there are two estimates: an estimate of the actual class membership, including the possibility of being something new, and an estimate of what the classifier is going to assign. We can use different models from a committee for these two roles. A QBC variant can then be defined using a committee where all assignment combinations are summed out, so a high QBC \(P(\mathrm{wrong})\) score is obtained at boundaries between classes, in areas where new classes could be found, and where the current model has high uncertainty. Unfortunately, this resulted in too much emphasis being placed on boundary refinement.
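
One way to read "all assignment combinations are summed out" is sketched below. This is an interpretation, not the authors' formulation: it assumes each committee member supplies both a class-membership estimate and an assignment estimate, that the score for one pairing is one minus the probability that the two agree, and that the pairings are averaged over all ordered pairs of distinct members. The names `p_wrong` and `qbc_p_wrong` are hypothetical.

```python
from itertools import permutations


def p_wrong(p_true, p_assign):
    """One minus the probability that membership and assignment agree."""
    return 1.0 - sum(p * p_assign.get(c, 0.0) for c, p in p_true.items())


def qbc_p_wrong(members):
    """Average p_wrong over every ordered pair of distinct committee members.

    members is a list of (p_true, p_assign) pairs, one per model; member i
    supplies the membership estimate and member j the assignment estimate.
    """
    pairs = list(permutations(range(len(members)), 2))
    if not pairs:
        raise ValueError("need at least two committee members")
    total = sum(p_wrong(members[i][0], members[j][1]) for i, j in pairs)
    return total / len(pairs)


# Two members that agree, and two that disagree about the same item.
confident = ({"a": 0.9, "b": 0.1}, {"a": 0.9, "b": 0.1})
opposite = ({"a": 0.1, "b": 0.9}, {"a": 0.1, "b": 0.9})

print(qbc_p_wrong([confident, confident]))
print(qbc_p_wrong([confident, opposite]))
```

Disagreement between members raises the score even when each member is individually confident, which matches the stated intent of rewarding model uncertainty and also shows how such a score can over-reward class boundaries.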

For some problems the above variants are better. The issue is that there is no way to predict in advance which problems these are, and for some problems the variants are much worse. Active learning is a scenario where you choose a method and apply it to your problem once; multiple runs require that the queries for each be satisfied, which is contrary to the goal. We therefore present \(P(\mathrm{wrong})\) as formulated because it is consistent: it never performs poorly, and usually gives top-tier performance. Future work could consider inferring which data sets work best with different active learners.

About this article

Cite this article

Haines, T.S.F., Xiang, T. Active Rare Class Discovery and Classification Using Dirichlet Processes. Int J Comput Vis 106, 315–331 (2014).

  • Active learning
  • Rare class discovery
  • Classification