Learning Prior Bias in Classifier
Abstract
In pattern classification, a classifier is generally composed of both a feature (vector) mapping and a bias. While the mapping function for features is formulated in either a linear or a nonlinear (kernel-based) form, the bias is simply represented by a constant scalar value, encoding prior information on the class probabilities. In this paper, focusing on the prior bias embedded in the classifier, we propose a novel method to discriminatively learn not only the feature mapping function but also the prior bias, based on extra prior information assigned to the samples other than the class category, e.g., the 2D position where a local image feature is extracted. Without imposing specific probabilistic models, the proposed method is formulated in the maximum-margin framework to adaptively optimize the biases, improving the classification performance. We present a computationally efficient optimization approach that makes the method applicable even to large-scale data. The experimental results on patch labeling in on-board camera images demonstrate the favorable performance of the proposed method in terms of both classification accuracy and computation time.
Keywords
Pattern classification, Bias, Discriminative learning, SVM
1 Introduction
Prior information has been effectively exploited in the fields of computer vision and machine learning, such as for shape matching [1], image segmentation [2], graph inference [3], transfer learning [4] and multi-task learning [5]. Learning priors has so far been addressed mainly in a probabilistic framework, on the assumption that the prior is defined by a certain type of generative probabilistic model [6, 7]; in particular, nonparametric Bayesian approaches further consider hyper-priors over the probabilistic models [8].
Suppose samples are associated with the extra prior information \(p\in \{1,\cdots ,P\}\) as well as the class category \(c\in \{1,\cdots ,C\}\), where P and C indicate the total number of prior types and class categories, respectively. For instance, in the task of labeling patches in on-board camera images, each patch (sample) is assigned the appearance feature vector \(\varvec{x}\), the class category c and the position (extra prior information) \(p\), as shown in Fig. 1. The class category of the patch is effectively predicted by using not only the feature \(\varvec{x}\) but also the prior position \(p\) where the feature is extracted; the patches in an upper region probably belong to sky while those in the lower region would be road, even though the patches extracted from those two regions are both less textured, resulting in similar features.
Classification methods for the c-th class category. The dimensionality of the feature vector is denoted by D, \(\varvec{x}\in \mathbb {R}^D\), and the number of extra prior types by \(P\).
Method  Model  D.O.F.
simple  \(y_c=\varvec{w}_c^\top \varvec{x}+b_c\)  \(D+1\)
proposed  \(y_c=\varvec{w}_c^\top \varvec{x}+b_c^{[p]}\)  \(D+P\)
full-connected  \(y_c={\varvec{w}_c^{[p]}}^\top \varvec{x}+b_c^{[p]}\)  \(PD+P\)
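The differences between the three models in Table 1 can be sketched with toy NumPy code; the random parameters and sizes below are purely hypothetical and only illustrate the parameterization (degrees of freedom) of each model:

```python
import numpy as np

# Toy illustration of the three decision functions in Table 1
# (hypothetical random parameters; D = feature dim, P = prior types, C = classes).
D, P, C = 4, 3, 2
rng = np.random.default_rng(0)
x = rng.standard_normal(D)   # appearance feature of one sample
p = 1                        # its extra prior (e.g., patch-position index)

# simple: shared weights and one scalar bias per class -> D + 1 D.O.F.
W = rng.standard_normal((C, D))
b = rng.standard_normal(C)
y_simple = W @ x + b

# proposed: shared weights, but a per-prior bias b_c^[p] -> D + P D.O.F.
B = rng.standard_normal((C, P))
y_proposed = W @ x + B[:, p]

# full-connected: separate weights *and* bias for every prior -> P*D + P D.O.F.
Wp = rng.standard_normal((P, C, D))
y_full = Wp[p] @ x + B[:, p]
```

The proposed model thus adds only P − 1 parameters per class over the simple model, while the full-connected model multiplies the weight count by P.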
2 Classifier Bias Learning
We detail the proposed method by first defining the formulation for learning the biases and then presenting a computationally efficient approach to optimize them. As we describe a general form of the prior biases \(b^{[p]}\), it might be helpful, for better understanding, to refer to the task of labeling patches shown in Fig. 1; the sample is represented by the appearance feature vector \(\varvec{x}\) and the prior position \(p\in \{1,\cdots ,P\}\).
2.1 Formulation
2.2 Optimization
2.3 Trivial Biases
This gives the tight bias based on the above conditions (25, 26), which is computed by using only the samples \(\varvec{x}_i^{[p]}\) belonging to the prior \(p\).
These three ways are empirically compared in the experiments (Sect. 3.3).
2.4 Discussion
In the proposed method, all samples across all types of priors are leveraged to train the classifier, improving the generalization performance. In contrast, the full-connected method (Table 1) treats the samples separately with respect to the priors, and thus the p-th classifier is learnt by using only the small number of samples belonging to the p-th type of prior, which might degrade the performance. On the other hand, the simple method, learning the classifier from the whole set of samples, is less discriminative as it does not utilize the extra prior information associated with the samples. The proposed method effectively incorporates the prior information into the classifiers via the biases, which are discriminatively optimized.
The proposed method is closely related to cross-modal learning [19, 20]. The samples belonging to different priors can be regarded as if they were in different modalities, though the feature representations are the same in this case; the proposed method deals with them in a unified manner via the adaptive prior biases. Indeed, the proposed method is applicable to samples that are distributed differently across the priors: if the sample distribution is shifted (translated) as \(\varvec{x}^{[q]}=\varvec{x}^{[p]}+\varvec{e}\), the prior bias can adapt to it by \(b^{[q]}=b^{[p]}-\varvec{w}^\top \varvec{e}\), since \(y^{[p]}=\varvec{w}^\top \varvec{x}^{[p]}+b^{[p]}\) and \({y}^{[q]} = \varvec{w}^\top \varvec{x}^{[q]}+b^{[q]}=\varvec{w}^\top \varvec{x}^{[p]}+(b^{[q]}+\varvec{w}^\top \varvec{e})=\varvec{w}^\top \varvec{x}^{[p]}+b^{[p]}=y^{[p]}\). Therefore, the samples of the different priors are effectively transferred into the optimization to improve the classification performance.
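The shift-adaptation property above can be verified numerically; the vectors and bias below are arbitrary toy values, not data from the paper:

```python
import numpy as np

# Toy check of the shift-adaptation property: if samples of prior q are
# translated copies of prior p (x_q = x_p + e), then setting the adapted
# bias b_q = b_p - w^T e leaves the classifier score unchanged.
rng = np.random.default_rng(1)
w = rng.standard_normal(5)      # shared weight vector
x_p = rng.standard_normal(5)    # a sample of prior p
e = rng.standard_normal(5)      # translation between the two priors
b_p = 0.3                       # bias of prior p (arbitrary)

x_q = x_p + e                   # the corresponding sample of prior q
b_q = b_p - w @ e               # adapted prior bias

y_p = w @ x_p + b_p
y_q = w @ x_q + b_q
assert np.isclose(y_p, y_q)     # identical scores, as derived above
```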
3 Experimental Results
3.1 Setting
The CamVid dataset [21] contains several sequences composed of fully labeled image frames, as shown in Fig. 3: each pixel is assigned one of 32 class labels including ‘void’. Those labeled images are captured at 10 Hz. In this experiment, we employ the 11 major labels frequently seen in the image frames (road, building, sky, tree, sidewalk, car, column pole, sign symbol, fence, pedestrian and bicyclist) to form an 11-class classification task.
We extracted the GLAC image feature [22] from a local image patch of \(20\times 40\) pixels which slides every 10 pixels over the resized image of \(480\times 360\). In this case, the feature vector \(\varvec{x}\in \mathbb {R}^{2112}\) is associated with the 2D position of the patch as the extra prior information; the total number of prior types (grid points) is \(P=1551\). Thus, the task is to categorize the patch feature vectors extracted at the 1551 positions into the above-mentioned 11 classes.
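The grid-point count follows directly from the patch size and stride; a short sketch of this arithmetic:

```python
# Number of patch positions (prior types P) implied by the setting above:
# a 20x40-pixel patch sliding every 10 pixels over a 480x360 image.
img_w, img_h = 480, 360
patch_w, patch_h = 20, 40
stride = 10

nx = (img_w - patch_w) // stride + 1   # 47 horizontal positions
ny = (img_h - patch_h) // stride + 1   # 33 vertical positions
P = nx * ny
print(P)  # -> 1551
```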
We used three sequences in the CamVid dataset, and partitioned each sequence into three subsequences along the time axis, one of which was used for training and the others for testing. This cross-validation was repeated three times and the averaged classification accuracy is reported.
3.2 Computation Cost
We first evaluated the proposed method in terms of computation cost. The method trains the classifier by using all the samples across the priors, whose scale is as large as in the simple method. These methods are implemented in MATLAB on a Xeon 3.4 GHz PC. In the proposed method, we apply LIBSVM [23] to solve the QP, and efficiently compute the derivatives \(G_{i,p}(\varvec{\alpha })\) required for the linear term in the objective function (11) and \(\delta ^{[p]}\) in (21) by exploiting the linear classification form as in [24]. For the simple method, two types of solvers, LIBSVM and LIBLINEAR [24], are applied.
Performance comparison (%) of the ways of computing biases on trivial priors.
Tight  Mild  Extreme
52.22  52.19  52.25
Performance comparison.

3.3 Trivial Biases
3.4 Classification Performance
We finally compared the classification performance of the three methods, simple, full-connected and proposed (listed in Table 1); for reference, we also apply the kernel-based extension of the proposed method using the Gaussian kernel \(\mathsf {k}(\varvec{x}_i,\varvec{x}_j)=\exp (-\frac{\Vert \varvec{x}_i-\varvec{x}_j\Vert }{\gamma })\), where \(\gamma \) is the mean of the pairwise distances. Table 3 shows the overall performance, demonstrating that the proposed method outperforms the others. It should be noted that the full-connected method individually applies the classifier specific to the prior \(p\in \{1,\cdots ,P\}\), requiring plenty of memory storage and consequently taking a long classification time due to loading the enormous memory. The proposed method renders classification as fast as the simple method, since it enlarges only the bias. By discriminatively optimizing the biases for the respective priors, the performance is significantly improved in comparison to the simple method; the improvement is especially found in the categories of car, pedestrian and bicyclist, which are composed of patch parts similar to other categories but are associated with distinct prior positions.
The kernel-based method (proposed-kernel) further improves the performance on the foreground object categories, such as column pole and pedestrian. Those foreground objects exhibit large appearance variations due to viewpoint changes and within-class variations themselves, and the kernel-based method produces a more discriminative feature mapping function than the linear method.
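The Gaussian kernel with the mean-pairwise-distance bandwidth can be sketched as follows; the toy data below stand in for the actual GLAC features:

```python
import numpy as np

# Sketch of the Gaussian kernel used in the kernel-based extension,
# k(x_i, x_j) = exp(-||x_i - x_j|| / gamma), with gamma set to the mean
# of the pairwise distances (random toy data, not the CamVid features).
rng = np.random.default_rng(2)
n, d = 6, 4
X = rng.standard_normal((n, d))                      # n samples of dim d

# All pairwise Euclidean distances via broadcasting.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Bandwidth: mean over the distinct pairs (upper triangle, i < j).
gamma = dists[np.triu_indices(n, k=1)].mean()

K = np.exp(-dists / gamma)                           # kernel (Gram) matrix
```

Setting γ from the data itself makes the kernel scale-free, which is why the paper needs no bandwidth tuning for this baseline.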
Finally, we show in Fig. 5 the biases learnt by the proposed method; the biases \(\{b^{[p]}\}_p\) are folded into the form of an image frame according to the x-y positions. These maps of the biases reflect the prior probability over the locations where the target category appears. They seem quite reasonable from the viewpoint of the traffic rules that cars obey; since the CamVid dataset was collected in the city of Cambridge [21], the traffic rules in this case are those of the United Kingdom. Pedestrians probably walk on the sidewalk, mainly shown on the left side; oncoming cars run on the right-hand road; and rows of buildings are found on the roadside. These biases are adaptively learnt from the CamVid dataset, and they would be different for datasets collected under different traffic rules.
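The folding of the learnt biases into an image-like map amounts to a reshape over the patch grid; a minimal sketch, assuming the hypothetical 33×47 grid implied by the patch layout and random stand-in values for the biases:

```python
import numpy as np

# Fold the per-position biases {b^[p]} back into an image-like map.
# Grid size (ny, nx) = (33, 47) follows the 20x40 patch / stride-10 layout;
# the bias values here are random stand-ins, not learnt ones.
ny, nx = 33, 47
b = np.random.default_rng(3).standard_normal(ny * nx)

bias_map = b.reshape(ny, nx)   # one such map per class in the actual method
```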
4 Conclusions
We have proposed a method to discriminatively learn the prior biases in classification. In the proposed method, for improving the classification performance, all samples are utilized to train the classifier, and the input sample is adequately classified based on the prior information via the learnt biases. The proposed method is formulated in the maximum-margin framework, resulting in an optimization problem of quadratic-programming form, similar to SVM. We also presented a computationally efficient approach to optimize the resultant quadratic programming along the lines of sequential minimal optimization. The experimental results on patch labeling in on-board camera images demonstrated that the proposed method is superior in terms of both classification accuracy and computation cost. In particular, the proposed classifier operates as fast as the standard (linear) classifier, and the computation time for training the classifier is even shorter than that of an SVM of the same size.
References
1. Jiang, T., Jurie, F., Schmid, C.: Learning shape prior models for object matching. In: CVPR 2009, the 22nd IEEE Conference on Computer Vision and Pattern Recognition, pp. 848–855 (2009)
2. El-Baz, A., Gimel’farb, G.: Robust image segmentation using learned priors. In: ICCV 2009, the 12th International Conference on Computer Vision, pp. 857–864 (2009)
3. Cremers, D., Grady, L.: Statistical priors for efficient combinatorial optimization via graph cuts. In: Pinz, A., Leonardis, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 263–274. Springer, Heidelberg (2006)
4. Jie, L., Tommasi, T., Caputo, B.: Multiclass transfer learning from unconstrained priors. In: ICCV 2011, the 13th International Conference on Computer Vision, pp. 1863–1870 (2011)
5. Yuan, C., Hu, W., Tian, G., Yang, S., Wang, H.: Multi-task sparse learning with beta process prior for action recognition. In: CVPR 2013, the 26th IEEE Conference on Computer Vision and Pattern Recognition, pp. 423–430 (2013)
6. Wang, C., Liao, X., Carin, L., Dunson, D.: Classification with incomplete data using Dirichlet process priors. J. Mach. Learn. Res. 11, 3269–3311 (2010)
7. Kapoor, A., Hua, G., Akbarzadeh, A., Baker, S.: Which faces to tag: adding prior constraints into active learning. In: ICCV 2009, the 12th International Conference on Computer Vision, pp. 1058–1065 (2009)
8. Ghosh, J., Ramamoorthi, R.: Bayesian Nonparametrics. Springer, Berlin (2003)
9. Poggio, T., Mukherjee, S., Rifkin, R., Rakhlin, A., Verri, A.: b. Technical report, CBCL Paper #198/AI Memo #2001-011, Massachusetts Institute of Technology, Cambridge (2001)
10. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, New York (1995)
11. Van Gestel, T., Suykens, J., Lanckriet, G., Lambrechts, A., De Moor, B., Vandewalle, J.: Bayesian framework for least squares support vector machine classifiers, Gaussian processes and kernel Fisher discriminant analysis. Neural Comput. 15, 1115–1148 (2002)
12. Gao, T., Stark, M., Koller, D.: What makes a good detector? – structured priors for learning from few examples. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part V. LNCS, vol. 7576, pp. 354–367. Springer, Heidelberg (2012)
13. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Berlin (2006)
14. Smola, A.J., Bartlett, P., Schölkopf, B., Schuurmans, D.: Advances in Large-Margin Classifiers. MIT Press, Cambridge (2000)
15. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
16. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods - Support Vector Learning, pp. 185–208. MIT Press, Cambridge (1999)
17. Hsieh, C.J., Chang, K.W., Lin, C.J., Keerthi, S.S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. In: ICML 2008, the 25th International Conference on Machine Learning, pp. 408–415 (2008)
18. Fan, R.E., Chen, P.H., Lin, C.J.: Working set selection using second order information for training support vector machines. J. Mach. Learn. Res. 6, 1889–1918 (2005)
19. Kan, M., Shan, S., Zhang, H., Lao, S.: Multi-view discriminant analysis. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part I. LNCS, vol. 7572, pp. 808–821. Springer, Heidelberg (2012)
20. Sharma, A., Jacobs, D.: Bypassing synthesis: PLS for face recognition with pose, low-resolution and sketch. In: CVPR 2011, the 24th IEEE Conference on Computer Vision and Pattern Recognition, pp. 593–600 (2011)
21. Fauqueur, J., Brostow, G.J., Shotton, J., Cipolla, R.: Segmentation and recognition using structure from motion point clouds. In: Torr, P., Forsyth, D., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 44–57. Springer, Heidelberg (2008)
22. Kobayashi, T., Otsu, N.: Image feature extraction using gradient local auto-correlations. In: Torr, P., Forsyth, D., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 346–358. Springer, Heidelberg (2008)
23. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
24. Fan, R., Chang, K., Hsieh, C., Wang, X., Lin, C.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008). Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear
25. Joachims, T.: Making large-scale SVM learning practical. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods - Support Vector Learning, pp. 169–184. MIT Press, Cambridge (1999)