Abstract
In this paper, we extend several existing methods that apply distance function learning to regression problems. We show that these methods can be viewed as approximating a matrix of desired distances among all training samples. Based on this understanding, we propose an iterative framework in which outlier samples are corrected by their neighbors via asymptotically increasing the correlation coefficient between the desired distances and the distances between sample labels. Moreover, using this framework, we find that most existing methods iterate only once. As another extension, we adopt a nonlinear distance function and approximate it with a neural network. For a fair comparison, we conduct experiments on age estimation from face images as a regression problem; the results are comparable to the state of the art.
Notes
Manifold learning assumes that data are homogeneously sampled [3]; in other words, the data lie on or close to a low-dimensional manifold embedded in the ambient space. For most applications, the data are generated by continuously varying a set of parameters.
However, non-trivial extensions are possible; e.g., Taylor et al. [32] extended Neighborhood Component Analysis (NCA) to the regression setting.
Strictly speaking, the distance functions proposed in [3, 17] cannot be referred to as metrics, because they do not satisfy the triangle inequality, one of the metric axioms. Instead, they should be called non-metric distances or semi-metrics, in keeping with most existing literature such as [31]. We discuss this further in Section 4.
Other NN topologies of similar size lead to only slight performance differences. Investigating the optimal topology is a pure machine learning problem and is beyond the scope of this work. Here we only present a good network configuration; its optimality is not guaranteed.
It is referred to as Mean Square Error (MSE) in the context of NN.
Also, the training labels are integers owing to limitations of dataset collection, but intermediate label values and the final predicted labels are real numbers.
Here, the dimensionality of a regressor refers to its Vapnik–Chervonenkis (VC) dimension-based complexity; see [4] for details.
\( \widehat{d}(i,j)=\left(\frac{|L(i,j)|}{C-|L(i,j)|}\right)^{p}\times d(i,j) \), where L(i,j) is the label difference between two samples, C is a constant greater than any label value in the training set (ensuring the denominator is positive), p is selected to be 2 to make the data easier to discriminate, and d(i,j) is the Euclidean distance between samples \(X_i\) and \(X_j\).
\( \widehat{d}(i,j)=\left(\frac{|L(i,j)|+\gamma}{C-|L(i,j)|}\right)^{p}\times d(i,j) \), where L(i,j) is the absolute label difference between two samples. γ models the labeling noise; more specifically, a human face image labeled as 7 years actually ranges within 7–8 years, so the labeling noise is 1 year in this case. C = max L(i,j) + ε, with ε > 0 ensuring the denominator is nonzero. p = 2 is selected to make the data easier to discriminate. The meaning of d(i,j) is the same as in Eq.(27) in [17] (see the previous footnote).
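For concreteness, this weighting can be computed for an entire training set at once. The following NumPy sketch is ours, not the paper's; the function and parameter names are illustrative, and scalar labels with a Euclidean d(i,j) are assumed:

```python
import numpy as np

def desired_distances(X, labels, gamma=1.0, eps=1e-6, p=2):
    """Illustrative sketch of the desired-distance weighting above."""
    # |L(i,j)|: absolute label difference between samples i and j
    L = np.abs(labels[:, None] - labels[None, :])
    # C = max L(i,j) + eps, keeping the denominator strictly positive
    C = L.max() + eps
    # d(i,j): Euclidean distance between samples X_i and X_j
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return ((L + gamma) / (C - L)) ** p * d
```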
\( \widehat{d}(i,j)=\begin{cases}\frac{\alpha(L(i,j))}{C-L(i,j)}\times d(i,j), & L(i,j)\neq 0\\ 0, & L(i,j)=0\end{cases} \), where the function α(∙) is directly proportional to the label distance (in this case, the pose distance). The meanings of L(i,j), d(i,j) and C are the same as in Eq.(2) in [3] (see the previous footnote).
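Under the additional assumption α(L) = L (the note above only states that α(∙) is proportional to the label distance), a minimal sketch of this piecewise variant could look as follows; the names are again illustrative:

```python
import numpy as np

def desired_distances_piecewise(X, labels, eps=1e-6):
    """Illustrative sketch, assuming alpha(L) = L."""
    L = np.abs(labels[:, None] - labels[None, :])
    C = L.max() + eps
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    out = (L / (C - L)) * d  # alpha(L(i,j)) / (C - L(i,j)) * d(i,j)
    out[L == 0] = 0.0        # second branch: zero desired distance for equal labels
    return out
```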
An obvious counterexample is to combine two three-point metrics, both with d(a,b) = 1, d(b,c) = 1, and d(a,c) = 2.
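Assuming that "combine" means taking the pointwise product of the two metrics, as the label-weighted distances above effectively do with the Euclidean metric, a quick numeric check confirms the violation:

```python
# Each three-point metric satisfies the triangle inequality (2 <= 1 + 1
# holds with equality), but their pointwise product does not (4 > 1 + 1).
d1 = {('a', 'b'): 1, ('b', 'c'): 1, ('a', 'c'): 2}
d2 = dict(d1)  # the second, identical three-point metric
prod = {k: d1[k] * d2[k] for k in d1}

assert d1[('a', 'c')] <= d1[('a', 'b')] + d1[('b', 'c')]       # 2 <= 2: metric
assert prod[('a', 'c')] > prod[('a', 'b')] + prod[('b', 'c')]  # 4 > 2: semi-metric
```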
It is scaled so that the mean of the Euclidean distances equals the mean of \( id_{ij} \). Note that distance itself is a first-order quantity, so we do not need to scale it according to its variance.
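This rescaling amounts to a single multiplicative factor; a hypothetical one-liner (the names are ours):

```python
import numpy as np

def scale_to_match_mean(euclid, id_dist):
    """Rescale Euclidean distances so their mean equals the mean of id_ij."""
    return euclid * (np.mean(id_dist) / np.mean(euclid))
```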
In particular, δ(dd,ad) = δ(dd,ad′) implies that the NN is not updated and ad = ad′. Denoting by dd* the desired distance in the next iteration, we then have dd* = dd; that is, the iterative algorithm has already converged to ad′.
References
Balasubramanian VN, Ye J, Panchanathan S (2007) Biased manifold embedding: A framework for person-independent head pose estimation, Proc. CVPR, pp.1–7
Bar-Hillel A, Weinshall D (2007) Learning distance function by coding similarity, Proc. ICML, pp.65–72
Castillo E, Berdinas BG, Romero OF, Betanzos AA (2006) A very fast learning method for neural networks based on sensitivity analysis. J Mach Learn Res 7:1159–1182
Cherkassky V, Shao X, Mulier FM, Vapnik VN (1999) Model complexity control for regression using VC generalization bounds. IEEE Trans Neural Netw 10(5):1075–1089
Chopra S, Hadsell R, LeCun Y (2005) Learning a similarity metric discriminatively, with application to face verification, Proc. CVPR, pp.539–546
Cootes TF, Edwards GJ, Taylor CJ (2001) Active appearance models. IEEE Trans PAMI 23(6):681–685
Davis JV, Kulis B, Jain P, Sra S, Dhillon IS (2007) Information-theoretic metric learning, Proc. ICML, pp.209–216
Fan N (2011) Learning nonlinear distance functions using neural network for regression with application to robust human age estimation, Proc. ICCV, pp.249–254
FG-NET Aging Database, http://www.fgnet.rsunit.com
Geng X, Miles KS, Zhou ZH (2008) Facial age estimation by nonlinear aging pattern subspace, Proc. ACM Multimedia, pp.721–724
Geng X, Zhou ZH, Miles KS (2007) Automatic age estimation based on facial aging patterns. IEEE Trans PAMI 29(12):2234–2240
Goldberger J, Roweis S, Hinton G, Salakhutdinov R (2005) Neighbourhood components analysis, Proc. NIPS, pp.513–520
Guo GD, Mu G, Fu Y, Dyer C, Huang TS (2009) A study on automatic age estimation using a large database, Proc. ICCV, pp.1–8
Guo GD, Mu G, Fu Y, Huang TS (2009) Human age estimation using bio-inspired features, Proc. CVPR, pp.1–8
He X, Ma WY, Zhang HJ (2004) Learning an image manifold for retrieval, Proc. ACM Multimedia, pp.17–23
Huang YZ, Long YJ (2008) Demosaicking recognition with applications in digital photo authentication based on a quadratic pixel correlation model, Proc. CVPR, pp.1–8
Jin C, Long YJ (2010) On label information incorporated metric learning for regressions. Int J Comput Intell Appl 9(4):339–351
Lanitis A, Draganova C, Christodoulou C (2004) Comparing different classifiers for automatic age estimation. IEEE Trans SMC-B 34(1):621–628
Long YJ, Huang YZ (2006) Image based source camera identification using demosaicking, Proc. 8th IEEE Workshop on Multimedia Signal Processing, pp.419–424
Macskassy SA, Hirsh H, Banerjee A, Dayanik AA (2003) Converting numerical classification into text classification. Artif Intell 143(1):51–77
McCullagh P (1980) Regression models for ordinal data. J R Stat Soc Ser B 42(2):109–142
Min R, van der Maaten LJP, Yuan Z, Bonner A, Zhang Z (2010) Deep supervised t-distributed embedding, Proc. ICML, pp.791–798
Møller MF (1993) A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw 6(4):525–533
Pan (2010) Human age estimation by metric learning for regression problems, Proc. EMMCVPR, pp.455–465
Ramanathan N, Chellappa R, Biswas S (2009) Age progression in human faces: a survey. J Vis Lang Comput 20:131–144
Salakhutdinov R, Hinton G (2007) Learning a nonlinear embedding by preserving class neighbourhood structure, Proc. AI and Statistics, pp. 412–419
Shalev-Shwartz S, Singer Y, Ng AY (2004) Online and batch learning of pseudo-metrics, Proc. ICML, pp.743–750
Shental N, Hertz T, Weinshall D, Pavel M (2002) Adjustment learning and relevant component analysis, Proc. ECCV, pp.776–792
Smith L (2002) A tutorial on Principal Components Analysis. http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Stanley KO (2007) Compositional pattern producing networks: a novel abstraction of development. Genet Program Evolvable Mach 8(2):131–162
Tan X, Chen S, Li J, Zhou Z (2006) Learning non-metric partial similarity based on maximal margin criterion, Proc. CVPR, pp.138–145
Taylor G, Fergus R, Williams G, Spiro I, Bregler C (2010) Pose-sensitive embedding by nonlinear NCA regression, Proc. NIPS
Weinberger K, Blitzer J, Saul L (2006) Distance metric learning for large margin nearest neighbor classification, Proc. NIPS, pp.1475–1482
Xing E, Ng A, Jordan MI, Russell S (2002) Distance metric learning with application to clustering with side-information, Proc. NIPS, pp.505–512
Yan S, Wang H, Huang TS, Tang X (2007) Ranking with uncertain labels, Proc. ICME, pp.96–99
Yan S, Wang H, Tang X, Huang T (2007) Learning auto-structured regressor from uncertain nonnegative labels. Proc. ICCV, pp.1–8
Yan S, Zhou X, Liu M, Johnson MH, Huang TS (2008) Regression from patch-kernel, Proc. CVPR, pp.1–8
Yang L, Jin R (2006) Distance metric learning: a comprehensive survey, Technical report, Michigan State University. http://www.cs.cmu.edu/~liuy/frame_survey_v2.pdf
Yeung DY, Chang H (2007) A kernel approach for semi-supervised metric learning. IEEE Trans Neural Netw 18(1):141–149
Cite this article
Chen, J., Zeng, H. & Fan, N. Nonlinear distance function learning using neural network: an iterative framework. Multimed Tools Appl 74, 671–688 (2015). https://doi.org/10.1007/s11042-014-1944-z