## Abstract

Understanding how images of objects and scenes behave in response to specific egomotions is a crucial aspect of proper visual development, yet existing visual learning methods are conspicuously disconnected from the physical source of their images. We propose a new “embodied” visual learning paradigm, exploiting proprioceptive motor signals to train visual representations from egocentric video with no manual supervision. Specifically, we enforce that our learned features exhibit equivariance, i.e., they respond predictably to transformations associated with distinct egomotions. With three datasets, we show that our unsupervised feature learning approach significantly outperforms previous approaches on visual recognition and next-best-view prediction tasks. In the most challenging test, we show that features learned from video captured on an autonomous driving platform improve large-scale scene recognition in static images from a disjoint domain.


## Notes

- 1.
Depending on the context, the motor activity could correspond to either the 6-DOF egomotion of the observer moving in the scene or the second-hand motion of an object being actively manipulated, e.g., by a person or robot’s end effectors.

- 2.
One could attempt to apply our idea using camera poses inferred from the video itself (i.e., with structure from motion). However, there are conceptual and practical advantages to relying instead on external sensor data capturing egomotion. First, the sensor data, when available, is much more efficient to obtain and can be more reliable. Second, the use of an external sensor parallels the desired effect of the agent learning from its proprioceptive motor signals, as opposed to bootstrapping the visual learning process from a previously defined visual odometry module based on the same visual input stream.

- 3.
- 4.
Bias dimension assumed to be included in *D* for notational simplicity.
- 5.
- 6.
Note that we do not test equiv (non-discrete), i.e., the non-discrete formulation of Eq. (13), in isolation. This is because Eq. (13) specifies a non-contrastive loss that would result in collapsed feature spaces (such as \({\mathbf {z}}_{\varvec{\theta }}=\varvec{0}\ \forall \varvec{x}\)) if optimized on its own. To overcome this deficiency, we optimize this non-discrete equivariance loss only in conjunction with the contrastive drlim loss in the equiv + drlim (non-discrete) approach.
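A toy numerical check makes the collapse argument concrete (Python with NumPy; the dimensions and the identity map here are illustrative, not the paper's implementation): the all-zeros embedding is a global minimizer of the non-contrastive equivariance penalty, but the contrastive drlim term still penalizes it on non-neighbor pairs, which is why the two losses must be optimized jointly.

```python
import numpy as np

def equiv_loss(z_x, z_gx, M):
    # Non-contrastive equivariance penalty: ||M z(gx) - z(x)||^2
    return float(np.sum((M @ z_gx - z_x) ** 2))

def drlim_loss(z1, z2, neighbor, delta=1.0):
    # Contrastive loss: pull neighbor pairs together, push non-neighbor
    # pairs apart up to margin delta
    d = np.linalg.norm(z1 - z2)
    return float(d ** 2) if neighbor else float(max(0.0, delta - d) ** 2)

M = np.eye(4)            # an arbitrary (toy) equivariance map
collapsed = np.zeros(4)  # z_theta(x) = 0 for every input x

# The collapsed embedding drives the equivariance penalty to its minimum...
print(equiv_loss(collapsed, collapsed, M))               # 0.0
# ...but a non-neighbor pair still pays the full margin under drlim.
print(drlim_loss(collapsed, collapsed, neighbor=False))  # 1.0
```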

- 7.
For fairness, the training frame pairs for each method are drawn from the same starting set of KITTI training videos.

- 8.
Note that while the number of frame pairs \(N_u\) may be different for different methods, all methods have access to the same training videos, so this is a fair comparison. The differences in \(N_u\) are due to the methods themselves. For example, on KITTI data, equiv selectively uses frame pairs corresponding to large motions (Sect. 3.1), so even given the same starting videos, it is restricted to using a smaller number of frame pairs than drlim and temporal.

- 9.
We do not use a straightforward fully connected layer Linear(4096,4096) as this would drastically increase the number of network parameters, and possibly cause overfitting of \(M_g\), backpropagating poor equivariance regularization gradients through to the base network \({\mathbf {z}}_{\varvec{\theta }}\). The \(M_g\) architecture we use in its place is non-linear due to the ReLU units.
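The parameter trade-off can be sketched numerically (NumPy; the bottleneck width `d` is a made-up value for illustration, not taken from the paper): two smaller linear layers joined by a ReLU replace the single Linear(4096, 4096) map, cutting the parameter count by more than an order of magnitude while making the map non-linear.

```python
import numpy as np

D = 4096   # feature dimension at the layer being regularized
d = 128    # hypothetical bottleneck width (illustrative assumption)

# Parameter counts, including biases
full_params = D * D + D                        # Linear(4096, 4096)
bottleneck_params = (D * d + d) + (d * D + D)  # Linear(D,d) -> ReLU -> Linear(d,D)

def M_g(z, W1, b1, W2, b2):
    """Non-linear map in place of a single large linear layer:
    a ReLU between two smaller linear maps."""
    return W2 @ np.maximum(W1 @ z + b1, 0.0) + b2

print(full_params, bottleneck_params)  # 16781312 1052800
```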

- 10.
For uniformity, we do the same recovery of \(M_g^\prime \) for our method; our results are similar either way.

- 11.
Note that in our model of equivariance, this is not strictly true, since the pair-wise difference vector \(M_g\mathbf {z_{\varvec{\theta }}}(\varvec{x})-\mathbf {z_{\varvec{\theta }}}(\varvec{x})\) need not actually be consistent across images \(\varvec{x}\). However, for small motions and linear maps \(M_g\), this still holds approximately, as we show empirically.

- 12.
- 13.
We observed in Sect. 4.4 that performance dropped at higher layers, indicating that AlexNet model complexity might be too high.

- 14.
To verify the clsnet baseline is legitimate, we also ran a Tiny Image nearest neighbor baseline on SUN as in Xiao et al. (2010). It achieves 0.61% accuracy (worse than clsnet, which achieves 0.70%).

- 15.
Technically, the equiv objective in Eq. (7) may benefit from setting different margins corresponding to the different egomotion patterns, but we overlook this in favor of scalability and fewer hyperparameters.

## References

- Agrawal, P., Carreira, J., & Malik, J. (2015). Learning to see by moving. In *ICCV*.
- Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., et al. (2016). End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.
- Bromley, J., Bentz, J. W., Bottou, L., Guyon, I., LeCun, Y., Moore, C., Säckinger, E., & Shah, R. (1993). Signature verification using a Siamese time delay neural network. *IJPRAI*.
- Cadieu, C. F., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. *Neural Computation*, *24*, 827–866.
- Chen, C., & Grauman, K. (2013). Watching unlabeled videos helps learn new human actions from very few labeled snapshots. In *CVPR*.
- Chen, C., Seff, A., Kornhauser, A., & Xiao, J. (2015). DeepDriving: Learning affordance for direct perception in autonomous driving. In *ICCV*.
- Cohen, T. S., & Welling, M. (2015). Transformation properties of learned visual representations. In *ICLR*.
- Cuda-convnet. https://code.google.com/p/cuda-convnet/.
- Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In *CVPR*.
- Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In *CVPR*.
- Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., & Darrell, T. (2014). DeCAF: A deep convolutional activation feature for generic visual recognition. In *ICML*.
- Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., & Brox, T. (2014). Discriminative unsupervised feature learning with convolutional neural networks. In *NIPS*.
- Gao, R., Jayaraman, D., & Grauman, K. (2016). Object-centric representation learning from unlabeled videos. In *ACCV*.
- Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In *CVPR*.
- Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. *IJRR*.
- Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In *AISTATS*.
- Goroshin, R., Bruna, J., Tompson, J., Eigen, D., & LeCun, Y. (2015). Unsupervised learning of spatiotemporally coherent metrics. In *ICCV*.
- Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In *CVPR*.
- Held, R., & Hein, A. (1963). Movement-produced stimulation in the development of visually guided behavior. *Journal of Comparative and Physiological Psychology*, *56*, 872.
- Hinton, G. E., Krizhevsky, A., & Wang, S. D. (2011). Transforming auto-encoders. In *ICANN*.
- Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In *ICML*.
- Jayaraman, D., & Grauman, K. (2015). Learning image representations tied to egomotion. In *ICCV*.
- Jayaraman, D., & Grauman, K. (2016). Look-ahead before you leap: End-to-end active recognition by forecasting the effect of motion. In *ECCV*.
- Jayaraman, D., & Grauman, K. (2016). Slow and steady feature analysis: Higher order temporal coherence in video. In *CVPR*.
- Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv preprint.
- Kivinen, J. J., & Williams, C. K. (2011). Transformation equivariant Boltzmann machines. In *ICANN*.
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In *NIPS*.
- Kulkarni, T. D., Whitney, W., Kohli, P., & Tenenbaum, J. B. (2015). Deep convolutional inverse graphics network. In *NIPS*.
- LeCun, Y., Huang, F. J., & Bottou, L. (2004). Learning methods for generic object recognition with invariance to pose and lighting. In *CVPR*.
- Lenc, K., & Vedaldi, A. (2015). Understanding image representations by measuring their equivariance and equivalence. In *CVPR*.
- Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2015). End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702.
- Li, Y., Fathi, A., & Rehg, J. M. (2013). Learning to predict gaze in egocentric video. In *ICCV*.
- Lies, J. P., Häfner, R. M., & Bethge, M. (2014). Slowness and sparseness have diverging effects on complex cell learning. *PLoS Computational Biology*, *10*(3), e1003468.
- Lowe, D. (1999). Object recognition from local scale-invariant features. In *ICCV*.
- Memisevic, R. (2013). Learning to relate images. *PAMI*.
- Michalski, V., Memisevic, R., & Konda, K. (2014). Modeling deep temporal dependencies with recurrent grammar cells. In *NIPS*.
- Mobahi, H., Collobert, R., & Weston, J. (2009). Deep learning from temporal coherence in video. In *ICML*.
- Nakamura, T., & Asada, M. (1995). Motion sketch: Acquisition of visual motion guided behaviors. In *IJCAI*.
- Ranzato, M., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., & Chopra, S. (2014). Video (language) modeling: A baseline for generative models of natural videos. arXiv preprint.
- Ren, X., & Gu, C. (2010). Figure-ground segmentation improves handled object recognition in egocentric video. In *CVPR*.
- Schmidt, U., & Roth, S. (2012). Learning rotation-aware features: From invariant priors to equivariant descriptors. In *CVPR*.
- Simard, P., LeCun, Y., Denker, J., & Victorri, B. (1998). Transformation invariance in pattern recognition—tangent distance and tangent propagation. In *Neural networks: Tricks of the trade* (pp. 239–274). Springer.
- Simard, P. Y., Steinkraus, D., & Platt, J. C. (2003). Best practices for convolutional neural networks applied to visual document analysis. In *ICDAR*.
- Sohn, K., & Lee, H. (2012). Learning invariant representations with local transformations. In *ICML*.
- Tulsiani, S., Carreira, J., & Malik, J. (2015). Pose induction for novel object categories. In *ICCV*.
- Tuytelaars, T., & Mikolajczyk, K. (2008). Local invariant feature detectors: A survey. *Foundations and Trends in Computer Graphics and Vision*, *3*(3), 177–280.
- Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P. A. (2008). Extracting and composing robust features with denoising autoencoders. In *ICML*.
- Wang, X., & Gupta, A. (2015). Unsupervised learning of visual representations using videos. In *CVPR*.
- Watter, M., Springenberg, J., Boedecker, J., & Riedmiller, M. (2015). Embed to control: A locally linear latent dynamics model for control from raw images. In *NIPS*.
- Wiskott, L., & Sejnowski, T. J. (2002). Slow feature analysis: Unsupervised learning of invariances. *Neural Computation*, *14*(4), 715–770.
- Wu, Z., Song, S., Khosla, A., Tang, X., & Xiao, J. (2015). 3D ShapeNets for 2.5D object recognition and next-best-view prediction. In *CVPR*.
- Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In *CVPR*.
- Xu, C., Liu, J., & Kuipers, B. (2012). Moving object segmentation using motor signals. In *ECCV*.
- Yamada, K., Sugano, Y., Okabe, T., Sato, Y., Sugimoto, A., & Hiraki, K. (2012). Attention prediction in egocentric video using motion and visual saliency. In *PSIVT*.
- Zou, W., Zhu, S., Yu, K., & Ng, A. Y. (2012). Deep learning of invariant features via simulated fixations in video. In *NIPS*.

## Acknowledgements

This research is supported in part by ONR PECASE Award N00014-15-1-2291. We also thank Texas Advanced Computing Center for their generous support, Pulkit Agrawal for sharing models and code and for helpful discussions, Ruohan Gao for helpful discussions, and our anonymous reviewers for constructive suggestions.


## Additional information

Communicated by K. Ikeuchi.

## Appendix

### Optimization and Hyperparameter Selection

(Elaborating on the paragraph titled “Network architectures and Optimization” in Sect. 4.1.)

We use Nesterov-accelerated stochastic gradient descent as implemented in Caffe (Jia et al. 2014), starting from weights randomly initialized according to Glorot and Bengio (2010). The base learning rate and regularization \(\lambda \)s are selected by greedy cross-validation. Specifically, for each task, the optimal base learning rate (from 0.1, 0.01, 0.001, 0.0001) was identified for clsnet. Next, with this base learning rate fixed, the optimal regularizer weight (for drlim, temporal and equiv) was selected from a logarithmic grid (steps of \(10^{0.5}\)). For equiv + drlim, the drlim loss weight selected for drlim was retained, and only the equiv loss weight was cross-validated. The contrastive loss margin parameter \(\delta \) in Eq. (8) was set uniformly to 1.0 for drlim, temporal and equiv. Since no other part of these objectives (including the softmax classification loss) depends on the scale of features,^{Footnote 15} different choices of the margin \(\delta \) in these methods lead to objective functions with equivalent optima: the features are only scaled by a factor. For equiv + drlim, we set the drlim and equiv margins to 1.0 and 0.1 respectively, to reflect the fact that the equivariance map \(M_g\) of Eq. (7), applied to the representation \(\mathbf {z_{\varvec{\theta }}}(g\varvec{x})\) of the transformed image, must bring it closer to the original image representation \(\mathbf {z_{\varvec{\theta }}}(\varvec{x})\) than it was before, i.e., \(\Vert M_g \mathbf {z_{\varvec{\theta }}}(g\varvec{x})- \mathbf {z_{\varvec{\theta }}}(\varvec{x})\Vert _2 < \Vert \mathbf {z_{\varvec{\theta }}}(g\varvec{x}) - \mathbf {z_{\varvec{\theta }}}(\varvec{x})\Vert _2\).
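The greedy search just described can be sketched as follows. Here `validation_accuracy` is a hypothetical stand-in for a full training run that reports held-out accuracy for a given hyperparameter pair, and the endpoints of the \(\lambda \) grid are assumptions; only the candidate learning rates and the \(10^{0.5}\) step come from the text.

```python
import numpy as np

def validation_accuracy(lr, lam):
    # Toy surrogate for "train with (lr, lambda), score on validation data";
    # in reality each evaluation is a full training run.
    return -abs(np.log10(lr) + 3.0) - abs(np.log10(lam) - 0.5)

learning_rates = [0.1, 0.01, 0.001, 0.0001]
# Logarithmic grid in steps of 10^0.5; the range is illustrative.
lambdas = [10 ** (0.5 * k) for k in range(-6, 7)]

# Step 1: pick the base learning rate for the classification-only
# model (clsnet), with no regularizer in play.
best_lr = max(learning_rates, key=lambda lr: validation_accuracy(lr, lam=1.0))

# Step 2: with the learning rate fixed, pick the regularizer weight
# (done separately for each of drlim, temporal, and equiv).
best_lam = max(lambdas, key=lambda lam: validation_accuracy(best_lr, lam))

print(best_lr, best_lam)  # 0.001, then 10**0.5 ≈ 3.162
```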

In addition, to allow fast and thorough experimentation, we set the number of training epochs for each method on each dataset based on a few initial runs, which assessed how long training typically ran before the classification softmax loss on validation data began to rise, i.e., before overfitting set in. All subsequent runs for that method on that data were run to roughly match (to the nearest 5000) the number of epochs identified above; in most cases this number was on the order of 50,000. Batch sizes (for both the classification stack and the Siamese networks) were set to 16 for NORB-NORB and KITTI-KITTI (we found no major difference from 4 or 64), and to 128 (selected from 4, 16, 64, 128) for KITTI-SUN. For KITTI-SUN, we found it necessary to increase the batch size so that, despite the large number of classes (397), meaningful classification loss gradients were computed in each SGD iteration and the training loss began to fall.

On a single Tesla K-40 GPU machine, NORB-NORB training tasks took \(\approx \)15 min, KITTI-KITTI tasks took \(\approx \)30 min, and KITTI-SUN tasks took \(\approx \)2 h. The purely unsupervised training runs with large “KITTI-227” images took up to 15 h per run.

### The Weakness of the Slow Feature Analysis Prior

We now present evidence supporting our claim in the paper that the principle of slowness, which penalizes feature variation within small temporal windows, provides only a weak prior. In every stochastic gradient descent (SGD) training iteration for the drlim and temporal networks, we also computed a “slowness” measure that is independent of feature scaling (unlike the drlim and temporal losses of Eq. (9) themselves), to better understand the shortcomings of these methods.

Given training pairs \((\varvec{x}_i,\varvec{x}_j)\) annotated as neighbors or non-neighbors by \(n_{ij}=\mathbbm {1}(|t_i-t_j|\le T)\) (cf. Eq. (9) in the paper), we computed pairwise distances \(\varDelta _{ij}=d(\mathbf {z}_{\varvec{\theta }(s)}(\varvec{x}_i), \mathbf {z}_{\varvec{\theta }(s)}(\varvec{x}_j))\), where \(\varvec{\theta }(s)\) is the parameter vector at SGD training iteration *s*, and *d*(., .) is set to the \(\ell _2\) distance for drlim and to the \(\ell _1\) distance for temporal (cf. Sect. 4).

We then measured how well these pairwise distances \(\varDelta _{ij}\) predict the temporal neighborhood annotation \(n_{ij}\), by measuring the Area Under Receiver Operating Characteristic (AUROC) when varying a threshold on \(\varDelta _{ij}\).

These “slowness AUROCs” are plotted as a function of training iteration number in Fig. 12, for drlim and coherence networks trained on the KITTI-SUN task. Compared to the chance AUROC value of 0.5, these slowness AUROCs tend to be near 0.9 even before optimization begins, and reach peak values very close to 1.0 on both training and testing data within about 4000 iterations (batch size 128). This points to a possible weakness in these methods: even with parameters (temporal neighborhood size, regularization \(\lambda \)) cross-validated for recognition, the slowness prior is too weak to regularize feature learning effectively, since strengthening it causes loss of discriminative information. In contrast, our method requires *systematic* feature space responses to egomotions, and so offers a stronger prior.
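The AUROC measurement can be reproduced on toy data (NumPy; the random-walk embedding and the deterministic pair construction are illustrative stand-ins for the actual network features and KITTI-SUN training pairs, and are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for z_theta(s): a 16-d feature that drifts slowly across
# the 100 frames of a video, so temporally close frames are close in
# feature space.
z = np.cumsum(rng.normal(size=(100, 16)), axis=0)

# Frame pairs: (t, t+1) are temporal neighbors (n_ij = 1),
# (t, t+50) are non-neighbors (n_ij = 0).
i = np.arange(50)
pairs = np.concatenate([np.stack([i, i + 1]), np.stack([i, i + 50])], axis=1)
labels = np.concatenate([np.ones(50), np.zeros(50)])

# Pairwise l2 distances, as used for drlim
dist = np.linalg.norm(z[pairs[0]] - z[pairs[1]], axis=1)

def auroc(scores, labels):
    # Probability that a random positive pair outscores a random negative
    # pair (ties count 1/2) -- equivalent to sweeping a threshold on the
    # scores and measuring the area under the ROC curve.
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Small distance should predict "neighbor", so score with -dist;
# for this toy data the result is close to 1.0.
print(auroc(-dist, labels))
```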

## About this article

### Cite this article

Jayaraman, D., Grauman, K. Learning Image Representations Tied to Egomotion from Unlabeled Video. *Int J Comput Vis*, *125*, 136–161 (2017). https://doi.org/10.1007/s11263-017-1001-2


### Keywords

- Feature Space
- Convolutional Neural Network
- Feature Learning
- Temporal Coherence
- Scene Recognition