How to Train Your Differentiable Filter

In many robotic applications, it is crucial to maintain a belief about the state of a system, which serves as input for planning and decision making and provides feedback during task execution. Bayesian Filtering algorithms address this state estimation problem, but they require models of process dynamics and sensory observations and the respective noise characteristics of these models. Recently, multiple works have demonstrated that these models can be learned by end-to-end training through differentiable versions of recursive filtering algorithms. In this work, we investigate the advantages of differentiable filters (DFs) over both unstructured learning approaches and manually-tuned filtering algorithms, and provide practical guidance to researchers interested in applying such differentiable filters. For this, we implement DFs with four different underlying filtering algorithms and compare them in extensive experiments. Specifically, we (i) evaluate different implementation choices and training approaches, (ii) investigate how well complex models of uncertainty can be learned in DFs, (iii) evaluate the effect of end-to-end training through DFs and (iv) compare the DFs among each other and to unstructured LSTM models.


I. INTRODUCTION
In many robotic applications, it is crucial to maintain a belief about the state of the system over time, like tracking the location of a mobile robot or the pose of a manipulated object. These state estimates serve as input for planning and decision making and provide feedback during task execution. In addition to tracking the system state, it can also be desirable to estimate the uncertainty associated with the state predictions. This information can be used to detect failures and enables risk-aware planning, where the robot takes more cautious actions when its confidence in the estimated state is low [1,2].
Recursive Bayesian filters are a class of algorithms that combine perception and prediction for probabilistic state estimation in a principled way. To do so, they require an observation model that relates the estimated state to the sensory observations and a process model that predicts how the state develops over time. Both have associated noise models that reflect the stochasticity of the underlying system and determine how much trust the filter places in perception and prediction.
Formulating good observation and process models for the filters can, however, be difficult in many scenarios, especially when the sensory observations are high-dimensional and complex, like camera images. Over the last years, deep learning has become the method of choice for processing such data. While (recurrent) neural networks can be trained to address the full state estimation problem directly, recent work [3,4,5,6] showed that it is also possible to include data-driven models in Bayesian filters and train them end-to-end through the filtering algorithm. For Histogram filters [3], Kalman filters [4] and Particle filters [5,6], the respective authors showed that such differentiable filters (DF) systematically outperform unstructured neural networks like LSTMs [7]. In addition, the end-to-end training of the models also improved the filtering performance compared to using observation and process models that had been trained separately.

1 Max Planck Institute for Intelligent Systems, <akloss, gmartius>@tue.mpg.de. 2 Stanford University, bohg@stanford.edu. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Alina Kloss.
A further interesting aspect of differentiable filters is that they allow for learning sophisticated models of the observation and process noise. This is useful because finding appropriate values for the noise models is often difficult, and despite much research on identification methods (e.g. [8,9]), they are often tuned manually in practice. To reduce the tedious tuning effort, the noise is then typically assumed to be uncorrelated Gaussian noise with zero mean and constant covariance. Many real systems are, however, better described by heteroscedastic noise models, where the level of uncertainty depends on the state of the system and/or possible control inputs. Taking heteroscedasticity of the dynamics into account has been demonstrated to improve filtering performance in many robotic tasks [10,11]. [4] also show that learning heteroscedastic observation noise helps a Kalman filter deal with occlusions during object tracking.
In this paper, we perform a thorough evaluation of differentiable filters. Our main goals are to highlight the advantages of DFs over both unstructured learning approaches and manually-tuned filtering algorithms, and to provide guidance to practitioners interested in applying differentiable filtering to their problems.
To this end, we review and implement existing work on differentiable Kalman and Particle filters and introduce two novel variants of differentiable Unscented Kalman filters. Our implementation for TensorFlow [12] is publicly available 1 . In extensive experiments on three different tasks, we compare the DFs and evaluate different design choices for implementation and training, including loss functions and training sequence length. We also investigate how well the different filters can learn complex heteroscedastic and correlated noise models, evaluate how end-to-end training through the DFs influences the learned models and compare the DFs to unstructured LSTM models.

II. RELATED WORK
A. Combining Learning and Algorithms
Integrating algorithmic structure into learning methods has been studied for many robotic problems, including state estimation [4,3,5,6,13], planning [14,15,16,17,18] and control [19,20,21,22,23]. Most notably, [24] combine multiple differentiable algorithms into an end-to-end trainable "Differentiable Algorithm Network" to address the complete task of navigating to a goal in a previously unseen environment using visual observations. Here, we focus on addressing the state estimation problem with differentiable implementations of Bayesian filters.

B. Differentiable Bayesian Filters
There have been few works on differentiable filters so far. [4] propose the BackpropKF, a differentiable implementation of the (extended) Kalman filter. [3] present a differentiable Histogram filter for discrete localization tasks in one or two dimensions, and [5] and [6] both implement differentiable Particle filters for localization and tracking of a mobile robot. In the following, we focus our discussion on differentiable Kalman and Particle filters, since Histogram filters as used by [3] are usually not feasible in practice due to the need to discretize the complete state space.
a) Observation Model and Noise: All three works have in common that the raw observations are processed by a learned neural network that can be trained end-to-end through the filter. In [4], the network outputs a low-dimensional representation of the observations together with input-dependent observation noise (see Sec. IV-B), while in [5,6], a neural network learns to predict the likelihood of the observations under each particle given an image and (in [6]) a map of the environment.
As a result, all three works use heteroscedastic observation noise, but only [4] evaluate this choice: They show that conditioning the observation noise on the raw image observations drastically improves filter performance when the tracked object can be occluded.
b) Process Model and Noise: For predicting the next state, all three works use a given analytical process model. While [4] and [6] also assume known process noise, [5] train a network to predict it conditioned on the actions. The effect of learning action-dependent process noise is, however, not evaluated.
c) Effect of End-to-End Learning: [5] compare the results of an end-to-end trained filter with one where the observation model and process noise were trained separately. The end-to-end trained variant performs better, presumably because it learns to overestimate the process noise. Possible differences between the learned observation models are not discussed. The best filter performance was reached by first pretraining the models individually and then fine-tuning end-to-end through the filter.
d) Comparison to Unstructured Models: All works compare their differentiable filters to LSTM models trained for the same task and find that including the structural priors of the filtering algorithm and the known process models improves performance. [5] also evaluate a Particle filter with a learned process model in one experiment, which performs worse than the filter with an analytical process model but still beats the LSTM.
In contrast to the existing work on differentiable filtering, the main purpose of this paper is not to present a new method for solving a robotic task. Instead, we present a thorough evaluation of differentiable filtering and of implementation choices made by the aforementioned seminal works. We also implement two novel differentiable filters based on variants of the Unscented Kalman filter and compare the differentiable filters with different underlying Bayesian filtering algorithms in a controlled way.

C. Variational Inference
A second line of research closely related to differentiable filters is variational inference in temporal state space models [25,26,27,28,29]. For a recent review of this work, see [30]. In contrast to DFs, the focus of this research lies more on finding generative models that explain the observed data sequences and are able to generate new sequences. The representation of the underlying state of the system is often not assumed to be known. But even though the goals are different, recent results in this field show that structuring the variational models similarly to Bayesian filters improves their performance [26,28,31,32,33].

III. BAYESIAN FILTERING FOR STATE ESTIMATION
Filtering refers to the problem of estimating the latent state x of a stochastic dynamic system at time step t given an initial belief bel(x_0) = p(x_0), a sequence of observations z_{1...t} and actions u_{0...t-1}. Formally, we seek the posterior distribution

bel(x_t) = p(x_t | z_{1...t}, u_{0...t-1})

Bayesian filters make the Markov assumption, i.e. that the distributions of the future states and observations are conditionally independent from the history of past states and observations given the current state. This assumption makes it possible to compute the belief at time t recursively as

bel*(x_t) = ∫ p(x_t | x_{t-1}, u_{t-1}) bel(x_{t-1}) dx_{t-1}
bel(x_t) = η p(z_t | x_t) bel*(x_t)

where η is a normalization factor. Computing bel*(x_t) is referred to as the prediction step of Bayesian filters, while updating the belief with p(z_t | x_t) is called the (observation) update step.
For the prediction step, the dynamics of the system is modeled by the process model f that describes how the state changes over time. The observation update step uses an observation model h that generates observations given the current state:

x_{t+1} = f(x_t, u_t) + q_t
z_t = h(x_t) + r_t

The random variables q and r are the process and observation noise and capture the stochasticity of the system.
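The recursion above has a closed form in the scalar linear-Gaussian case, which makes the interplay of process model, observation model and the two noise terms concrete. The following is a minimal NumPy sketch, not our TensorFlow implementation; the constant-state setup and all parameter values are hypothetical.

```python
import numpy as np

def kf_step(mu, sigma2, u, z, a=1.0, b=1.0, h=1.0, q=0.1, r=0.5):
    """One Bayes-filter recursion for a scalar linear-Gaussian system.

    Prediction propagates the belief through the process model;
    the update corrects it with the observation likelihood p(z_t | x_t)
    (the normalizer eta is implicit in the Kalman gain)."""
    # prediction step
    mu_bar = a * mu + b * u
    sigma2_bar = a * sigma2 * a + q
    # observation update step
    k = sigma2_bar * h / (h * sigma2_bar * h + r)  # Kalman gain
    mu_new = mu_bar + k * (z - h * mu_bar)
    sigma2_new = (1.0 - k * h) * sigma2_bar
    return mu_new, sigma2_new

# track a constant state x = 2.0 from noisy observations
rng = np.random.default_rng(0)
mu, sigma2 = 0.0, 1.0
for _ in range(50):
    z = 2.0 + rng.normal(0.0, np.sqrt(0.5))
    mu, sigma2 = kf_step(mu, sigma2, u=0.0, z=z, q=0.0)
print(round(mu, 2), round(sigma2, 4))
```

With a static state (q = 0), the estimate converges towards the sample mean of the observations while the posterior variance shrinks; the ratio of q and r is exactly the trust trade-off between prediction and perception described above.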
In this paper, we investigate differentiable versions of four different nonlinear Bayesian filtering algorithms: The Extended Kalman Filter (EKF), the Unscented Kalman Filter (UKF), a sampling-based variant of the UKF that we call Monte Carlo Unscented Kalman Filter (MCUKF) and the Particle Filter (PF). We briefly review these algorithms in Appendix I.

IV. IMPLEMENTATION
In this section, we describe how we embed model-learning into differentiable versions of the aforementioned nonlinear filtering algorithms. These differentiable versions will be denoted by dEKF, dUKF etc. in the following.

A. Differentiable Filters
We implement the filtering algorithms as recurrent neural network layers in TensorFlow. For UKF and MCUKF, this is straight-forward, since all necessary operations are differentiable and available in TensorFlow.
In contrast, the dEKF requires the Jacobian F of the process model. TensorFlow provides a method for computing Jacobians with or without vectorization: the vectorized variant is fast but has a high memory demand, while the non-vectorized variant can become very slow for large batch sizes. We therefore recommend deriving the Jacobians manually where applicable.
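To illustrate what a manually derived Jacobian looks like, here is a hedged NumPy sketch for a simple constant-velocity process model (a hypothetical example, not one of the models used in our experiments), verified against finite differences:

```python
import numpy as np

def process_model(x, dt=1.0):
    # constant-velocity model: state is [px, py, vx, vy]
    p, v = x[:2], x[2:]
    return np.concatenate([p + dt * v, v])

def process_jacobian(x, dt=1.0):
    # hand-derived Jacobian F = df/dx: cheap compared to autodiff Jacobians
    F = np.eye(4)
    F[0, 2] = dt
    F[1, 3] = dt
    return F

# EKF covariance prediction uses the Jacobian: P' = F P F^T + Q
x = np.array([0.0, 0.0, 1.0, -1.0])
P = np.eye(4)
Q = 0.1 * np.eye(4)
F = process_jacobian(x)
P_pred = F @ P @ F.T + Q

# sanity check the hand-derived Jacobian against finite differences
eps = 1e-6
F_num = np.stack([(process_model(x + eps * e) - process_model(x)) / eps
                  for e in np.eye(4)], axis=1)
print(np.allclose(F, F_num, atol=1e-5))
```

A finite-difference check like this is a cheap safeguard when replacing an autodiff Jacobian with a manual derivation.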
1) dPF: The Particle filter is the only filter we investigate that is not fully differentiable: In the resampling step, a new set of particles with uniform weights is drawn (with replacement) from the old set according to the old particle weights. While the drawn particles can propagate gradients to their ancestors, gradient propagation to other old particles or to the weights of the old particle set is disrupted [5,6,34]. If we place the resampling step at the beginning of the per-timestep computations, this only affects the gradient propagation through time, i.e. from one timestep t + 1 to its predecessor t. At time t, both particles and weights still receive gradient information about the corresponding loss at this timestep. We therefore hypothesize that the missing gradients through time are not problematic as long as we provide a loss at every timestep.
As an alternative to simply ignoring the disrupted gradients, we can also apply the resampling step less frequently or use soft resampling as proposed by [6]. We evaluate these options in Sec. VI-B.5.
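The soft-resampling idea can be sketched in a few lines of NumPy. This follows the convention used in this paper, where higher α_re puts more weight on the uniform distribution; the function name and test setup are hypothetical.

```python
import numpy as np

def soft_resample(particles, weights, alpha_re=0.05, rng=None):
    """Soft resampling: draw from a mixture of the particle weights
    and a uniform distribution, then importance-correct the new
    weights so that gradient flow to the old weights is preserved."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(weights)
    # proposal: (1 - alpha_re) * weights + alpha_re * uniform
    q = (1.0 - alpha_re) * weights + alpha_re / n
    idx = rng.choice(n, size=n, p=q)
    new_particles = particles[idx]
    new_weights = weights[idx] / q[idx]  # importance correction
    new_weights /= new_weights.sum()
    return new_particles, new_weights

rng = np.random.default_rng(1)
particles = rng.normal(size=(100, 4))
weights = rng.random(100)
weights /= weights.sum()
p2, w2 = soft_resample(particles, weights, alpha_re=0.05, rng=rng)
print(p2.shape, round(w2.sum(), 6))
```

With α_re = 0 this reduces to standard resampling (uniform corrected weights and no gradient flow); with α_re = 1 no particle is preferentially discarded, which is why large values undermine the purpose of resampling.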
In addition, we investigate two alternative implementation choices for the dPF: The likelihood used for updating the particle weights in the observation update step can be implemented either with an analytical Gaussian likelihood function or with a trained neural network as in [5] and [6]. The learned observation likelihood is potentially more expressive than the analytical solution and can be advantageous for problems where formulating the observation and sensor model is not as straight-forward as in our experiments. A potential drawback is that in contrast to the analytical solution, no explicit noise model or sensor network is learned. We compare these two options in Sec. VI-B.4.

B. Observation Model
In Bayesian filtering, the observation model h(·) is a generative model that predicts observations from the state z t = h(x t ). In practice, it is often hard to find such models that directly predict the potentially high-dimensional raw sensory signals without making strong assumptions.
We therefore use the method first proposed by [4] and train a discriminative neural network n s with parameters w s to preprocess the raw sensory data D and create a more compact representation of the observations z = n s (D, w s ). This network can be seen as a virtual sensor, and we thus call it sensor network. In addition to z t , the sensor network can also predict the heteroscedastic observation noise covariance matrix R t (see Sec. IV-D) for the current input D t .
In our experiments, z contains a subset of the state vector x. The actual observation model h(x) thus reduces to a simple linear selection matrix of the observable components, which we provide to the DFs.

C. Process Model
Depending on the user's knowledge about the system, the process model f(·) for the prediction step can be implemented using a known analytical model or a neural network n_p(·) with weights w_p. When using neural networks, we train n_p(·) to output the change from the last state, n_p(x_t, u_t, w_p) = ∆x_t, such that x_{t+1} = x_t + ∆x_t. This form ensures stable gradients between timesteps (since ∂x_{t+1}/∂x_t = I + ∂n_p/∂x_t) and provides a reasonable initialization of the process model close to the identity.
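The gradient-stability argument can be checked numerically. In this sketch a small random matrix W stands in for the Jacobian of a hypothetical learned network n_p; the gradient through k timesteps is the product of the per-step Jacobians, which stays well-conditioned for the residual form but vanishes for a network that predicts the next state directly.

```python
import numpy as np

# stand-in for the Jacobian of a freshly initialized process network
# (hypothetical: small random weights, as is typical at initialization)
W = 0.02 * np.random.default_rng(0).normal(size=(4, 4))

# residual form x_{t+1} = x_t + n_p(x_t): per-step Jacobian is I + W
J_residual = np.eye(4) + W
# direct form x_{t+1} = n_p(x_t): per-step Jacobian is just W
J_direct = W

# backpropagation through k timesteps multiplies the per-step Jacobians
k = 10
g_residual = np.linalg.norm(np.linalg.matrix_power(J_residual, k))
g_direct = np.linalg.norm(np.linalg.matrix_power(J_direct, k))
print(g_residual > 0.5, g_direct < 1e-6)
```

The residual parameterization therefore behaves like a skip connection through time: gradients neither vanish nor explode early in training.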

D. Noise Models
For learning the observation and process noise, we consider two different conditions: constant and heteroscedastic. In both cases, we assume that the process and observation noise at time t can be described by zero-mean Gaussian distributions with covariance matrices Q t and R t .
A common assumption in state-space modeling is that Q_t and R_t are diagonal matrices, but we can also use full covariance matrices to model correlated noise. In this case, we follow [4] and train the noise models to output upper-triangular matrices L_t, such that e.g. Q_t = L_t L_t^T. This form ensures that the resulting matrices are positive definite.
For constant noise, the filters directly learn the diagonal or triangular elements of Q and R. In the heteroscedastic case, Q t is predicted from the current state x t and (if available) the control input u t by a neural network n q (x t , u t , w q ) with weights w q . In dUKF, dMCUKF and dPF, n q (·) outputs separate Q i for each sigma point/particle and Q t is computed as their weighted mean. The heteroscedastic observation noise covariance matrix R t is an additional output of the sensor model n s (D t , w s ).
We initialize the diagonals of Q_t and R_t close to given target values by adding a trainable bias variable to the output of the noise models. To prevent numerical instabilities, we also add a small fixed bias to the diagonals as a lower bound on the predicted noise.
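The parameterization described above can be sketched as follows. This is a minimal NumPy version of the idea (the function name and the exact bias handling are assumptions, not our TensorFlow code): unconstrained parameters fill a triangular factor L, Q = L L^T is positive semi-definite by construction, and a small fixed diagonal bias bounds the noise away from zero.

```python
import numpy as np

def build_cov(params, dim, min_diag=1e-3, init_bias=None):
    """Map dim*(dim+1)/2 unconstrained parameters to a positive-definite
    covariance via a triangular factor, Q = L L^T + min_diag * I."""
    L = np.zeros((dim, dim))
    L[np.triu_indices(dim)] = params
    if init_bias is not None:
        # trainable bias that initializes the diagonal near a target value
        L[np.diag_indices(dim)] += init_bias
    Q = L @ L.T + min_diag * np.eye(dim)  # fixed lower bound on the noise
    return Q

params = np.random.default_rng(2).normal(size=10)  # 4*(4+1)/2 entries
Q = build_cov(params, dim=4, init_bias=1.0)
eigvals = np.linalg.eigvalsh(Q)
print(bool(np.all(eigvals > 0)))
```

For the diagonal-only condition, the same recipe applies with L restricted to its diagonal, which yields uncorrelated noise.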

E. Loss Function
For training the filters, we always assume that we have access to the ground truth trajectory of the states x̄_{t=0...T}. In our experiments, we test the two different loss functions used in related work. The first, used by [6], is simply the mean squared error (MSE) between the mean of the belief and the true state at each timestep:

L_MSE = 1/T · sum_t (µ_t − x̄_t)^T (µ_t − x̄_t)    (1)

For the dPF, we compute µ_t as the weighted mean of the particles. The second loss function, used by [4] and [5], is the negative log likelihood (NLL) of the true state under the predicted distribution of the belief. In dEKF, dUKF and dMCUKF, the belief is represented by a Gaussian distribution with mean µ_t and covariance Σ_t, and the negative log likelihood is computed as

L_NLL = 1/(2T) · sum_t [ ln |Σ_t| + (µ_t − x̄_t)^T Σ_t^{−1} (µ_t − x̄_t) ]    (2)

The dPF represents its belief using the particles χ_i ∈ X and their weights π_i. We consider two alternative ways of calculating the NLL for training the dPF: The first is to represent the belief by fitting a single Gaussian to the particles, with µ = sum_i π_i χ_i and Σ = sum_i π_i (χ_i − µ)(χ_i − µ)^T, and then apply Eq. 2. We refer to this variant as dPF-G.
However, this is only a good representation of the belief if the distribution of the particles is unimodal. To better reflect the potential multimodality of the particle distribution, the belief can also be represented with a Gaussian Mixture Model (GMM) as proposed by [5]. Every particle contributes a separate Gaussian N_i(χ_i, Σ) to the GMM and the mixture weights are the particle weights. The drawback of this approach is that the fixed covariance Σ of the individual distributions is an additional tuning parameter for the filter. We call this version dPF-M and calculate the negative log likelihood with

L_NLL = −1/T · sum_t ln ( sum_i π_i N(x̄_t; χ_i, Σ) )    (3)
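Both NLL variants can be sketched compactly in NumPy. The function names and the toy particle set are hypothetical; the mixture likelihood uses a log-sum-exp for numerical stability, which matters in practice when particle likelihoods are tiny.

```python
import numpy as np

def gaussian_nll(x, mu, cov):
    # Eq.-2-style negative log likelihood of x under N(mu, cov),
    # including the constant term d/2 * ln(2*pi)
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (logdet + diff @ np.linalg.solve(cov, diff)
                  + d * np.log(2 * np.pi))

def gmm_nll(x, particles, weights, sigma2=1.0):
    # Eq.-3-style loss: each particle contributes N(chi_i, sigma2 * I)
    logs = np.array([-gaussian_nll(x, chi, sigma2 * np.eye(len(x)))
                     for chi in particles])
    m = logs.max()  # log-sum-exp over the mixture components
    return -(m + np.log(np.sum(weights * np.exp(logs - m))))

rng = np.random.default_rng(3)
x_true = np.zeros(2)
particles = rng.normal(scale=0.5, size=(100, 2))
weights = np.full(100, 0.01)  # uniform particle weights

# dPF-G: fit a single Gaussian to the particles, then apply Eq. 2
single = gaussian_nll(x_true, particles.mean(0), np.cov(particles.T))
# dPF-M: evaluate the particle mixture directly (Eq. 3)
mixture = gmm_nll(x_true, particles, weights)
print(np.isfinite(single), np.isfinite(mixture))
```

For a unimodal particle cloud the two losses behave similarly; they diverge exactly when the particles form several clusters, which is the case the GMM representation is meant to capture.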

V. EXPERIMENTAL SETUP
In the following, we will evaluate the DFs on three different filtering problems. We start with a simple simulation setting that gives us full control over parameters of the system such as the true process noise (Sec. VI). In Sections VII and VIII, we then study the performance of the DFs on two real-robot tasks: The first is the KITTI Visual Odometry problem, where the filters are used to track the position and heading of a moving car given only RGB images. The second is planar pushing, where the filters track the pose of an object while a robot performs a series of pushes.
Unless stated otherwise, we will train the DFs end-to-end for 15 epochs using the Adam optimizer [35] and select the model state at the training step with the best validation loss for evaluation. We also evaluate different learning rates for all DFs. During training, the initial state is perturbed with noise sampled from a Normal distribution N_init(0, Σ_init). For testing, we evaluate all DFs with the true initial state as well as with a few fixed perturbations (sampled from N_init) and average the results.

Fig. 1: Two sequential observations from our simulated tracking task. The filters need to track the red disc, which can be occluded by the other discs or leave the image temporarily.
More detailed information about the experimental conditions as well as extended results can be found in Appendix II-A-IV.

VI. SIMULATED DISC TRACKING
We first evaluate the DFs in a simulated environment similar to the one in [4]: the task is to track a red disc moving among varying numbers of distractor discs, as shown in Figure 1. The state consists of the position p and linear velocity v of the red disc.
The dynamics model that we use for generating the training data is

p_{t+1} = p_t + v_t + q_p
v_{t+1} = v_t − f_p p_t − f_d sgn(v_t) v_t²  + q_v

The velocity update contains a force that pulls the discs towards the origin (f_p = 0.05) and a drag force that prevents too high velocities (f_d = 0.0075). q represents the Gaussian process noise and sgn(x) returns the sign of x, or 0 if x = 0. The sensor network receives the current image at each step, from which it can estimate the position but not the velocity of the target. As we do not model collisions, the red disc can be occluded by the distractors or leave the image temporarily.
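A data-generation loop for this system can be sketched as follows. This is an illustrative NumPy version under the discretization written above (the exact update equations and noise levels are assumptions chosen to match the described pull and drag forces):

```python
import numpy as np

# pull toward the origin and quadratic drag, as described in the text
f_p, f_d = 0.05, 0.0075
rng = np.random.default_rng(4)

def step(p, v, sigma_qp=0.1, sigma_qv=2.0):
    """One simulation step for the disc state [position p, velocity v]."""
    p_new = p + v + rng.normal(0.0, sigma_qp, size=2)
    v_new = (v - f_p * p - f_d * np.sign(v) * v ** 2
             + rng.normal(0.0, sigma_qv, size=2))
    return p_new, v_new

# roll out one 50-step training sequence
p, v = np.zeros(2), np.array([5.0, -5.0])
traj = []
for _ in range(50):
    p, v = step(p, v)
    traj.append(p.copy())
traj = np.array(traj)
print(traj.shape)
```

The quadratic drag keeps the velocities bounded even under the fairly large velocity noise, which is what makes 50-step sequences well-behaved.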

A. Data
We create multiple datasets with varying numbers of distractors, different levels of constant process noise for the disc position and constant or heteroscedastic process noise for the disc velocity. All datasets contain 2400 sequences for training, 300 validation sequences and 303 sequences for testing. The sequences have 50 steps and the colors and sizes of the distractors are drawn randomly for each sequence.

B. Filter Implementation and Parameters
We first evaluated different design choices and filterspecific parameters for the DFs to find settings that perform well and increase the stability of the filters during training. For detailed information about the experiments and results, please refer to Appendix II-B.
1) dUKF: The dUKF has three filter-specific scaling parameters, α, κ and β. α and κ determine how far from the mean of the belief the sigma points are placed and how the mean is weighted in comparison to the other sigma points. β only affects the weight of the central sigma point when computing the covariance of the transformed distribution.
We evaluated different parameter settings but found no significant differences between them. In all following experiments, we use α = 1, κ = 0.5 and β = 0. In general, we recommend values for which λ = α²(κ + n) − n is a small positive number, so that the sigma points are not spread out too far and the central sigma point is not weighted negatively (which happens for negative λ). See Appendix I-C for a more detailed explanation.
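The role of λ becomes clear when writing out the sigma-point construction. The sketch below uses the standard scaled unscented transform with the recommended settings (the Merwe-style weight formulas are an assumption; see Appendix I-C for the exact variant we use); the weighted points reproduce the mean and covariance of the belief exactly.

```python
import numpy as np

def sigma_points(mu, cov, alpha=1.0, kappa=0.5, beta=0.0):
    """Generate 2n+1 sigma points and their mean/covariance weights."""
    n = len(mu)
    lam = alpha ** 2 * (kappa + n) - n   # small positive for good settings
    S = np.linalg.cholesky((n + lam) * cov)
    # central point plus symmetric points along the covariance columns
    points = np.vstack([mu, mu + S.T, mu - S.T])
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    wm[0] = lam / (n + lam)              # negative if lam < 0
    wc = wm.copy()
    wc[0] += 1.0 - alpha ** 2 + beta     # central weight for the covariance
    return points, wm, wc

mu, cov = np.zeros(4), np.eye(4)
pts, wm, wc = sigma_points(mu, cov)
mean = wm @ pts
cov_rec = sum(w * np.outer(p - mean, p - mean) for w, p in zip(wc, pts))
print(np.allclose(mean, mu), np.allclose(cov_rec, cov))
```

With α = 1, κ = 0.5 and n = 4 this gives λ = 0.5: the points stay close to the mean and all weights remain positive, which is the regime we recommend.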
2) dMCUKF: In contrast to the dUKF, the dMCUKF simply samples pseudo sigma points from the current belief. Its only parameter thus is the number N of sampled points during training and testing.
We trained the dMCUKF with N ∈ {5, 10, 50, 100, 500} and evaluated with 500 pseudo sigma points. The results show that as few as ten sigma points are enough for training the dMCUKF relatively successfully. The best results are obtained with 100 sigma points and using more does not reliably increase the performance.
In the following, we use 100 points for training and 500 for testing. More complex problems with higher-dimensional states could, however, require more sigma points.
3) dPF: Belief Representation: When training the dPF on L NLL , we have to choose how to represent the belief of the filter for computing the likelihood (see Sec. IV-E). We investigate using a single Gaussian (dPF-G) or a Gaussian Mixture Model (dPF-M). For the dPF-M, the covariance Σ of the single Gaussians in the Mixture Model is an additional parameter that has to be tuned.
As our test scenario does not require tracking multiple hypotheses, the representation by a single Gaussian in dPF-G should be accurate for this task. Nonetheless, we find that the dPF-G performs much worse than the dPF-M. This could either mean that Eq. 3 facilitates training or that approximating the belief with a single Gaussian removes useful information even when the task does not obviously require tracking multiple hypotheses. Interestingly, when using a learned observation update, this effect is not noticeable, which suggests that the first hypothesis is correct. In the following, we only report results for the dPF-M. Results for dPF-G can be found in the Appendix.
For the dPF-M, Σ = 0.25 I_4 (I_4 denotes an identity matrix with 4 rows and columns) resulted in the best tracking errors, but the best NLL was achieved with Σ = I_4. We thus use Σ = I_4 for the dPF-M in all following experiments. It is, however, possible that different tasks could require different settings.
4) dPF: Observation Update: As mentioned before, the likelihood for the observation update step of the dPF can be implemented with an analytical Gaussian likelihood function (dPF-(G/M)) or with a neural network (dPF-(G/M)-lrn).
Our experiments showed that using a learned likelihood function for updating the particle weights can improve both tracking error and NLL of the dPF significantly. We attribute this mainly to the fact that the learned update relaxes some of the assumptions encoded in the particle filter: With the analytical version, we restrict the filter to use additive Gaussian noise that is either constant or depends only on the raw sensory observations. The learned update, in contrast, enforces no functional form of the noise model. In addition, the noise can depend not only on the raw sensory data, but also on the observable components of the particle states. This means that the learned observation update is potentially much more expressive than the analytical one, which pays off when the Gaussian assumption made by the other filtering algorithms does not hold.
While learning the observation update improves the performance of the dPF, we will still use the analytical variant in most of the following evaluations. The main reason for this is that the analytical observation update has explicit models for the sensor network and observation noise. This facilitates comparing between the dPF and the other DF variants and gives us control over the form of the learned observation noise.
5) dPF: Resampling: The resampling step of the particle filter discards particles with low weights and prevents particle depletion. It may, however, be disadvantageous during training since it is not fully differentiable. [6] proposed soft resampling, where the resampling distribution is traded off with a uniform distribution to enable gradient flow between the weights of the old and new particles. This trade-off is controlled by a parameter α re ∈ [0, 1]. The higher α re , the more weight is put on the uniform distribution. An alternative to soft resampling is to not resample at every timestep.
We tested the dPF-M with different values of α re and when resampling every 1, 2, 5 or 10 steps and found that resampling frequently generally improves the filter performance. Soft resampling also did not have much of a positive effect in our experiments, presumably because higher values of α re decrease the effectiveness of the resampling step. In the following, we use α re = 0.05 and resample at every timestep.
6) dPF: Number of Particles: Finally, the user also has to decide how many particles to use during training and testing. As for the dMCUKF, we trained the dPF-M with N ∈ {5, 10, 50, 100, 500}. The results were very similar to dMCUKF and we also use 100 particles during training and 500 particles for testing.

C. Loss Function
In this experiment we compare the different loss functions introduced in Sec. IV-E, as well as a combination of the two, L_mix = 0.5(L_MSE + L_NLL). Our hypothesis is that L_NLL is better suited for learning noise models, since it requires predicting the uncertainty about the state, while L_MSE only optimizes the tracking performance.
a) Experiment: We use a dataset with 15 distractors and constant process noise (σ_qp = 0.1, σ_qv = 2). The filters learn the sensor and process model as well as heteroscedastic observation noise and constant process noise models.
b) Results: As expected, training on L_NLL leads to much better likelihood scores than training on L_MSE for all DFs, see Fig. 2. The best tracking errors, on the other hand, are reached with L_MSE, which also yields more precise sensor models.
For judging the quality of a DF, both NLL and tracking error should be taken into account: While a low RMSE is important for all tasks that use the state estimate, a good likelihood means that the uncertainty about the state is communicated correctly, which enables e.g. risk-aware planning and failure detection.
The combined loss L_mix trades off these two objectives during training. It does not, however, outperform either single loss on its respective objective. A possible explanation is that the two losses can produce opposing gradients: All DFs tend to overestimate the process noise when trained only on L_MSE. This lowers the tracking error by giving more weight to the observations in dEKF, dUKF and dMCUKF and by allowing more exploration in the dPF. But it also results in a higher uncertainty about the state, which is undesirable when optimizing the likelihood.
We generally recommend using L NLL during training to ensure learning accurate noise models. If learning the process and sensor model does not work well, L NLL can either be combined with L MSE or the models can be pretrained.

D. Training Sequence Length
[6] evaluated training their dPF on sequences of length k ∈ {1, 2, 4} and found that using more steps improved results. Here, we want to test if increasing the sequence length even further is beneficial. However, longer training sequences also mean longer training times (or higher memory consumption). We thus aim to find a value of k with a good trade-off between training speed and model performance.
a) Experiment: We evaluate the DFs on a dataset with 15 distractors and constant process noise (σ_qp = 0.1, σ_qv = 2). The filters learn the sensor and process model as well as heteroscedastic observation noise and constant process noise models. We train using L_NLL on sequence lengths k ∈ {1, 2, 5, 10, 25, 50} while keeping the total number of examples per batch (steps × batch size) constant.
b) Results: Our results in Figure 3 show that all filters benefit from longer training sequences much more than the results in [6] indicated. However, while only one time step is clearly too little, returns diminish after around ten steps.
Why are longer training sequences helpful? One issue with short sequences is that we use noisy initial states during training. This reflects real-world conditions, but the noisy inputs hinder learning the process model. On longer sequences, the observation updates can improve the state estimate and thus provide more accurate input values.
We repeated the experiment without perturbing the initial state, but the results with k ∈ {1, 2} got even worse: Since the DFs could now learn accurate process models, they did not need the observations to achieve a low training loss and thus did not learn a proper sensor model. On the longer test sequences, however, even small errors from the noisy dynamics accumulate over time if they are not corrected by the observations.
To summarize, longer sequences are beneficial for training DFs, because they demonstrate error accumulation during filtering and allow for convergence of the state estimate when the initial state is noisy. However, performance eventually saturates and increasing k also increased our training times. We therefore chose k = 10 for all experiments, which provides a good trade-off between training speed and performance.

E. Learning Noise Models
The following experiments analyze how well complex models of the process and observation noise can be learned through the filters and how much this improves the filter performance. To isolate the effect of the noise models, we use a fixed, pretrained sensor model and the true analytical process model, such that only the noise models are trained. We initialize Q and R with Q = I 4 and R = 100I 2 . All DFs are trained on L NLL .
Appendix II-C contains extended experimental results on additional datasets as well as data for the dPF-G.
1) Heteroscedastic Observation Noise: We first test if learning more complex, heteroscedastic observation noise models improves the performance of the filters as compared to learning constant noise models. For this, we compare DFs that learn constant or heteroscedastic observation noise (the process noise is constant) on a dataset with constant process noise (σ_qp = 3, σ_qv = 2) and 30 distractors.
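As an illustration of what such a heteroscedastic observation noise model amounts to, the following sketch maps a visibility feature to a standard deviation; the input feature (number of visible target pixels) and the linear weights are made-up stand-ins for a trained network:

```python
import numpy as np

# Minimal sketch of a heteroscedastic observation-noise head: a learned
# function maps a per-frame feature (here, the number of visible target
# pixels -- an assumption for illustration) to a standard deviation.
# A softplus keeps sigma positive; the weights are made up, not trained.
def softplus(x):
    return np.log1p(np.exp(x))

def predict_sigma_r(visible_pixels, w=-0.05, b=3.0):
    """Fewer visible pixels (more occlusion) -> larger predicted sigma_r."""
    return softplus(w * visible_pixels + b)

occluded = predict_sigma_r(0.0)    # target fully occluded
visible = predict_sigma_r(200.0)   # target fully visible
# occluded > visible: the filter trusts observations less under occlusion
```

A constant noise model corresponds to dropping the input dependence, i.e. predicting the same sigma for every frame.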
To measure how well the predicted observation noise reflects the visibility of the target disc, we compute the correlation coefficient between the predicted R and the number of visible target pixels. We also evaluate the similarity between the learned and the true process noise model using the Bhattacharyya distance. a) Results: Results are shown in Table I. When learning constant observation noise, all DFs perform relatively badly in terms of the tracking error. Upon inspection, we find that all filters learn a very high R and thus mostly rely on the process model for their prediction. For example, the dEKF predicts σ_rp = 25.4. This is expected, since trusting the observations would result in wrong updates to the mean state estimate when the target disc is occluded.
Like [4], we find that heteroscedastic observation noise significantly improves the tracking performance of all DFs (except for the dPF-M). The strong negative correlation between R and the visible disc pixels shows that the DFs correctly predict higher uncertainty when the target is occluded. For example, the dEKF predicts values as low as σ_rp = 0.9 when the disc is perfectly visible and as high as σ_rp = 29.3 when it is fully occluded. Finally, all DFs learn values of Q that are close to the ground truth. For dEKF, dUKF and dMCUKF, the results improve significantly when heteroscedastic observation noise is learned. This could be because the worse tracking performance with constant observation noise impedes learning an accurate process model and thus requires higher process noise.
[Fig. 4: Predicted and true process noise from the dEKF over one test sequence of the disc tracking task. Our model predicts separate values for the x- and y-coordinates of position and velocity, but the ground-truth process noise has the same σ for both coordinates.]
2) Heteroscedastic Process Noise: The effect of learning heteroscedastic process noise has not yet been evaluated in related work. We create datasets with heteroscedastic ground truth process noise, where the magnitude of q_v increases in three steps the closer the disc is to the origin. The positional process noise q_p remains constant (σ_qp = 3.0). We compare the performance of DFs that learn constant and heteroscedastic process noise while the observation noise is heteroscedastic in all cases. a) Results: As shown in Table II, learning heteroscedastic models of the process noise is a bit more difficult than for the observation noise. This is not surprising, as the input values for predicting the process noise are the noisy state estimates.
Plotting the predicted values for Q (see Fig. 4 for an example from the dEKF) reveals that all DFs learn to follow the real values for the heteroscedastic velocity noise relatively well, but also predict state-dependent values for q_p, which is actually constant. This could mean that the models have difficulties distinguishing between q_p and q_v as sources of uncertainty about the disc position. However, we see the same behavior also on a dataset with constant ground truth process noise. We thus assume that the models rather pick up an unintentional pattern in our data: The probability of the disc being occluded turned out to be higher in the middle of the image. The filters react to this by overestimating q_p in the center, which results in an overall higher uncertainty about the state in regions where occlusions are more likely.
Despite not being completely accurate, learning heteroscedastic noise models still increases the performance of all DFs by a small but consistent margin. Even when the ground-truth process noise model is constant, most of the DFs were able to improve their RMSE and likelihood scores slightly by learning "wrong" heteroscedastic noise models.
3) Correlated Noise: So far, we have only considered noise models with diagonal covariance matrices. In this experiment, we want to see if DFs can learn to identify correlations in the noise. We compare the performance of DFs that learn noise models with diagonal or full covariance matrix on datasets with and without correlated process noise. Both the learned process and the observation noise model are also heteroscedastic.
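A common way to let a network output a valid full covariance matrix is to predict the entries of a Cholesky factor; we sketch this parameterization here as an assumption for illustration (the concrete parameterization in our implementation may differ):

```python
import numpy as np

# Sketch: parameterize a full (correlated) noise covariance through an
# unconstrained vector via its Cholesky factor, which guarantees a
# positive-definite matrix. Dimensions and values are illustrative only.
def vector_to_covariance(theta, n):
    """theta has n*(n+1)/2 entries filling the lower triangle of L."""
    L = np.zeros((n, n))
    L[np.tril_indices(n)] = theta
    # exponentiate the diagonal so it is strictly positive
    L[np.diag_indices(n)] = np.exp(np.diag(L))
    return L @ L.T

theta = np.array([0.1, 0.5, -0.2])  # 2x2 case: 3 free parameters
Q = vector_to_covariance(theta, 2)
# Q is symmetric positive-definite; theta[1] induces the off-diagonal term
```

A diagonal noise model is the special case where the off-diagonal entries of theta are fixed to zero.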
The results (see Appendix II-C.3) show that learning correlated noise models leads to a further small improvement of the performance of all DFs when the true process noise is correlated. However, uncovering correlations in the noise seems to be even more difficult than learning accurate heteroscedastic noise models, as indicated by the still high Bhattacharyya distance between true and learned Q.

F. Benchmarking
In the final experiment on this task, we compare the performance of the DFs among each other and to two LSTM models. We use an LSTM architecture similar to [5], with one or two layers of LSTM cells (512 units each). The LSTM state is decoded into mean and covariance of a Gaussian state estimate. a) Experiment: All models are trained for 30 epochs. The DFs learn the sensor and process models with heteroscedastic, diagonal noise models. We compare their performance on datasets with 30 distractors and different levels of constant or heteroscedastic process noise. Each experiment is repeated two times to account for different initializations of the weights. b) Results: The results in Table III show that all models (except for the dPF-G, see Appendix Table S7) learn to track the target disc well and make reasonable uncertainty predictions. In terms of tracking error, the dPF with learned observation update performs best on all evaluated datasets. This, however, often does not extend to the likelihood scores. For the NLL, the dMCUKF instead mostly achieves the best results, though without a significant advantage over the other DFs.
If we exclude the dPF variant with learned observation model (which is more expressive than the other DFs), the choice of the underlying filtering algorithm does not make a big difference for the performance on this task. The unstructured LSTM model, in contrast, requires two layers of LSTM cells (512 units each) to reach the performance of the DFs. Unstructured models like LSTMs can thus learn to perform similarly to differentiable filters, but require a much higher number of trainable parameters, which increases computational demands and the risk of overfitting.

VII. KITTI VISUAL ODOMETRY
As a first real-world application, we study the KITTI Visual Odometry problem [36] that was also evaluated by [4] and [5]. The task is to estimate the position and orientation of a driving car given a sequence of RGB images from a front-facing camera and the true initial state.
The state is 5-dimensional and includes the position p and orientation θ of the car as well as the current linear and angular velocity v and θ̇. The real control input u = (v̇, θ̈)ᵀ is unknown, and we thus treat changes in v and θ̇ as the result of process noise. The position and heading estimate can be updated analytically by Euler integration.
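The Euler-integration process model for this 5D state can be sketched as follows; dt = 0.1 s matches the 10 Hz recording rate of the data, and the deterministic part keeps v and θ̇ constant since their changes are modeled as process noise:

```python
import numpy as np

# Sketch of the analytical Euler-integration process model for the 5D
# KITTI state (p_x, p_y, theta, v, theta_dot). dt = 0.1 s assumes the
# 10 Hz data rate; v and theta_dot are only changed by process noise,
# so the deterministic model leaves them untouched.
def process_model(x, dt=0.1):
    px, py, theta, v, theta_dot = x
    return np.array([
        px + dt * v * np.cos(theta),   # position follows the heading
        py + dt * v * np.sin(theta),
        theta + dt * theta_dot,        # heading integrates angular velocity
        v,                             # unknown controls -> process noise
        theta_dot,
    ])

x1 = process_model(np.array([0.0, 0.0, 0.0, 10.0, 0.1]))
# at v = 10 m/s and theta = 0, one step advances x by 1 m
```
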
While the dynamics model is simple, the challenge in this task comes from the unknown actions and the fact that the absolute position and orientation of the car cannot be observed from the RGB images. At each timestep, the filters receive the current images as well as a difference image between the current and previous timestep. From this, the filters can estimate the angular and linear velocity to update the state, but the uncertainty about the absolute position and heading will inevitably grow due to missing feedback. Please refer to Appendix III-A for details on the implementation of the sensor network, the learned process model and the learned noise models.

A. Data
The KITTI Visual Odometry dataset consists of eleven trajectories of varying length (from 270 to over 4500 steps) with ground truth annotations for position and heading and image sequences from two different cameras collected at 10 Hz.
Following [4] and [5], we build eleven different datasets. Each of the original trajectories is used as the test split of one dataset, while the remaining 10 sequences are used to construct the training and validation split.
To augment the data, we use the images from both cameras for each trajectory and also mirror the sequences. For training and validation, we extract 200 sequences of length 50 with different random starting points from each augmented trajectory. This results in 1013 training and 287 validation sequences. For testing, we extract sequences of length 100 from the augmented test trajectory. The number of test sequences depends on the overall length of the test trajectory.
When looking at the statistics of the eleven trajectories in the original KITTI dataset, Trajectory 1 can be identified as an outlier: It shows driving on a highway, where the velocity of the car is much higher than in all the other trajectories. As a result, the sensor models trained on the other sequences will yield bad results when evaluated on Trajectory 1. We will therefore mostly report results for only a ten-fold cross-validation that excludes the dataset for testing on Trajectory 1. We will refer to this as KITTI-10, while the full, eleven-fold cross-validation will be denoted as KITTI-11. In Sec. VII-D, results for both settings are reported, such that the influence of Trajectory 1 becomes visible.

B. Learning Noise Models
In this experiment, we want to test how much the DFs profit from learning the process and observation noise models end-to-end through the filters, as compared to using handtuned or individually learned noise models.
We also again compare learning constant or heteroscedastic noise models. In contrast to the previous task, we do not expect as large a difference between constant or heteroscedastic observation noise for this task, as the visual input does not contain occlusions or other events that would drastically change the quality of the predicted observations z.
a) Experiment: As in the experiments on simulated data (Sec. VI-E), we use a fixed, pretrained sensor model and the analytical process model, and only train the noise models. We initialize Q and R with Q = I_5 and R = I_2. All DFs are trained with L_NLL and a sequence length of 25, which we found to be beneficial for learning the noise models in a preliminary experiment.
We compare the DFs when learning different combinations of constant or heteroscedastic process and observation noise.
As one baseline, we use DFs with fixed constant noise models that reflect the average validation error of the pretrained sensor model and the analytical process model. A second baseline fixes the noise models to those obtained by individual pretraining, where we evaluate both constant and heteroscedastic models. All DFs are evaluated on KITTI-10. b) Results: The results in Table IV show that learning the noise models end-to-end through the filters greatly improves the NLL but has no big effect on the tracking errors for this task. The DFs with the hand-tuned, constant noise model have by far the worst NLL because they greatly underestimate the uncertainty about the vehicle pose. The DFs that use individually trained noise models perform better, but are still overly confident.
For most of the DFs, we achieve the best results when learning constant observation and heteroscedastic process noise. The worst results are achieved when instead the observation noise is heteroscedastic and the process noise constant. This could indicate that the true process noise can be better modeled by a state-dependent noise model while learning heteroscedastic observation noise leads to overfitting to the training data. However, the differences are overall not very pronounced.
Finally, we also evaluated the DFs with full covariance matrices for the noise models. For the setting with constant observation and heteroscedastic process noise, using full covariance matrices […]

C. End-to-End versus Individual Training
Previous work [5] has shown that end-to-end training through differentiable filters leads to better results than running the DFs with models that were trained individually. Specifically, pretraining the models individually and finetuning end-to-end resulted in the best tracking performance. As a possible explanation, the authors found that the individually trained process noise model predicted noise close to the ground truth, whereas the end-to-end trained model overestimated the noise, which is believed to be beneficial for filter performance.
Does this mean that end-to-end training through DFs mostly affects the noise models? To test this, we pretrain all models individually and compare the performance of the DFs without finetuning, when finetuning only the noise models or only the sensor and process model, and when finetuning everything. We also report results for training the DFs from scratch. a) Experiment: We pretrain sensor and process model and their associated (constant) noise models individually for 30 epochs. For finetuning, we load the pretrained models and finetune the desired parts for 10 epochs, while the end-to-end trained versions are trained for 30 epochs. All variants are evaluated using KITTI-10 and trained using L_NLL. b) Results: The results shown in Table V support our hypothesis that end-to-end training through the DFs is most important for learning the noise models: Finetuning only the noise models improved both RMSE and NLL of all DFs in comparison to the variants without finetuning or with finetuning only the sensor and process model (except for the dMCUKF). For dEKF and dPF, finetuning the sensor and process model even decreased the performance on both measures.
In terms of tracking error, individual pretraining plus finetuning the noise models led to the best results for dEKF and dPF, while dUKF and dMCUKF performed slightly better when finetuning both sensor and process model and their noise models (dMCUKF) or even learning both from scratch (dUKF). For the NLL, finetuning only the noise models led to the best results for all DFs, followed in most cases by training from scratch.
To summarize, the results indicate that individual pretraining is helpful for learning the sensor and process models, but not for the noise models. End-to-end training through the DFs, on the other hand, again proved to be important for optimizing the noise models for the respective filtering algorithm but did not offer advantages for learning the sensor and process model.

D. Benchmarking
In the final experiment on this task, we compare the performance of the DFs to an LSTM model. We again use an LSTM architecture similar to [5], but with only one layer of LSTM cells with 256 units. [Table VI: Results on KITTI: Comparison between the DFs and LSTM (mean and standard error). Numbers for prior work BKF* and LSTM* taken from [4] and DPF* taken from [5]. BKF* and DPF* use a fixed analytical process model while our DFs learn both sensor and process model. m/m and deg/m denote the translation and rotation error at the final step of the sequence divided by the overall distance traveled.] The LSTM state is decoded into an update for the mean and the covariance of a Gaussian state estimate. Like the process model of the DFs, the LSTM does not get the full initial state as input, but only those components that are necessary for computing a state update (velocities and sine and cosine of the heading). We chose this architecture in an attempt to make the learning task easier for the LSTM. a) Experiment: All models are trained for 30 epochs using L_NLL, except for the LSTM, for which L_mix led to better results. The DFs learn the sensor and process models with constant noise models. We report their performance on KITTI-10 and KITTI-11 for comparison with prior work. b) Results: The results in Table VI show that by training all the models in the DFs from scratch, we can reach a performance that is competitive with prior work by [4], despite not relying on an analytical process model. We were, however, not able to reach the very good performance of the dPF reported by [5]. A possible cause could be that the normalization of the particles in the learned observation update used by [5] helps the method to better deal with the higher overall velocity in Trajectory 1 of the KITTI dataset.
In contrast to the DFs, we were not able to train LSTM models that reached a good evaluation performance on this task, despite trying multiple different architectures and loss functions. Different from the experiments on the simulation task, increasing the number of units per LSTM layer or using multiple LSTM layers even decreased the performance here. To complement our results, we also report an LSTM result from [4] that does better on the position error but worse on the orientation error. While these findings do not mean that a better performance could not be reached with unstructured models given different architectures or training routines, they still show that the added structure of the filtering algorithms greatly facilitates learning in more complex problems.
For this task, the dPF-M-lrn again achieves the overall best tracking result, closely followed by the dUKF, which reaches the lowest normalized endpoint position error (m/m). One reason for the comparably bad performance of the dEKF could be that the dynamics of the Visual Odometry task are more strongly non-linear than in the previous experiments. Both UKF and PF can convey the uncertainty more faithfully in this case, which could lead to better overall results when training on L_NLL. Given the relatively large standard errors, the differences between the DFs are, however, not significant.

VIII. PLANAR PUSHING
In the KITTI Visual Odometry problem, the main challenges were the unknown actions and dealing with the inevitably increasing uncertainty about the vehicle pose. Our second real-robot experiment, planar pushing, in contrast addresses a task with much more complex dynamics. Apart from the dynamics being non-linear and discontinuous (when the pusher makes or breaks contact with the object), [10] also showed that the noise in the system can be best captured by a heteroscedastic noise model.
With 10 dimensions, the state representation we use is also much larger than in our previous experiments. x contains the 2D position p_o and orientation θ of the object, as well as the two friction-related parameters l and α_m. In addition, we include the 2D contact point between pusher and object r, the normal to the object's surface at the contact point n and a contact indicator s. The control input u contains the start position p_u and movement v_u of the pusher.
An additional challenge of this task is that r and n are only properly defined and observable when the pusher is in contact with the object. We thus set the labels for n to zeros and r = p_u for non-contact cases. a) Dynamics: We use an analytical model by [37] to predict the linear and angular velocity of the object (v_o, ω) given the previous state and the pusher motion v_u. However, predicting the next r, n and s is not possible with this model, since this would require access to a representation of the object shape.
[Fig. 5: Examples of the rendered RGB images that we use as observations for the pushing task. The last example shows that the robot arm can partially occlude the object in some positions.]
For r, we thus use a simple heuristic that predicts the next contact point as r_{t+1} = r_t + v_{u,t}. n and s are only updated when the angle between pusher movement and (inward-facing) normal is greater than 90°. In this case, we assume that the pusher moves away from the object and set s_{t+1} and n_{t+1} to zeros. b) Observations: Our sensor network receives simulated RGBXYZ images as input and outputs the pose of the object, the contact point and normal as well as whether the pusher will be in contact with the object during the push.
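The contact heuristic described under a) Dynamics can be sketched as follows; the 90° test is implemented via the dot product between pusher motion and inward-facing normal, and all numeric values are hypothetical:

```python
import numpy as np

# Sketch of the heuristic updates for the contact point r, normal n and
# contact indicator s: r moves with the pusher, and contact is broken when
# the angle between v_u and the inward-facing normal exceeds 90 degrees
# (equivalently, when their dot product is negative).
def predict_contact(r, n, s, v_u):
    r_next = r + v_u                  # r_{t+1} = r_t + v_{u,t}
    if np.dot(v_u, n) < 0.0:          # angle > 90 deg: pusher moves away
        return r_next, np.zeros_like(n), 0.0
    return r_next, n, s

r, n, s = np.array([0.1, 0.0]), np.array([1.0, 0.0]), 1.0
r2, n2, s2 = predict_contact(r, n, s, v_u=np.array([-0.01, 0.0]))
# the pusher moves against the inward normal -> contact broken (s2 == 0)
```
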
Apart from the latent parameters l and α_m, the orientation of the object, θ, is the only state component that cannot be observed directly. Estimating the orientation of an object from a single image would require a predefined "zero-orientation" for each object, which is impractical. Instead, we train the sensor network to predict the orientation relative to the object pose in the initial image of each pushing sequence.

A. Data
We use the data from the MIT Push dataset [38] as a basis for constructing our datasets. Further annotations for contact points and normals as well as rendered images are obtained using the tools described by [39]. However, in contrast to [39], the images we use here also show the robot arm and are taken from a more realistic view-point. As a result, the robot frequently occludes parts of the object, but complete occlusions are rare. Figure 5 shows example views.
We use pushes with a velocity of 50 mm/s and render images with a frequency of 5 Hz. This results in short sequences of about five images for each push in the original dataset. We extend them to 20 steps for training and validation and 50 steps for testing by chaining multiple pushes and adding in-between pusher movement when necessary. The resulting dataset contains 5515 sequences for training, 624 validation sequences and 751 sequences for testing.

B. Learning Noise Models
In this experiment, we again evaluate how much the DFs profit from learning the process and observation noise models end-to-end through the filters. In contrast to the KITTI task, for pushing, we expect both heteroscedastic observation and process noise to be advantageous, since the visual observations feature at least partial occlusions and the dynamics of pushing have been previously shown to exhibit heteroscedasticity [10].
To test this hypothesis, we compare DFs that learn constant or heteroscedastic noise models to DFs with hand-tuned, constant noise models that reflect the average test error of the pretrained sensor model and the analytical process model. The results in Table VII again demonstrate that learning the noise models end-to-end through the structure of the filtering algorithms is beneficial. With learned models, all DFs reach much better likelihood scores than with the hand-tuned variants. For the dEKF and especially the dPF, the tracking performance also improves significantly.
Comparing the results between constant and heteroscedastic noise models also confirms our hypothesis that for the pushing task, heteroscedastic noise models are beneficial for both observation and process noise. While all DFs reach the best NLL when both noise models are state-dependent, the effect on the tracking error is, however, less clear.
For dEKF, dUKF and dMCUKF, learning a heteroscedastic observation noise model leads to a much bigger improvement of the NLL than learning heteroscedastic process noise. Similar to the simulated disc tracking task, the input dependent noise model allows the DFs to better deal with occlusions in the observations, which again reflects in a negative correlation between the number of visible object pixels and the predicted positional observation noise.

C. Benchmarking
In the final experiment, we compare the performance of the DFs to an LSTM model on the pushing task. As before, we use a model with one LSTM layer with 256 units. The LSTM state is decoded into an update for the mean and the covariance of a Gaussian state estimate. a) Experiment: All models are trained for 30 epochs using L_mix. As initial experiments showed that learning sensor and process model jointly from scratch is very difficult for this task due to the more complex architectures, we pretrain both models. The sensor and process models are finetuned through the DFs, and the DFs learn heteroscedastic noise models. The LSTM, too, uses the pretrained sensor model, but not the process model. b) Results: As shown in Table VIII, even with a learned process model, all DFs (except for the dPF-M-lrn) perform at least similar to their counterparts in the previous experiment where we used the analytical process model. dEKF, dUKF and dMCUKF even reach a higher tracking performance than before. As noted by [39], this can be explained by the quasi-static assumption of the analytical model being violated for push velocities above 20 mm/s. The LSTM model, again, does not reach the performance of the DFs. One disadvantage of the LSTM here is that, in contrast to the DFs, we cannot isolate and pretrain the process model. In contrast to the previous tasks, however, the dPF variant with the learned likelihood function performs even worse than the LSTM for planar pushing. This is likely due to the complex sensor model and the high-dimensional state that make learning the observation likelihood much more challenging.

IX. CONCLUSIONS
Our experiments show that all evaluated DFs are well suited for learning both sensor and process model, and the associated noise models. For simpler tasks like the simulated tracking task and the KITTI Visual Odometry problem, all of these models can be learned end-to-end. Only the pushing problem with its large state and complex dynamics and sensor model requires pretraining to achieve good results.
In comparison to unstructured LSTM models, the DFs generally use fewer weights and achieve better results, especially on complex tasks. While training better LSTM models might be possible for more experienced LSTM users, using the algorithmic structure of the filtering algorithms definitely facilitated the learning problem and thus made it much easier to reach good performance with the DFs. In addition, the structure of DFs allows us to pretrain components such as the process model that are not explicitly accessible in LSTMs.
The direct comparison between DFs with different underlying filtering algorithms showed no clear winner. Only the dPF with learned observation update performed notably better than the other variants on the simulated example task and was least affected by the outlier trajectory of the KITTI task. This variant relaxes some of the assumptions that the filtering algorithms encode by not relying on an explicit sensor or observation noise model. Its good performance thus shows that the priors enforced by the algorithm choice can also be harmful if they do not hold in practice, such as the Gaussian noise assumption.
Our experiments suggest that for learning the sensor and process model, end-to-end training through the filters is convenient, but provides no advantages over training the models individually. End-to-end training, however, proved to be essential for optimizing the noise models for their respective filtering algorithm. In contrast to end-to-end trained models, both hand-tuned and individually trained noise models did not result in optimal performance of the DFs. Training noise models through DFs also enables learning more complex noise models than the ones used in learning-free, hand-tuned filters. We demonstrate that noise models with full (instead of diagonal) covariance matrices, and especially heteroscedastic noise models, can significantly improve the tracking accuracy and uncertainty estimates of DFs.
The main challenge in working with differentiable filters is keeping the training stable and finding good choices for the numerous hyper-parameters and implementation options of the filters. While we hope that this work provides some orientation about which parameters matter and how to set them, we still recommend using the dEKF for getting started with differentiable filters. It is not only the simplest of the DFs we evaluated, but it also proved to be relatively insensitive to sub-optimal initialization of the noise models and was the most numerically stable during training. For tasks with strongly non-linear dynamics, however, the dUKF, dMCUKF or dPF can ultimately achieve a better tracking performance.
One interesting direction for future research that we have not attempted here is to optimize parameters of the filtering algorithms, such as the scaling parameters of the dUKF or the fixed covariance of the mixture model components in the dPF-M, by end-to-end training. It could also be interesting to implement DFs with other underlying filtering algorithms. For example, the pushing task could potentially be better handled by a Switching Kalman filter [40] that explicitly treats the contact state as a binary decision variable. In addition, all of our DFs perform badly on the outlier trajectory of the KITTI dataset, which features a much higher driving velocity than the other trajectories we used for training the model. This shows that the ability to detect input values outside of the training distribution would be a valuable addition to current DFs. Finally, it would be interesting to compare learning in DFs to similar variational methods such as the ones introduced by [26,28,33] or the model-free PF-RNNs introduced by [13].

APPENDIX I TECHNICAL BACKGROUND
In the following section, we briefly review the Bayesian filtering algorithms that we use as basis for our differentiable filters.

A. Kalman Filter
The Kalman filter [41] is a closed-form solution to the filtering problem for systems with a linear process and observation model and Gaussian additive noise: The belief about the state x is represented by the mean μ and covariance matrix Σ of a normal distribution. At each timestep, the filter predicts μ̂_t and Σ̂_t using the process model. The innovation i_t is the difference between the predicted and actual observation and is used to correct the prediction. The Kalman gain K trades off the process noise Q and the observation noise R to determine the magnitude of the update.
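For reference, one predict-update cycle of the linear Kalman filter can be sketched as follows; the matrices and the 1D example values are purely illustrative:

```python
import numpy as np

# Minimal linear Kalman filter step with the quantities described above:
# predicted mean/covariance, innovation i_t, and Kalman gain K trading off
# process noise Q against observation noise R.
def kf_step(mu, Sigma, z, A, H, Q, R):
    # Prediction with the linear process model A
    mu_pred = A @ mu
    Sigma_pred = A @ Sigma @ A.T + Q
    # Update: innovation, gain, corrected belief
    innovation = z - H @ mu_pred
    S = H @ Sigma_pred @ H.T + R
    K = Sigma_pred @ H.T @ np.linalg.inv(S)
    mu_new = mu_pred + K @ innovation
    Sigma_new = (np.eye(len(mu)) - K @ H) @ Sigma_pred
    return mu_new, Sigma_new

# 1D constant-position example: a large R makes the filter trust the
# prediction and move only slightly toward the observation z = 1.
A = H = np.eye(1)
mu, Sigma = kf_step(np.array([0.0]), np.eye(1), np.array([1.0]),
                    A, H, Q=0.01 * np.eye(1), R=100.0 * np.eye(1))
```
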

B. Extended Kalman Filter (EKF)
The EKF [42] extends the Kalman filter to systems with non-linear process and observation models. It replaces the linear models for predicting μ̂ in Equation S3 and the corresponding observations ẑ in Equation S7 with non-linear models f(·) and h(·). For predicting the state covariance Σ and computing the Kalman gain K, these non-linear models are linearized around the current mean of the belief. The Jacobians F|_μt and H|_μt replace A and H in Equations S4 - S6 and S9. This first-order approximation can be problematic for systems with strong non-linearity, as it does not take the uncertainty about the mean into account [43].

C. Unscented Kalman Filter (UKF)
The UKF [44,43] was proposed to address the aforementioned problem of the EKF. Its core idea, the Unscented Transform [44], is to represent a Gaussian random variable that undergoes a non-linear transformation by a set of specifically chosen points in state space, the so-called sigma points χ ∈ X, with
λ = α²(κ + n) − n (S10)
Here, n is the number of dimensions of the state x. Each sigma point χ^i has two weights, w_m^i and w_c^i. The parameters α and κ control the spread of the sigma points and how strongly the original mean χ^0 is weighted in comparison to the other sigma points. β = 2 is recommended if the true distribution of the system is Gaussian.
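A sketch of the sigma-point construction, assuming the standard Unscented Transform formulas for the points and weights (Equations S11 and S12 are not reproduced here, so this is an illustration rather than our exact implementation):

```python
import numpy as np

# Sigma points and weights of the Unscented Transform for a belief N(mu, Sigma),
# with lambda = alpha^2 (kappa + n) - n as in Eq. S10. The parameter values
# are illustrative.
def sigma_points(mu, Sigma, alpha=1.0, kappa=1.0, beta=2.0):
    n = len(mu)
    lam = alpha**2 * (kappa + n) - n
    sqrt_mat = np.linalg.cholesky((n + lam) * Sigma)  # matrix square root
    points = [mu] + [mu + sqrt_mat[:, i] for i in range(n)] \
                  + [mu - sqrt_mat[:, i] for i in range(n)]
    w_m = np.full(2 * n + 1, 1.0 / (2.0 * (n + lam)))
    w_c = w_m.copy()
    w_m[0] = lam / (n + lam)
    w_c[0] = lam / (n + lam) + (1.0 - alpha**2 + beta)
    return np.array(points), w_m, w_c

pts, w_m, w_c = sigma_points(np.zeros(2), np.eye(2))
# the mean weights sum to one, and the weighted mean recovers mu
```
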
The statistics of the transformed random variable can then be calculated from the transformed sigma points. For example, in the prediction step of the UKF, the non-linear transform is the process model (Eq. S13) and the new mean and covariance of the belief are computed in Equations S14 and S15.
In the observation update step, S, K and i from Equations S5, S6 and S7 are likewise replaced by their sigma-point counterparts, computed from the transformed sigma points. In theory, the UKF conveys the non-linear transformation of the covariance more faithfully than the EKF and is thus better suited for strongly non-linear problems [45]. In contrast to the EKF, it also does not require computing the Jacobian of the process and observation models, which can be advantageous when those models are learned.
In practice, tuning the parameters of the UKF can, however, sometimes be challenging. If α²(κ + n) is too big, the sigma points are spread too far from the mean and the prediction uncertainty increases. However, for 0 < α²(κ + n) < n, the sigma point χ_0, which represents the original mean, is weighted negatively. This not only seems counterintuitive, but a strongly negative w^0 can also negatively affect the numerical stability of the UKF [46], which sometimes causes divergence of the estimated mean. In addition, if κ/(n + κ) < 0, the estimated covariance matrix is no longer guaranteed to be positive semi-definite. This problem can be solved by changing the way in which Σ is computed (see Appendix III in [47]).
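The sign behavior of the central weight is easy to verify numerically. The following snippet (illustrative only; the helper name is ours) evaluates w_m^0 = λ/(n + λ) for two parameter settings:

```python
import numpy as np

def w0_mean_weight(alpha, kappa, n):
    """Weight of the central sigma point chi_0 in the mean computation."""
    lam = alpha**2 * (kappa + n) - n
    return lam / (n + lam)

# For a 4-dimensional state, a small alpha with kappa = 0 gives
# 0 < alpha^2 (kappa + n) < n, so chi_0 is weighted negatively:
assert w0_mean_weight(alpha=0.1, kappa=0.0, n=4) < 0
# With alpha^2 (kappa + n) > n the weight is positive again:
assert w0_mean_weight(alpha=1.0, kappa=3.0, n=4) > 0
```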

D. Monte Carlo Unscented Kalman Filter (MCUKF)
The UKF represents the belief over the state with as few sigma points as possible. However, finding the correct scaling parameters α, κ and β can sometimes be difficult, especially if the state is high dimensional. Instead of relying on the Unscented Transform to calculate the mean and covariance of the next belief, we can also resort to Monte Carlo methods, as proposed by [48].
In practice, this means replacing the carefully constructed sigma points and their weights in Equations S11 and S12 with uniformly weighted samples from the current belief. The rest of the UKF algorithm remains the same, but more sampled pseudo sigma points are necessary to represent the distribution of the belief accurately.
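Replacing the sigma points with Monte Carlo samples could look like this (an illustrative NumPy sketch; the sample count and function signature are our own choices):

```python
import numpy as np

def mc_predict(mu, Sigma, f, Q, num_samples=100, rng=None):
    """Monte Carlo variant of the UKF prediction step: uniformly
    weighted samples from the current belief replace the sigma points."""
    rng = np.random.default_rng(rng)
    samples = rng.multivariate_normal(mu, Sigma, size=num_samples)

    # Propagate all samples through the process model
    Y = np.array([f(s) for s in samples])

    # Uniform weights: plain sample mean and covariance, plus process noise
    mu_pred = Y.mean(axis=0)
    diff = Y - mu_pred
    Sigma_pred = diff.T @ diff / num_samples + Q
    return mu_pred, Sigma_pred
```

Unlike the Unscented Transform, this estimate is only correct in expectation, which is why more pseudo sigma points are needed for an accurate belief representation.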

E. Particle Filter (PF)
In contrast to the different variants of the Kalman filter explained before, the Particle filter [49] does not assume a parametric representation of the belief distribution. Instead, it represents the belief with a set of weighted particles. This allows the filter to track multiple hypotheses about the state at the same time and makes it a popular choice for tasks like localization or visual object tracking [45].
An initial set of particles χ_0^i ∈ X_0 is drawn from the initial belief and initialized with uniform weights π. In the prediction step, new particles are generated by applying the process model to the old particle set and sampling additive process noise. In the observation update step, the weight π_t^i of each particle χ_t^i is updated using the current observation z_t. A potential problem of the PF is particle deprivation: over time, many particles will receive a very low likelihood p(z_t | χ_t^i), and eventually the state would be represented by too few particles with high weights. To prevent this, a new set of particles with uniform weights can be drawn (with replacement) from the old set according to the weights. This resampling step focuses the particle set on regions of high likelihood and is usually applied after each timestep.
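The prediction, reweighting and resampling steps above can be sketched as one PF iteration. This is a minimal illustration with a diagonal Gaussian observation likelihood and multinomial resampling; the function name and the observation-variance argument R_var are assumptions for illustration:

```python
import numpy as np

def pf_step(particles, weights, z, f, h, Q, R_var, rng=None):
    """One particle-filter step with additive Gaussian process noise
    (illustrative sketch, not the paper's implementation)."""
    rng = np.random.default_rng(rng)
    N, n = particles.shape

    # Prediction: process model plus sampled additive process noise
    noise = rng.multivariate_normal(np.zeros(n), Q, size=N)
    particles = np.array([f(p) for p in particles]) + noise

    # Observation update: reweight by the likelihood p(z | particle),
    # computed in log space for numerical stability
    z_pred = np.array([h(p) for p in particles])
    log_lik = -0.5 * np.sum((z_pred - z) ** 2 / R_var, axis=1)
    weights = weights * np.exp(log_lik - log_lik.max())
    weights /= weights.sum()

    # Resampling: draw a new uniformly weighted set (with replacement)
    # to avoid particle deprivation
    idx = rng.choice(N, size=N, p=weights)
    return particles[idx], np.full(N, 1.0 / N)
```

After resampling, particles that received a high likelihood appear multiple times in the new set, concentrating it on regions of high posterior probability.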

APPENDIX II EXTENDED EXPERIMENTS: SIMULATED DISC TRACKING
In the following, we present additional information about the experiments we performed for evaluating the DFs. This includes detailed information about the network architectures for each task, extended results and additional experiments.

Table S1: Sensor model and heteroscedastic observation noise architecture. Both fully connected output layers (for z and diag(R)) get fc2's output as input.

Layer | Output Size | Kernel | Stride | Activation
A. Network Architectures and Initialization
The network architectures for the sensor model and heteroscedastic observation noise model are shown in Table S1. Tables S2 and S3 show the architecture for the learned process model and the heteroscedastic process noise. We denote fully connected layers by fc and convolutional layers by conv.
For the initial belief, we use Σ_init = 25 I_4. When training from scratch, we initialize Q and R with Q = 100 I_4 and R = 900 I_2, reflecting the high uncertainty of the untrained models.

B. Implementation and Parameters
All experiments for evaluating different design choices and filter-specific parameters are performed on a dataset with 15 distractors and constant process noise (σ_p = 0.1, σ_v = 2). The filters are trained end-to-end on L_NLL and learn the sensor and process model as well as heteroscedastic observation and constant process noise models. We repeat each experiment twice to account for different initializations of the weights and report means and standard errors.
1) dUKF: a) Experiment: The original version of the UKF by [44] uses a simple parameterization where α = 1 and β = 0 are fixed and only κ varies. The authors recommend setting κ = 3 − n. α and β are used in the later proposed scaled unscented transform [50], for which [43] suggest setting κ = 0, β = 2 and α to a small positive value.
We evaluate the original, simple parameterization as well as the one for the scaled transform. For the first, we test different values of κ; in the second case, we evaluate α ∈ {0.001, 0.1, 0.5} but do not vary β, for which the value 2 is optimal when working with Gaussians. b) Results: As discussed in Section VI-B.1 of the main document, the results show no significant differences between the different parameter settings or between using the original parameterization from [44] and the scaled transform. Only for κ < −n did the training fail, due to a non-invertible matrix in the calculation of the Kalman gain.
2) dMCUKF: The results discussed in Section VI-B.2 of the main document are visualized in Figure S1.
3) dPF: Belief Representation: The results discussed in Section VI-B.3 are visualized in Figure S2.

4) dPF: Observation Update:
The likelihood for the observation update step of the dPF can be implemented with an analytical Gaussian likelihood function (dPF-(G/M)) or with a neural network (dPF-(G/M)-lrn) as in [5] and [6]. [5] predict the likelihood based on an encoding of the sensory data and the observable components of the (normalized) particle states. Our implementation, too, takes the 64-dimensional encoding of the raw observations (fc3 in Table S1) and the observable particle state components as input. However, we decide not to normalize the particles, since having prior knowledge about the mean and standard deviation of each state component in the dataset might give the method an unfair advantage over other variants. a) Results: Results for comparing the learned and analytical observation update can be found in Figure S2.
C. Noise Models 1) Heteroscedastic Observation Noise: Table S4 extends Table I in the main document. It contains results for the dPF-G and on additional datasets with different numbers of distractors and different magnitudes of the positional process noise.
2) Heteroscedastic Process Noise: Table S5 extends Table II in the main document. It contains results for the dPF-G and on additional datasets with different magnitudes of the positional process noise.
3) Correlated Noise: So far, we have only considered noise models with diagonal covariance matrices. In this experiment, we want to see if DFs can learn to identify correlations in the noise. a) Experiment: We create a new dataset with 30 distractors and constant, correlated process noise, whose ground truth covariance matrix contains non-zero off-diagonal entries.
We compare the performance of DFs that learn noise models with a diagonal or full covariance matrix on datasets with and without correlated process noise. Both the learned process and the observation noise model are also heteroscedastic. b) Results: Results are shown in Table S6. Overall, we note that learning correlated noise models has a small but consistent positive effect on the tracking performance of all DFs, even when the ground truth noise is not correlated. On the dataset with correlated ground truth process noise, we also observe an improvement of the likelihood scores.
In terms of the Bhattacharyya distance between true and learned Q, learning correlated models leads to a slight improvement for correlated ground truth noise and to slightly worse scores otherwise. This indicates that the models are able to uncover some, but not all correlations in the underlying data.
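The Bhattacharyya distance between the true and learned noise distributions can be computed with the standard closed form for two Gaussians (the helper name below is ours; the example covariance is illustrative, not the dataset's):

```python
import numpy as np

def bhattacharyya_gaussians(mu1, Q1, mu2, Q2):
    """Bhattacharyya distance between two Gaussians N(mu1, Q1), N(mu2, Q2)."""
    Q = 0.5 * (Q1 + Q2)
    diff = mu2 - mu1
    # Mean term: (1/8) (mu2 - mu1)^T Q^-1 (mu2 - mu1)
    term_mean = 0.125 * diff @ np.linalg.solve(Q, diff)
    # Covariance term: (1/2) ln( det(Q) / sqrt(det(Q1) det(Q2)) ),
    # computed via log-determinants for numerical stability
    _, logdet_Q = np.linalg.slogdet(Q)
    _, logdet_Q1 = np.linalg.slogdet(Q1)
    _, logdet_Q2 = np.linalg.slogdet(Q2)
    term_cov = 0.5 * (logdet_Q - 0.5 * (logdet_Q1 + logdet_Q2))
    return term_mean + term_cov
```

For comparing noise models, both means are zero and the distance is driven purely by the covariance term, so identical covariances give a distance of exactly zero.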
In summary, while learning correlated noise models does not influence the results negatively, it also does not lead to a very pronounced improvement over models with diagonal covariance matrices. Uncovering correlations in the process noise thus seems to be even more difficult than learning accurate heteroscedastic noise models. Table S7 extends Table III from the main document.

APPENDIX III EXTENDED EXPERIMENTS: KITTI VISUAL ODOMETRY

a) Sensor Model: The network architecture for the sensor model is shown in Table S8. At each timestep, the input consists of the current RGB image and the difference image between the current and previous image. This architecture is the same as was used in [4] and [5].

b) Process Model: Tables S9 and S10 show the architecture for the learned process model and the heteroscedastic process noise. For both models, we found it to be important not to include the absolute position of the vehicle in the input values: The value range for the positions is not bounded, and especially for the dUKF variants, novel values encountered at test time often lead to a divergence of the filter.
Excluding these values from the network inputs for predicting the state update also makes sense intuitively, since they are not required for computing the update analytically, either. For the state-dependent process noise, we not only exclude the position, but also the orientation of the car, as any relationships between vehicle pose and noise that could be learned would be specific to the training trajectories.
In addition, we provide the process model with the sine and cosine of θ as input instead of using the raw orientation, to facilitate learning. In general, dealing with angles in the state vector requires special attention: First, we wrap angles to the range [−π, π] after every operation on the state vector. Second, it is important to correctly calculate the difference between angles (e.g. in the loss function) to avoid differences over 180°. And third, computing the mean of several angles, e.g. for the particle mean in the dPF, requires converting the angles to a vector representation. c) Initialization: When creating the noisy initial states, we do not add noise to the absolute position and orientation of the vehicle, since the DFs have no way of correcting them. We use diag(Σ_init) = (0.01, 0.01, 0.01, 25, 25) for the initial covariance matrix. When training the DFs from scratch, we initialize the covariance matrices Q and R with diag(Q) = (0.01, 0.01, 0.01, 100, 100) and R = 100 I_2. This reflects the high uncertainty of the untrained models, but also the fact that the process noise should be higher for the velocities (to account for the unknown driver actions) than for the absolute pose.
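The three angle-handling steps above can be sketched as small NumPy helpers (illustrative, not the paper's code):

```python
import numpy as np

def wrap_angle(a):
    """Wrap angle(s) to the range [-pi, pi)."""
    return (a + np.pi) % (2 * np.pi) - np.pi

def angle_diff(a, b):
    """Smallest signed difference a - b, never exceeding 180 degrees."""
    return wrap_angle(a - b)

def mean_angle(angles, weights=None):
    """(Weighted) circular mean via the vector representation (sin, cos).

    A naive arithmetic mean of e.g. 179 deg and -179 deg would give 0 deg;
    the circular mean correctly yields 180 deg.
    """
    weights = np.ones(len(angles)) if weights is None else np.asarray(weights)
    s = np.sum(weights * np.sin(angles))
    c = np.sum(weights * np.cos(angles))
    return np.arctan2(s, c)
```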

B. Training Sequence Length and Filter Parameters
One special feature of the Visual Odometry task is that the error on the estimated absolute vehicle pose will inevitably grow during filtering. As this could have an effect on the ideal training sequence length, we repeat the experiment from Section VI-D in the main document.
For the dPF-M, we also evaluate different values of the fixed per-particle covariance Σ for calculating the GMM likelihood. We anticipate that this parameter, too, could be sensitive to the accumulating uncertainty in the problem.

Table S6: Results on disc tracking: End-to-end learning of independent (diagonal covariance matrix) or correlated (full covariance matrix) process and observation noise models. We evaluate on one dataset with independent, constant process noise (σ_qp = 3.0, σ_qv = 2.0), one with independent heteroscedastic process noise (σ_qp = 3.0), and one with correlated constant process noise. D_Q is the Bhattacharyya distance between true and learned Q.
In addition, we also reevaluate different values for parameterizing the sigma point selection and weighting in the dUKF.
1) Training Sequence Length and dPF-M: a) Experiment: We only test with the dEKF, dUKF, dPF-M and dPF-M-lrn on KITTI-10. The filters learn the sensor and process model as well as constant noise models. We train them using L_NLL on sequence lengths k ∈ {2, 5, 10, 25} while keeping the total number of examples per batch (steps × batch size) constant.
For the dPF-M, we also evaluate two different values of the per-particle covariance, Σ = I and Σ = 5² I. b) Results: The results shown in Figure S4 largely confirm the results obtained for the simulation dataset in Section VI-D of the main document. We again see that longer training sequences improve the tracking performance of all DFs up to a sequence length of around k = 10.
The dUKF seems to be most sensitive to the sequence length, with the highest tracking error and an extremely bad NLL score for sequences of length 2. Different from the simulation experiment, for both dEKF and dUKF, the NLL keeps decreasing strongly over the full evaluated sequence-length range, despite the best RMSE already being reached at k = 5. We attribute this to the accumulating uncertainty about the vehicle pose. For the dPFs, in contrast, the likelihood behaves similarly to the RMSE. In light of the longer training times with higher sequence lengths, we again decide to keep a training-sequence length of 10 when training the DFs from scratch. However, when only the noise models are trained, longer sequences can be used to improve results on the NLL.
For the dPF-M, the experiment also shows that the covariance of the single distributions in the GMM is an important tuning parameter. With Σ = I, we achieve the best tracking error; however, the likelihood does not reach the performance of the dEKF and dUKF. The NLL values can be drastically improved by using a larger Σ, at the cost of a decreased tracking performance. Visual inspection of the position estimates shows that the particles remain relatively tightly clustered over the complete sequence, such that the likelihood of the GMM is not so different from the likelihood of the individual Gaussian components.
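The effect of the per-particle covariance on the likelihood can be illustrated with a small GMM NLL computation. This is an isotropic sketch (covariance σ² I per particle, uniform weights), not the paper's implementation:

```python
import numpy as np

def gmm_nll(x, particles, sigma):
    """Negative log likelihood of state x under a GMM with one
    isotropic Gaussian (covariance sigma^2 I) per particle."""
    N, n = particles.shape
    d2 = np.sum((particles - x) ** 2, axis=1)
    log_comp = -0.5 * d2 / sigma**2 - 0.5 * n * np.log(2 * np.pi * sigma**2)
    # log-sum-exp over components; mean implements the uniform 1/N weights
    m = log_comp.max()
    return -(m + np.log(np.mean(np.exp(log_comp - m))))
```

When the true state lies outside a tight particle cluster, a broader per-particle covariance assigns it much more probability mass, which lowers the NLL even though the point estimate (and hence the tracking error) is unchanged.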
This clustered particle distribution can be explained by the characteristics of the task: The uncertainty in the system mainly stems from the velocity components that are affected by the unknown actions. However, by applying the observation update and resampling the particles at every step, we keep the variance in the velocity components small and thus prevent a stronger diffusion of the unobserved position components. This also explains why the dPF cannot profit as much as the dUKF and dEKF from seeing longer sequences during training.
The large influence of the tuning parameter Σ on the value of the likelihood, independent of the tracking performance, also shows that comparing likelihood scores between different probabilistic models can be difficult. In light of this, we decide to keep using Σ = I for the better tracking error.
2) dUKF: We also repeat the evaluation of different values of the parameters α, κ and β for the dUKF described in Experiment II-B.1. The experiment confirms our finding from the simulation experiment that the exact choice of the values does not have a significant effect on the filter performance. We thus keep the values at α = 1, κ = 0.5 and β = 0.

D. Benchmarking
Table S11 extends the results from Table VI with data for the dPF-G and dPF-G-lrn. Interestingly, we find that the difference in performance between the dPF variants with learned or analytical observation update is not as pronounced as in the results we obtained for the simulation experiment (Section II-B.4). In particular, the dPF-G-lrn performs as badly as the dPF-G on this task.

Table S11: Results on KITTI: Comparison between the DFs and LSTM (mean and standard error). Numbers for prior work BKF*, LSTM* taken from [4] and DPF* taken from [5]. BKF* and DPF* use a fixed analytical process model while our DFs learn both sensor and process model. m/m and deg/m denote the translation and rotation error at the final step of the sequence divided by the overall distance traveled.

APPENDIX IV EXTENDED EXPERIMENTS: PLANAR PUSHING

b) Process Model: Tables S12 and S13 show the architecture for the learned process model and the heteroscedastic process noise. One problem we noticed is that the estimates for l sometimes diverge during filtering if the DFs estimate that the pusher is in contact with the object while it is not. Just as for the absolute position of the vehicle in the KITTI task, we thus found it important for the stability of the dUKF and dMCUKF to not make the heteroscedastic process noise model dependent on l.
Note that in the filter state, we measure p_o and r in millimeters and θ and α_m in degrees. To avoid too large differences between the magnitudes of the state components, we downscale l by a factor of 100. n is a dimensionless unit vector and s should take values between 0 and 1.
To keep the filters stable during training, we found it necessary to enforce maximum and minimum values for α_m and l. Both α_m and l cannot become negative. The opening angle of the friction cone, α_m, should also not be larger than 90°, while we limit l to the range [0.1, 5000] to ensure that the computations in the analytical model remain numerically stable.

Figure caption (sensor model architecture): We use 6-channel RGBXYZ images as input for computing the object position and contact-related state components. The object orientation is estimated relative to the initial orientation by comparing the RGB glimpse centered on the current estimated object position to the initial one. White boxes represent tensors, green arrows and boxes indicate network layers, whereas black arrows represent dataflow without processing. For convolution (conv) and deconvolution (deconv) layers, the numbers in each tensor are the kernel size and number of output channels of the layer that produced it. For fully connected layers (fc), the number corresponds to the number of output channels. With the exception of the output layers, all convolution, deconvolution and fully connected layers are followed by ReLU non-linearities. The (de)convolution layers also use layer normalization.

Table S14 extends the results from Table VII with data for the dPF-G. In contrast to the other DF variants, learning complex noise models for the pushing task is not successful for the dPF-G. While the NLL can be further decreased when the noise models are heteroscedastic instead of constant, this comes at the cost of a significantly decreased tracking performance.

Table S14: Results for planar pushing: Translation (tr) and rotation (rot) error and negative log likelihood for the DFs with different noise models (mean and standard error). The hand-tuned DFs use fixed noise models, whereas for the other variants the noise models are trained end-to-end through the DFs. R_c indicates a constant observation noise model and R_h a heteroscedastic one (same for Q). The best result per DF and metric is highlighted in bold.