Introduction

In computer vision, motion perception is formulated as optical flow estimation: optical flow refers to the projection of an object's apparent motion within a scene onto the image plane of a vision system or visual sensor. As a fundamental component of visual analysis, optical flow estimation plays an essential role in various high-level vision tasks, including video understanding [19], behavioral recognition [38], and object tracking [2].

Classical optical flow estimation methods minimized intricate energy functions [3, 36], yet suffered from slow inference, an inability to produce dense optical flow, and limited robustness in complex situations. With the advent of deep learning, optical flow techniques have improved markedly. FlowNet [6] pioneered the use of convolutional neural networks (CNNs) for dense optical flow estimation, and several approaches based on deep neural networks (DNNs) were subsequently developed. Currently, dense optical flow estimation is divided principally into supervised and unsupervised learning methods. Unlike in many other vision tasks, ground-truth labels for dense optical flow are rarely available for real image pairs, so supervised optical flow techniques are typically trained on synthetic data. However, an inherent mismatch exists between real and synthetic data, leading to a disparity in data distribution that must be addressed [30]. The cost of rendering dense labels and the limitations of synthetic data increase the appeal of unsupervised optical flow estimation.

Unsupervised learning frameworks use unlabeled videos [51], deriving a photometric loss that measures the difference between the target image and the (inversely) warped source image based on the dense flow field predicted by the network. However, image-warping-based unsupervised learning faces multiple challenges, including brightness and color variations and motion blur in multi-frame dynamic environments. More importantly, because images are 2D projections of 3D space, spatial occlusion induced by moving objects causes pixel information to be lost across frames; this loss cannot be recovered by image warping. Such factors mislead the networks and degrade performance.

Several studies sought to solve these problems by designing more robust loss functions [15, 31], using masks to exclude occluded regions from supervision [49], or tracking occluded pixels in adjacent frames [17]. Based on self-supervised learning, another line of studies augments the images by adding random noise patches to simulate occluded scenes [23, 26, 29]. However, a prevalent limitation is that most existing approaches estimate optical flow from pairs of images. Although a few methods employ multiple frames [17, 23], they still use temporally static CNNs without recurrent feedback in the time dimension. Moreover, motion is highly related to temporal dynamics, including illumination shifts, spatial-perspective translations, and motion-induced phenomena such as blur and occlusion, underlining the significance of temporal dynamic relationships. Yet, existing self-supervised learning methods tend to apply arbitrary transformations to image pairs, such as randomly adding regional noise to one frame to simulate occlusion [23, 39].

Considering the above, we argue that motion perception should be learned in extended temporally dynamic environments to grasp the essential laws of motion. Correspondingly, we constructed a CNN-RNN-based model that estimates optical flow in dynamic environments, namely ULDENet (Unsupervised Learning in Dynamic Environments). We trained the model using temporal causal sequences. To ensure the model adeptly understands the intricacies of object occlusion and the dynamics of scene changes, we incorporated three dynamic training enhancers (DTE) based on self-supervised learning: a dynamic occlusion enhancer (DOE), a content variation enhancer (CVE), and a spatial variation enhancer (SVE). The DOE extracts sub-object blocks from the original data distribution and simulates the random natural motion of objects across multi-frame images, allowing the model to understand object occlusions over prolonged times. We use a mixed supervision strategy, combining unsupervised and self-supervised losses when training with the DOE; this simultaneously and reliably supervises both the occluded regions and the occluders themselves, thus enhancing generalization to occlusion scenes. The CVE simulates variations in image attributes, such as object brightness and chromaticity dynamics over time. Likewise, the CVE simulates regional motion blur and defocus blur caused by high-speed movement, enhancing model robustness in such challenging scenes. Finally, the SVE simulates spatial object variations caused by camera shake and vibration in dynamic environments, as well as gradual translation and rotation of the viewing perspective.

In terms of model structure, we draw inspiration from the temporal predictive coding of the human visual system [28, 35], in which higher-order neurons send feedback signals to lower-order neurons. We exploit both spatial and temporal recurrence, building a spatial-temporal dual recurrent block with a predictive coding arrangement. By leveraging the temporal smoothness of motion, temporal dynamic networks can propagate prior motions through hidden states to provide more reliable references for untraceable pixels. Moreover, motion estimation for an occluded object is more like predicting its location in the next frame than searching for correspondence, which fits well with the predictive coding mechanism and is one motivation for introducing temporal dynamics.

Ultimately, our network, with few parameters and fast inference, exhibits stable convergence over sequences of variable length and outperforms its counterparts in ablation studies. Our strategy is motivated by many recent studies, including [9, 18, 20, 23]. The contributions of this study are:

  • We leverage temporal dynamic modeling to aid unsupervised learning-based optical flow estimation and design an efficient spatiotemporal dual recurrent structure that recursively processes real-time video sequences of arbitrary length. The outcomes are comparable to several state-of-the-art models, with the advantages of low memory and computational overhead.

  • We built three self-supervised training enhancers for multi-frame dynamic environments. These substantially improve model generalization in challenging environments; errors on the Sintel dataset are reduced by 20%. The method is flexible and compatible with multi-frame methods without additional computational or memory overhead.

  • We use two strategies to address the occlusion problem, i.e., simulation of the occlusion phenomenon by dynamically moving occluders based on the Markov property, and temporal smoothness regularization based on the temporal continuity of motion, thus alleviating the inadequate supervision of occluded regions.

The remainder of this work is organized as follows. In Sect. “Related work”, recent trends in unsupervised learning-based optical flow estimation are briefly reviewed. The basics of unsupervised learning and our notational conventions are introduced in Sect. “Preliminaries and notation”. The network is introduced in Sect. “The basic approach”. Section “Dynamic training enhancement” presents the three self-supervised learning-based training enhancers. Experiments and ablation studies are described in Sect. “Experiments”. Finally, we conclude the study in Sect. “Conclusion and limitations”.

Related work

Supervised methods require annotated flow ground truths to train networks. FlowNet [6] was the first work to learn optical flow estimation by training a fully convolutional network on the synthetic FlyingChairs dataset. FlowNet2 [14] then proposed stacking multiple networks for iterative improvement. SpyNet [34] is a spatial pyramid network that estimates optical flow in a coarse-to-fine manner and handles challenging scenes with large displacements. PWCNet [41] and LiteFlowNet [11] warp features and calculate cost volumes at all pyramid layers from the intermediate feature maps of CNNs. IRR-PWC [13] designs pyramid networks using iterative residual refinement schemes. Recently, RAFT [43] proposed to estimate flow fields using a 4D correlation volume and a recurrent network, yielding excellent performance. CSFlow [37] further leveraged cross-strip correlation for optical flow prediction tailored to autonomous driving scenarios and achieved higher performance.

Unsupervised approaches circumvent the need for labels by optimizing photometric consistency with regularization. Yu et al. [51] were the first to present a method for learning optical flow based on the constancy of pixel brightness and motion smoothness. Several subsequent studies improved performance step by step, for example by adding occlusion reasoning [31, 49], multi-frame extension [10, 18, 42], epipolar constraints [54], 3D geometric constraints [50, 55], and stereo depth [22, 48]. Based on self-supervised learning, another branch of work improves performance by learning the flow of ambiguous regions in a knowledge-distillation manner [23, 24, 26]. There is a growing tendency to improve performance by integrating multiple strategies [20, 29].

Occlusion remains a fundamental issue during unsupervised learning. Usually, an occlusion mask is employed to remove the photometric losses of occluded regions [49]; thus, such regions are never supervised. Some studies have proposed solutions from a 3D geometric perspective. NccFlow [46] introduced two novel loss functions for unsupervised optical flow learning based on the geometric laws of non-occlusion. DOPlearning [47] jointly modeled unsupervised training of optical flow, depth, and pose. In occluded regions, depth and camera motion can provide more reliable motion estimation, guiding unsupervised optical flow learning. Additionally, using self-supervised learning, DDflow [24] introduced the concept of using rectangular noise patches to simulate occlusion phenomena. This idea was further refined by Selflow [26], which employed a super-pixel algorithm to generate occlusions with a more natural shape. This strategy of randomly adding noise to images has been widely adopted in recent studies [27, 29, 39].

However, we hold the viewpoint that occlusion is a long temporal process influenced by interactions among multiple moving objects across multiple frames. Abrupt masking through random noise, as utilized in the aforementioned approaches, does not accurately reflect the smooth motion of objects in natural scenes that adheres to the inertia criterion. Furthermore, images contaminated by regional random noise could be considered outliers (outside the distribution of the original dataset), which might degrade network performance on the original dataset. Considering this issue, we address the problem by modeling smooth and dynamic motion scenarios. We achieve this by extracting sub-objects from the original data distribution and simulating smooth dynamic occlusions based on the Markov property. By implementing a well-designed mixed loss function, we attain complete supervision of both the flows within occluded regions and the occluders themselves.

Temporal dynamic models based on RNNs are not commonly used for optical flow estimation. A few studies modeled multi-frame optical flow estimation using RNNs and LSTMs [9, 33]; however, these approaches rely on conventional supervised training and are applicable to only a few synthetic sequences. Several unsupervised learning approaches employ multi-frame data to estimate optical flow [10, 17, 23, 26]. However, these methods remain temporally static and thus cannot handle real-time causal sequences of arbitrary length. Temporal stasis means that a technique does not change over time or incorporate time as a variable; in other words, it assumes that the system is stationary over time. Most existing models are static because, unlike temporal RNNs, they do not incorporate feedback along the time dimension and do not maintain historical motion context over long periods, although such context is helpful when predicting the locations of occluded regions.

In this work, we construct a spatial-temporal dual RNN that learns optical flow in an unsupervised manner over a long duration.

Preliminaries and notation

Here, we first briefly introduce the general framework of our unsupervised optical flow and then define the notation used. Let a causal RGB image sequence be \({\mathcal {I}}=\{I_0,I_1,\ldots ,I_{t-1},I_{t}\}\). We seek to train a model that estimates the current optical flow \(F_{t-1\rightarrow t} \in {\mathbb {R}}^{H \times W \times 2}\) from \({\mathcal {I}}\), i.e., \(F_{t-1\rightarrow t} = f ({\mathcal {I}};\varTheta )\), where \(\varTheta \) is a set of trainable parameters.

We employ a general warp-based strategy in our basic unsupervised learning approach, by which a network is implicitly trained via view synthesis. Specifically, each image \(I_t\) is reconstructed by reference to the next frame \(I_{t+1}\) via inverse warping, i.e., \({\hat{I}}_t = I_{t+1}({\textbf{p}}+F_{t\rightarrow t+1}({\textbf{p}}))\), where \({\textbf{p}}\in {\mathbb {R}}^{H\times W}\) is the set of spatial positions across the entire image. We then need only define a similarity evaluation function \(\rho (\cdot )\) between the reconstructed image \({\hat{I}}_t\) and the original image \({I}_t\), i.e., \({\mathcal {L}}_{\textrm{ph}} \sim \sum _{{\textbf{p}}} \rho ({\hat{I}}(\varTheta ), I)\); this implicitly supervises the optical flow without using labels. The basic similarity loss is L1, i.e., \({\mathcal {L}}_{\textrm{ph}} \sim \sum _{{\textbf{p}}} |{\hat{I}}(\varTheta )- I|\), which simply considers pixel-wise similarities between pairs of images. More robust losses that focus on structural similarity, such as the SSIM loss and the census loss [31], are widely used to overcome variations in color and brightness across frames. In this work, we adopted the census loss based on the census transformation [53].
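The snippet below is a minimal PyTorch-style sketch of this warping-based photometric supervision, assuming (B, C, H, W) tensors; the helper names are illustrative, and an L1 penalty stands in for the census loss that we actually use.

```python
import torch
import torch.nn.functional as F


def inverse_warp(src, flow):
    """Sample `src` (e.g. I_{t+1}) at p + F_{t->t+1}(p) to reconstruct I_t.

    src:  (B, C, H, W) tensor to be warped
    flow: (B, 2, H, W) flow in pixels, channel 0 = x, channel 1 = y
    """
    b, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=src.device),
                            torch.arange(w, device=src.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float()              # (2, H, W), p = (x, y)
    target = grid.unsqueeze(0) + flow                        # p + F(p)
    # Normalize coordinates to [-1, 1] as required by grid_sample
    tx = 2.0 * target[:, 0] / (w - 1) - 1.0
    ty = 2.0 * target[:, 1] / (h - 1) - 1.0
    return F.grid_sample(src, torch.stack((tx, ty), dim=-1), align_corners=True)


def photometric_loss(img_cur, img_next, flow, vis_mask):
    """Masked L1 photometric loss; vis_mask is 1 for non-occluded pixels."""
    recon = inverse_warp(img_next, flow)
    diff = (recon - img_cur).abs().mean(dim=1, keepdim=True)
    return (diff * vis_mask).sum() / vis_mask.sum().clamp(min=1.0)
```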

Fig. 1

Inference architecture of the spatial-temporal dual recurrent block. The spatial recurrence is modified from the basic flow-inference pipeline of PWCNet [41], comprising a cascade of image warping, volume correlation, flow estimation, and context networking, which decodes the flow field from small to large scales. Using a self-guided, warping-based gated recurrent unit (SGW-GRU) block, we introduce temporal recurrence via a predictive coding structure that transmits feedback from deep to shallow layers. Subfigure (C) demonstrates that the network infers temporal causal sequences of arbitrary length. Please see Sect. “Network structure” for the details

Given the aperture problem and ambiguities of local appearance, supervision based solely on the photometric loss does not sufficiently constrain relatively textureless or repetitive patterns. Edge-aware first- and second-order smoothness regularization is commonly used to reduce this ambiguity [45]. Given a frame \(I_{t-1} \) with a flow field \(F_{t-1\rightarrow t}\), the k-th order edge-aware loss \({\mathcal {L}}_{s m(k)}\) is

$$\begin{aligned} \begin{aligned}&{\mathcal {L}}_{s m(k)}\sim \sum _{\textbf{p}}\left( w_{\textbf{p}}^x\cdot \left| \frac{\partial ^{k} {F}_{t-1\rightarrow t}}{\partial x^{k}}\right| _{\textbf{p}}+w_{\textbf{p}}^y\cdot \left| \frac{\partial ^{k} {F}_{t-1\rightarrow t}}{\partial y^{k}}\right| _{\textbf{p}}\right) ,\\&w_{\textbf{p}}^x = \exp \left( -\lambda \sum _{c}\left| \frac{\partial I_{t-1}}{\partial x}\right| _{\textbf{p}}\right) ,\\&w_{\textbf{p}}^y = \exp \left( -\lambda \sum _{c}\left| \frac{\partial I_{t-1}}{\partial y}\right| _{\textbf{p}}\right) , \end{aligned} \end{aligned}$$
(1)

where \(w_{\textbf{p}}^x\) and \(w_{\textbf{p}}^y\) are attenuation weights determined by the first-order image gradients. This intuitively imposes stronger smoothness on the flow field in areas with similar pixels. Following previous work [20, 23], we apply the first-order smoothness loss to the Sintel dataset and the second-order smoothness loss to the KITTI dataset; \(\lambda \) is set to 30 for all datasets.
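For concreteness, the first-order (k = 1) case of Eq. (1) can be sketched as follows; tensor layouts are assumed to be (B, C, H, W), and the snippet is illustrative rather than the exact implementation.

```python
import torch


def edge_aware_smoothness(flow, image, lam=30.0):
    """First-order edge-aware smoothness; flow: (B, 2, H, W), image: (B, 3, H, W)."""
    # First-order differences of the flow field along x and y
    flow_dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs()
    flow_dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs()
    # Image gradients summed over colour channels drive the attenuation weights
    img_dx = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().sum(1, keepdim=True)
    img_dy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().sum(1, keepdim=True)
    w_x = torch.exp(-lam * img_dx)      # w_p^x in Eq. (1)
    w_y = torch.exp(-lam * img_dy)      # w_p^y in Eq. (1)
    return (w_x * flow_dx).mean() + (w_y * flow_dy).mean()
```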

Motion occlusion is challenging during unsupervised learning and creates pixel-wise information loss from \(I_t\) to \(I_{t-1}\). The most common solution is to skip the loss calculation for the occluded region by inferring the binary occlusion mask \(O_{t-1\rightarrow t} \in \{0,1\}^{H\times W}\) from \(I_{t-1}\) to \(I_t\), where \(O_{t-1\rightarrow t}({\textbf{p}}) = 0 \) indicates that location \({\textbf{p}}\) of \(I_{t-1}\) is occluded in \(I_{t}\), and vice versa. Following previous methods [20, 23], we incorporate a forward–backward check [16] into our basic approach when estimating the occlusion mask.
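A common realization of this forward–backward check is sketched below; `inverse_warp` reuses the earlier helper, and the threshold constants follow the convention of [16] rather than values stated in this paper.

```python
import torch


def occlusion_mask(flow_fw, flow_bw, alpha1=0.01, alpha2=0.5):
    """flow_fw: F_{t-1->t}, flow_bw: F_{t->t-1}, both (B, 2, H, W).

    Returns a float mask with 1 for visible pixels and 0 for occluded ones.
    """
    # Bring the backward flow into the coordinate frame of the forward flow
    flow_bw_warped = inverse_warp(flow_bw, flow_fw)
    # For visible pixels, the forward and (warped) backward flows should roughly cancel
    sq_diff = (flow_fw + flow_bw_warped).pow(2).sum(dim=1, keepdim=True)
    sq_mag = flow_fw.pow(2).sum(1, keepdim=True) + flow_bw_warped.pow(2).sum(1, keepdim=True)
    occluded = sq_diff > (alpha1 * sq_mag + alpha2)
    return (~occluded).float()
```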

We can simplify the notation as follows. Given two adjacent time steps \(\{a,b \mid b>a\}\), we denote the forward flow \(F_{a\rightarrow b}\) as \(F_{a}\) and the backward flow \(F_{b\rightarrow a}\) as \(F^{-1}_{b}\); \({\textbf{p}} \in {\mathbb {R}}^{H\times W} \) is the set of spatial locations across a specific image; \({\mathcal {W}}_{a}(\cdot ):I_{b}\longmapsto {\hat{I}}_{a} \) is the warping operation from \(I_b\) via \(F_{a \rightarrow b }\); and \({\textbf{E}}_{\textbf{p}}(\cdot )\) is the statistical expectation over \({\textbf{p}}\).

The basic approach

In this section, we introduce the spatial-temporal dual recurrent network and our basic unsupervised training approach, which uses unlabeled temporal causal sequences. A novel form of occlusion-aware temporal smoothness regularization (TSR) is introduced in Sect.  “Occlusion-aware temporal smoothness regularization”.

Network structure

We constructed a dynamic RNN that handles arbitrary temporal series \({\mathcal {I}}=\{I_0,I_1,\ldots ,I_{t-1},I_{t}\}\), receives the previous frame \(I_{t-1}\), current frame \(I_{t} \), and previous hidden state \(H_{t-2}\) as inputs, and outputs the current optical flow \(F_{t-1}\) and hidden state \(H_{t-1}\) for the next time step. Given a pair of images, a five-level feature map pyramid \({\mathcal {P}} = \{I^5,\ldots ,I^1\}\) is generated by a concise feedforward CNN for each image, with the size gradually reducing from \(\frac{H}{4}\times \frac{W}{4} \) to \(\frac{H}{64}\times \frac{W}{64} \).

To ensure spatiotemporal recurrence, we introduce a convolutional GRU block with self-guided warping (SGW), i.e., an SGW-GRU block. The SGW-GRU block operates directly on feature pyramids from low- to high-scale in a spatially recurrent manner. Likewise, the block operates across image sequences in a temporally recurrent manner. For a feature pyramid at level l, it has four basic stages of PWCNet [41], i.e., warping \(\rightarrow \) correlation volume \(\rightarrow \) flow estimator \(\rightarrow \) context network, as shown in Fig. 1 (A). The difference is that an extra convolutional block is added to the context network to generate a temporal hidden state \(H_{t-1}\) characterized by 32-channel feature maps at different scales, which implicitly contain contextual information of motion history during previous moments. The current temporal hidden state is used as input for the next time stage when it is fed back into the SGW-GRU block.

Given moving objects, there are always differences in spatial location between the features of two adjacent time steps. Accordingly, we use the SGW block to self-adjust the spatial locations of the hidden feature maps. In this block, the correlation between the previous hidden state and the current feature is first calculated, and a mini flow estimator then derives a flow field that guides the warping used to adjust the spatial arrangement of \(H_{t-2}^{l+1}\). Finally, a convolutional GRU fuses the hidden state \(H_{t-2}^{l+1}\) with the current features \(I_{t-1}^l\) and outputs the fused feature \(I_{new}\), as shown in Fig. 1 (B). The Conv-GRU module applies the basic principle of the original GRU [5] but replaces the core operations with convolutions. Specifically, the module accepts the previous high-level hidden state \(H_{t-2}^{l+1}\) and the current feature \(I_{t-1}\) as inputs, and generates a fused feature \(I_{new}\) that is input to the flow estimator. This may be formalized as

$$\begin{aligned} \begin{aligned}&z=\sigma \left( {\text {Conv}}_{3 \times 3}\left( {\text {ConCat}}\left[ H_{t-2}^{l+1}, I_{t-1}\right] \right) \right) \\&r=\sigma \left( {\text {Conv}}_{3 \times 3}\left( {\text {ConCat}}\left[ H_{t-2}^{l+1}, I_{t-1}\right] \right) \right) \\&q={\text {Tanh}}\left( {\text {Conv}}_{3 \times 3}({\text {ConCat}}[r\odot I_{t-1} , H_{t-2}^{l+1}])\right) \\&I_{new}=\left( 1-z\right) \odot I_{t-1}+z \odot q. \end{aligned} \end{aligned}$$
(2)

where \(\sigma (\cdot ) \) denotes the sigmoid function; \({\text {ConCat}}(\cdot )\) the concatenation of feature maps; and \(\odot \) the element-wise product.
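A sketch of the fusion in Eq. (2) as a PyTorch module is given below; the channel sizes, the module name, and the use of separate 3×3 convolutions per gate are assumptions for illustration, not the exact ULDENet configuration.

```python
import torch
import torch.nn as nn


class ConvGRUFusion(nn.Module):
    """Fuses the (warped) hidden state H_{t-2}^{l+1} with the current feature I_{t-1}."""

    def __init__(self, feat_ch, hidden_ch=32):
        super().__init__()
        in_ch = feat_ch + hidden_ch
        self.conv_z = nn.Conv2d(in_ch, feat_ch, 3, padding=1)   # update gate z
        self.conv_r = nn.Conv2d(in_ch, feat_ch, 3, padding=1)   # reset gate r
        self.conv_q = nn.Conv2d(in_ch, feat_ch, 3, padding=1)   # candidate q

    def forward(self, feat, hidden):
        cat = torch.cat([hidden, feat], dim=1)
        z = torch.sigmoid(self.conv_z(cat))
        r = torch.sigmoid(self.conv_r(cat))
        q = torch.tanh(self.conv_q(torch.cat([r * feat, hidden], dim=1)))
        return (1.0 - z) * feat + z * q                          # fused feature I_new
```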

Fig. 2

Occlusion-aware temporal smoothness regularization. Scenario A: The pixels in a current frame \(I_c\) become invisible in the next frame \(I_f\); Scenario B: The pixels visible in the current frame \(I_c\) are invisible in the previous frame \(I_p\). Scenarios A and B cause errors in spatial alignment, which should be considered during temporal smoothness regularization. The right side shows the Sintel Clean scene. We impose a temporal smoothing constraint on the current flow field \(F_c\) by tracking motion on previous \(F_p\) and future \(F_f\) frames, assuming constant velocities over short times. As motion is smooth, forward and backward occlusions are usually complementary, making it possible to track the prior motion using \(F_p\) or \(F_f\), as shown in the region of the red dashed box. For more details, please see Sect. “Occlusion-aware temporal smoothness regularization

Regarding temporal recurrence, the feedback signal from the deeper layer at the previous time step is transferred to the current time step to control the coding behavior. This temporal feedback provides prolonged motion-history context, which facilitates predicting the current motion state, as demonstrated on the right side of Fig. 1 (C). Such a design mirrors the temporal predictive coding of the human visual system [28, 35], wherein higher-order neurons send feedback signals to lower-order neurons and thus control their behavior. This mechanism implicitly forces neurons to learn motion perception in a dynamic environment [40]. Integrating the above modules yields a lightweight network with only 2.50 M parameters that handles causal sequences of arbitrary length to estimate optical flow.

Occlusion-aware temporal smoothness regularization

As mentioned in Sect. “Introduction”, simply masking the occluded region provides inadequate supervision, and the spatial smoothness loss evaluates only low-level RGB similarity. As compensation, we propose temporal smoothness regularization (TSR) to provide more reliable supervision for occluded regions. First, we define four adjacent frames \(\{ I_{k-1}, I_{k}, I_{k+1}, I_{k+2} \}\) and three motion flow fields \(\{F_{p}, F_{c}, F_{f} \}\), i.e., the forward optical flows of \(I_{k-1}\rightarrow I_{k}\), \(I_{k}\rightarrow I_{k+1}\), and \(I_{k+1}\rightarrow I_{k+2}\). Owing to physical laws, the motion trajectory M(t) of an object is always a smooth differentiable curve, and a video sequence can be regarded as a sampling of its locations. Given a sufficiently short sampling interval \(\varDelta t\), we can create a small temporal window and derive a first-order Taylor approximation of M(t), i.e., \(M(t) = M(t_0) + \varDelta t \cdot \left. \frac{\textrm{d} M}{\textrm{d} t}\right| _{t=t_0}\), or \(M(t) = M(t_0) + \varDelta t \cdot V(t_0)\), where V can be regarded as a velocity vector, i.e., the optical flow within the defined temporal window. Under this assumption, the same object can be regarded as moving at a similar speed across \(\{F_{p}, F_{c}, F_{f} \}\). Thus, a triple of relations is established based on the inverse warping operation:

$$\begin{aligned} \begin{aligned}&{\hat{F}}_c^{f} = F_{f}({\textbf{p}}+F_{c}({\textbf{p}})),~{\hat{F}}_c^{p} = F_{p}({\textbf{p}}+F^{-1}_{p}({\textbf{p}})),\\&{\hat{F}}_c^{p} \approx F_c \approx {\hat{F}}_c^{f}, \end{aligned} \end{aligned}$$
(3)

where \({\hat{F}}_c^{f}\) is \(F_c\) reconstructed from \(F_f\); \({\hat{F}}_c^{p}\) is \(F_c\) reconstructed from \(F_p\); and \(F^{-1}_{p}\) denotes the backward flow of \(I_{k}\rightarrow I_{k-1}\). Under the assumption of constant velocity, the above equation yields the basic temporal smoothness constraint. Importantly, however, warping is not reversible, and an occlusion problem thus arises during either forward or backward optical flow, as shown on the left of Fig. 2. Typically, there are two problematic scenarios. In the first, pixels in the current frame become invisible in the next frame (A). In the second, pixels visible in the current frame are invisible in the previous frame (B). Both scenarios cause errors in spatial alignment. For simplicity, \(\varOmega _A\) and \(\varOmega _B\) denote the locations where pixels of \(F_c \) are invisible in \(F_p\) and \(F_f\), respectively. Thus, we generate two occlusion masks \(O_A, O_B \) via forward–backward checking:

$$\begin{aligned} O({\textbf{p}})= {\left\{ \begin{array}{ll} 1, &{} {\textbf{p}}\notin \varOmega \\ 0, &{} {\textbf{p}}\in \varOmega \end{array}\right. } \end{aligned}$$
(4)

Temporal smoothness requires computing the temporal derivative, which, in the discrete case, is equivalent to computing first-order differences across frames. Here, we simplify this by directly using the Charbonnier loss to evaluate the differences between frames, thus enforcing smoothness between \(F_c\) and the two adjacent flows. In addition, the displacement magnitude of \(F_c\) is used as a decay coefficient for temporal smoothing, i.e., locations with more violent motion receive weaker smoothing constraints. The final temporal smoothness regularization of the current flow \(F_c\) is

$$\begin{aligned} \begin{aligned} \ell _{tsm} \sim&\frac{\sum _{{\textbf{p}}} \left[ {\textbf{C}}\left( {{\mathcal {S}}}({\hat{F}}_c^{p}), F_c\right) \odot O_A \bigg / |F_c| \right] }{\sum _{{\textbf{p}}}O_A({\textbf{p}})} \\&+ \frac{\sum _{{\textbf{p}}} \left[ {\textbf{C}}\left( {{\mathcal {S}}}({\hat{F}}_c^{f}), F_c\right) \odot O_B\bigg / |F_c|\right] }{\sum _{{\textbf{p}}}O_B({\textbf{p}})}. \end{aligned} \end{aligned}$$
(5)

where \({\mathcal {S}} (\cdot )\) stops the gradient in the computational graph and \({\textbf{C}} (\cdot )\) is the Charbonnier loss. At locations \({\textbf{p}} \in \{\varOmega _A \cap \varOmega _B\} \), we do not perform any TSR because the adjacent frames contain no reliable prior flows. At positions \(\{ {\textbf{p}} \mid {\textbf{p}} \notin \varOmega _A, {\textbf{p}} \in \varOmega _B \}\), we add only the temporal smoothness from the previous flow \(F_p\). Similarly, the temporal smoothness from the future flow \(F_f\) is available only at positions \(\{ {\textbf{p}} \mid {\textbf{p}} \notin \varOmega _B, {\textbf{p}} \in \varOmega _A \}\). Ultimately, the temporal smoothness terms from both \(F_p\) and \(F_f\) are applied at positions \(\{ {\textbf{p}} \mid {\textbf{p}} \notin \varOmega _A \cup \varOmega _B \}\). The optimized \(F_c\) thus approximates the linear interpolation between \(F_p\) and \(F_f\).
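One way to realize Eq. (5) is sketched below; the mask semantics (1 where the corresponding prior flow is traceable), the Charbonnier parameters, and the exact form of the decay term are illustrative assumptions, and `inverse_warp` reuses the earlier helper.

```python
import torch


def charbonnier(x, y, eps=0.01, q=0.4):
    """Generalized Charbonnier penalty used as C(., .)."""
    return ((x - y).abs() + eps) ** q


def temporal_smoothness(flow_p, flow_p_bw, flow_c, flow_f, mask_prev, mask_fut):
    """flow_*: (B, 2, H, W); mask_prev / mask_fut: (B, 1, H, W), 1 = prior usable."""
    # Reconstruct F_c from the future flow F_f (warped by F_c) and from the
    # previous flow F_p (warped by the backward flow F_p^{-1}), as in Eq. (3)
    f_c_from_f = inverse_warp(flow_f, flow_c).detach()        # stop-gradient S(.)
    f_c_from_p = inverse_warp(flow_p, flow_p_bw).detach()
    # Weaker constraint where the current motion is large (the 1/|F_c| decay)
    decay = 1.0 / (flow_c.norm(dim=1, keepdim=True) + 1.0)
    term_p = charbonnier(f_c_from_p, flow_c).sum(1, keepdim=True) * decay * mask_prev
    term_f = charbonnier(f_c_from_f, flow_c).sum(1, keepdim=True) * decay * mask_fut
    return term_p.sum() / mask_prev.sum().clamp(min=1.0) + \
           term_f.sum() / mask_fut.sum().clamp(min=1.0)
```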

Our design includes an a priori assumption: given the temporal smoothness of motion, \(\varOmega _A\) and \(\varOmega _B\) tend to overlap only slightly, so objects occluded in the next frame are often visible in the previous frame. Therefore, motion information lost to occlusion during forward propagation can be recovered by examining the past flow. In the lower right of Fig. 2, the background area occluded by the arm in the forward inference is visible in the backward inference; the motion information in the previous frame can be tracked using the backward flow as the reference. The temporal smoothness constraint affords more reliable supervision of occluded regions than spatial smoothness does, as shown in Table 5.

Training in the video sequence

During each training iteration, we sequentially feed N consecutive frames into the network and simultaneously supervise the \(N-1\) optical flows generated. All strategies in Sect. “Preliminaries and notation” are used to construct the loss function. Thus, we employ warping-based image photometric similarity, spatial smoothing loss, and occlusion masking. Note that, as we employ a temporally recursive network, we infer causal sequences and generate optical flow in only one direction. Therefore, during training, we first infer the forward optical flow of the entire sequence and then reverse the image sequence to infer backward optical flow. Finally, a forward–backward check is performed to obtain the occlusion mask sequence \(\{O_t\}_{t=1}^{N-1}\). Specifically, at the end of the sequence, we update the weights to decrease the following loss:

$$\begin{aligned} {\mathcal {L}}=\frac{1}{N-1} \sum _{t=1}^{N-1} {\mathcal {L}}_{t} \end{aligned}$$
(6)

where \({\mathcal {L}}_t \) is the combination of the multi-scale loss for image pair \((I_t, I_{t+1})\) and the temporal smoothness loss:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_t= \,&\lambda _{1}\ell _{sm}(F_t)+\lambda _{2}\ell _{tsm}(F_{t-1},F_t,F_{t+1})\\&+\ell _{warp}(I_t,I_{t+1},F_t,O_t). \end{aligned} \end{aligned}$$
(7)

The hyperparameters \( \lambda _{1}\) and \(\lambda _{2}\) balance the individual terms of the loss function.
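The control flow of one training iteration over a causal sequence (Eqs. 6–7) can be sketched as follows; the model signature and the helpers (reused from earlier sketches, plus a hypothetical `temporal_smoothness_term`) are placeholders rather than the exact ULDENet interface.

```python
def sequence_loss(model, frames, lambda1, lambda2):
    """frames: list of N image tensors; returns the averaged loss of Eq. (6)."""
    # Forward inference over the causal sequence (one direction only)
    fwd_flows, hidden = [], None
    for t in range(len(frames) - 1):
        flow, hidden = model(frames[t], frames[t + 1], hidden)
        fwd_flows.append(flow)
    # Backward inference on the reversed sequence, used only for the occlusion masks
    bwd_flows, hidden = [], None
    for t in range(len(frames) - 1, 0, -1):
        flow, hidden = model(frames[t], frames[t - 1], hidden)
        bwd_flows.insert(0, flow)                      # bwd_flows[t]: frame t+1 -> t

    total = 0.0
    for t in range(len(frames) - 1):
        occ = occlusion_mask(fwd_flows[t], bwd_flows[t])
        loss_t = photometric_loss(frames[t], frames[t + 1], fwd_flows[t], occ)
        loss_t = loss_t + lambda1 * edge_aware_smoothness(fwd_flows[t], frames[t])
        if 0 < t < len(frames) - 2:                    # TSR needs both neighbouring flows
            loss_t = loss_t + lambda2 * temporal_smoothness_term(
                fwd_flows[t - 1], fwd_flows[t], fwd_flows[t + 1], bwd_flows)
        total = total + loss_t
    return total / (len(frames) - 1)
```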

Dynamic training enhancement

As indicated by DDFlow [25], self-supervised distillation effectively improves unsupervised optical flow estimation and has been widely adopted [20, 23, 26, 29]. Generally, a teacher model \(f_t(\cdot ) \) processes original sample pairs \({\mathcal {I}} = \{I_a, I_b\}\) to generate an optical flow pseudo-label \(F_a^*\) that trains a student model \(f_s(\cdot ) \). Before \({\mathcal {I}}\) is fed to \(f_s(\cdot )\), a series of specific image transformations \({\mathcal {T}}_I(\cdot ): {\mathcal {I}} \longmapsto \widetilde{{\mathcal {I}}}\) is applied to increase scene diversity. Such data augmentation yields a lower-confidence optical flow \( F_a\), which is then self-supervised using the pseudo-label from \(f_t(\cdot )\). To ensure spatial consistency, the pseudo-label must be transformed consistently using \({\mathcal {T}}_F(\cdot ): F^* \longmapsto \widetilde{F^*}\). Self-supervision based on the pseudo-label is then achieved by considering the Charbonnier similarity between the pseudo-label \({\mathcal {T}}_F(F_a^*)\) and \(F_a\), thus

$$\begin{aligned} \ell _{self} \sim {\textbf{E}}_{{\textbf{p}}}\left( \left| {F}_{a}({\textbf{p}}) - {\mathcal {S}}\big (\widetilde{F_{a}^{*}}({\textbf{p}})\big )\right| +\epsilon \right) ^{q} \end{aligned}$$
(8)

Here, we extend this strategy to multi-frame sequences and simulate multiple natural dynamic variations, so that the network comes to understand occlusion and variation in a dynamic environment. Specifically, we use three training enhancers, i.e., a DOE, an SVE, and a CVE, as described below.

Dynamic occlusion enhancer

We aim to simulate more naturalistic occlusion hallucinations in the DOE. There are two steps. The first is (i) random cropping. This operation is critical because, in videos, objects often move out of the frame due to their inherent motion, making it impossible to find correspondences between consecutive frames; these could be termed “out-of-boundary” occlusions. By cropping images to a smaller size, we can simulate this type of occlusion. We crop the image sequences during preprocessing, prior to the occlusion transformation. The second step is (ii) dynamic occlusion simulation. Given a set of frames, we divide the images into multiple subregions using a super-pixel segmentation algorithm [1]. Subsequently, n subregions are randomly selected as “occluders”. Rather than adding random noise, we give each occluder a natural texture by extracting textures from the original sample batches. Finally, we simulate occlusions by randomly placing the n occluders into the first frame of a sequence.
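A hedged sketch of the occluder-extraction step is shown below, using SLIC super-pixels [1] from scikit-image; the parameter values are illustrative, and the wrap-around boundary handling of np.roll is a simplification.

```python
import numpy as np
from skimage.segmentation import slic


def extract_occluders(image, n_occluders=3, n_segments=150):
    """image: (H, W, 3) array; returns a list of (textured patch, boolean mask) pairs."""
    segments = slic(image, n_segments=n_segments, compactness=10)
    labels = np.random.choice(np.unique(segments), size=n_occluders, replace=False)
    occluders = []
    for lab in labels:
        mask = (segments == lab)
        occluders.append((image * mask[..., None], mask))   # keep the natural texture
    return occluders


def paste_occluders(frame, occluders, offsets):
    """Composite each occluder into `frame` at an integer (dy, dx) offset."""
    out = frame.copy()
    for (patch, mask), (dy, dx) in zip(occluders, offsets):
        shifted_mask = np.roll(mask, shift=(dy, dx), axis=(0, 1))
        shifted_patch = np.roll(patch, shift=(dy, dx), axis=(0, 1))
        out[shifted_mask] = shifted_patch[shifted_mask]
    return out
```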

Simulation of dynamic motion

Unlike previous studies, we dynamically simulate the motions of occluders across multi-frame sequences. Object motion is smooth in nature, and the motion at the next time step depends on the motion at the current moment. To simulate random motion while obeying this rule, we leverage a Markov chain. Specifically, we assume that the random object movement \({\textbf{S}} (t)\) can be decomposed into two orthogonal components \({\textbf{S}}(t) = [U(t),V(t)]\). For any time step t, we introduce the Markov property by assuming that the motion state depends only on the motion at the previous time step \(t-1\):

$$\begin{aligned} {\text {Pr}}\left[ {\textbf{S}}(t)=s_{t} \mid {\textbf{S}}(t-1)=s_{t-1},\ldots ,{\textbf{S}}(0)=s_{0}\right] ={\text {Pr}}\left[ {\textbf{S}}(t)=s_{t} \mid {\textbf{S}}(t-1)=s_{t-1}\right] \end{aligned}$$
(9)

The motion states [UV] at time t are sampled from the 2D Gaussian distributions:

$$\begin{aligned} {[}U(t),V(t)] \sim {\mathcal {N}}( \varvec{\mu },\varvec{\varSigma }) \end{aligned}$$
(10)

where we let

$$\begin{aligned} \varvec{\mu }=\left( \begin{array}{l} \mu _{U}(t) \\ \mu _{V}(t) \end{array}\right) , \quad \varvec{\varSigma }=\left( \begin{array}{cc} \sigma _{U}^{2} &{}\quad 0 \\ 0 &{}\quad \sigma _{V}^{2} \end{array}\right) , \end{aligned}$$
(11)

by (simplistically) assuming that U(t) and V(t) are independent. \(\sigma _{U}\) and \(\sigma _{V}\) are constants that control the variation in motion, and the means \(\mu _{U},\mu _{V}\) are set to the motion state at the previous moment, i.e., \([\mu _{U}(t),\mu _{V}(t)]^T = [U(t-1), V(t-1)]^T \). In this manner, random smooth motion is simulated for each artificial occluder. For simplicity, we denote the above transformation as \({\mathcal {T}}_I^O(\cdot ): \{I_t\} \longmapsto \{\widetilde{I^{{\mathcal {O}}}_t}\}\) for the image sequence and \({\mathcal {T}}_F^O(\cdot ): \{F^*_t\}\longmapsto \{\widetilde{ F^*_t}\}\) for the pseudo-labels. Visualizations of \({\mathcal {T}}_I^O(\cdot )\) and \({\mathcal {T}}_F^O(\cdot )\) are shown in Fig. 3.
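In practice, the Markov motion model of Eqs. (9)–(11) reduces to a Gaussian random walk on each occluder's per-frame displacement; the sketch below uses illustrative sigma values.

```python
import numpy as np


def simulate_occluder_track(n_steps, sigma_u=2.0, sigma_v=2.0, seed=None):
    """Returns an (n_steps, 2) array of per-frame displacements [U(t), V(t)]."""
    rng = np.random.default_rng(seed)
    track = np.zeros((n_steps, 2))
    state = np.zeros(2)                          # initial motion state [U(0), V(0)]
    for t in range(n_steps):
        # The Gaussian mean is the previous motion state (Markov property)
        state = rng.normal(loc=state, scale=[sigma_u, sigma_v])
        track[t] = state
    return track


# Cumulative positions of one occluder over an 8-frame sequence
positions = np.cumsum(simulate_occluder_track(8, seed=0), axis=0)
```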

Fig. 3

Mixed supervision strategy with dynamic occlusion. To simulate natural object motion, we use a super-pixel algorithm to extract sub-objects of images that serve as naturally textured artificial “occluders”, and then apply a Markov process to simulate the smoothness of random motion, the effect of which is shown in the bottom of the figure (where the occluders are in yellow). Using the known prior occlusion masks, we combine self-supervised and unsupervised learning to ensure mixed supervision (blue box). The details are in Sect. “Dynamic occlusion enhancer

Fig. 4

Qualitative demonstration of temporal dynamic variations, including content, spatial variations and dynamic occlusions. We perform dynamic transformations on the original image sequence (A), such as the monotonic increase in chromaticity and illumination in (B); dynamic shifts and warps in spatial positions (C); and dynamic movements of the artificial occluders in a Markov process (D). Using self-supervised distillation, all these dynamics are well handled by the model. See Sect. “Dynamic training enhancement” for the details of dynamic occlusion 5.1 and content and spatial variations 5.2

Mixed supervision

At moment t, each occluder \(\{\varPsi _n^t\}_{n=1}^{N}\) is regarded as the object closest to the lens, i.e., at the highest level of occlusion. It is then straightforward to generate the occlusion mask \({O}_t\):

$$\begin{aligned} O_t({\textbf{p}})= {\left\{ \begin{array}{ll} 1, &{} {\textbf{p}}\notin \varOmega _{\varPsi } \\ 0, &{} {\textbf{p}}\in \varOmega _{\varPsi } \end{array}\right. } \end{aligned}$$
(12)

where \(\varOmega _{\varPsi } = \{\varPsi _1^t \cup \varPsi _2^t \cup \dots \cup \varPsi _N^t \}\). As shown in Fig. 3, we designed a mixed supervision strategy to drive the model to learn occlusion laws in the simulated dynamic occlusion scenes, formalized as

$$\begin{aligned} \ell _{doe}(t) = \ell _{1}(t)+ \ell _{2}(t), \end{aligned}$$
(13)

where \(\ell _{1}\) avoids losses caused by the occluders, using the pseudo-label to supervise regions without occluders:

$$\begin{aligned} \ell _{1}(t) \sim {\textbf{E}}_{{\textbf{p}}}\left( {\textbf{C}}\left( \widetilde{F^*_t}({\textbf{p}}) , F_t({\textbf{p}})\right) \odot O_t({\textbf{p}})\right) , ~{\textbf{p}} \notin \varOmega _{\varPsi }. \end{aligned}$$
(14)

\(\ell _{2}\) is an unsupervised loss based on image warping, which is available for the regions \({\textbf{p}} \in \varOmega _{\varPsi }\):

$$\begin{aligned} \ell _{2}(t) \sim {{\text { SSIM}}} \big ( {\mathcal {W}}_t (I_{t+1}),I_t \big ) \odot \left( 1-O_t\right) +\ell _{sm}, \end{aligned}$$
(15)

where \( {\textrm{SSIM}}(\cdot ) \) operates only on regions containing occluders via the mask \(1-O_t\). Here, \(\ell _{sm}\) is similar to Eq. 1, but the attenuation term based on the image gradient is modified by replacing the image \(I_t\) with the occlusion mask \(O_t\), thus restricting the smoothness regularization to the subregions of each occluder. The DOE with the mixed loss strategy is evaluated in the ablation studies (see Table 7).
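A schematic of the mixed supervision in Eqs. (13)–(15) is sketched below, reusing the earlier helpers; an L1 photometric term stands in for SSIM, and the occluder-restricted smoothness term is omitted for brevity.

```python
def doe_loss(flow_pred, pseudo_flow, img_cur, img_next, occ_mask):
    """occ_mask (Eq. 12): 1 outside the pasted occluders, 0 inside them."""
    # l1: pseudo-label supervision restricted to occluder-free pixels
    l1 = charbonnier(flow_pred, pseudo_flow.detach()).sum(1, keepdim=True) * occ_mask
    l1 = l1.sum() / occ_mask.sum().clamp(min=1.0)
    # l2: unsupervised warping loss restricted to the occluder regions
    # (the occluder-restricted smoothness term l_sm is omitted in this sketch)
    l2 = photometric_loss(img_cur, img_next, flow_pred, 1.0 - occ_mask)
    return l1 + l2
```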

Spatial and content variation enhancements

The SVE implements a series of spatial transformations, including random sequence cropping, rotation, horizontal and vertical flipping, and thin-plate-spline or CPAB transformations [7], which can be regarded as spatial data augmentation. We extend these spatial transformations to continuous time series and simulate them across time. Natural image sequences exhibit spatial jitter and distortion attributable to environmental vibration and camera shake; several such challenging scenes are included in the Sintel Final dataset [4]. We generalized the spatial variations by applying a series of random affine transformations along the sequence, performing object rotation, distortion, scaling, and translation. Such spatial transformations (except cropping) shift the pixel positions of the entire sequence via \({\mathcal {T}}^S_I: \{I_t\}\longmapsto \{\widetilde{I_t^{{\mathcal {S}}}}\}\), which can also be written as

$$\begin{aligned} \{{\widetilde{I}}_{t}^{{\mathcal {S}}}({\textbf{p}})\}=\big \{I_{t}\big (\tau _{\theta _t}({\textbf{p}})\big )\big \}, \end{aligned}$$
(16)

where \(\tau _{\theta _t}(\cdot )\) transforms the pixel coordinates of image \(I_t\) using the transformation parameters \(\theta _t\). To maintain consistency between the transformed scenes and the pseudo-label \(F^*_t\), the optical flows must undergo a different transformation, because applying different transformation parameters to \(I_t \) and \( I_{t+1}\) changes the optical flow field. Here, we first track the flow variation between \(\{\tau _{\theta _t}, \tau _{\theta _{t+1}}\}\) via the inverse affine transformation, superimpose it on the offset of the original flow field \(F^*_t\), and finally use \(\tau _{\theta _t}(\cdot )\) to ensure spatial consistency with \({\widetilde{I}}_{t}\). The entire process \({\mathcal {T}}^S_F: \{F_t^*\}\longmapsto \{\widetilde{F^*_t}\}\) can be formalized as

$$\begin{aligned} \left\{ \begin{array}{l} F_{\textrm{new}}({\textbf{p}})=\tau _{\theta _{t+1}}^{-1}\left( {\textbf{p}}+F^*_{t}({\textbf{p}})\right) -\tau ^{-1}_{\theta _t}({\textbf{p}}) \\ \widetilde{F_t^*}({\textbf{p}})=F_{\textrm{new}}\left( \tau _{\theta _t}({\textbf{p}})\right) \end{array}\right. \end{aligned}$$
(17)

Given the pseudo-label \(\widetilde{F_t^*}({\textbf{p}})\), the loss function based on self-supervised learning by the SVE is

$$\begin{aligned} \ell _{sve}(t) \sim {\textbf{E}}_{{\textbf{p}}}\left( {\textbf{C}}\left( \widetilde{F^*_t}({\textbf{p}}) , F_t^{{\mathcal {S}}}({\textbf{p}})\right) \right) \end{aligned}$$
(18)

Similar to the SVE, the CVE performs various transformations of the image content. First, we introduce color jitter following basic data-augmentation principles [23], followed by random brightness, saturation, hue, and gamma transformations. Gaussian blur and noise are randomly introduced over the entire image sequence to render the scenes more challenging.

Beyond static augmentation, the content of natural scenes varies greatly, including changes in brightness and hue as illumination varies. We simulated such scenarios by monotonically increasing or decreasing illumination, saturation, and hue, with random jitter over time. In addition, we simulated regional linear motion blur and camera defocus blur across frames, forcing the network to focus on structural object information rather than pixel differences.
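As a simple illustration of such temporal content drift, the sketch below ramps brightness and saturation monotonically across a sequence; the torchvision helpers are standard, and the ramp ranges are illustrative.

```python
import torchvision.transforms.functional as TF


def apply_content_drift(frames, max_brightness=1.3, max_saturation=1.2):
    """frames: list of (3, H, W) tensors in [0, 1]; returns monotonically drifted copies."""
    n = len(frames)
    out = []
    for t, img in enumerate(frames):
        alpha = t / max(n - 1, 1)                      # ramps from 0 to 1 across the sequence
        img = TF.adjust_brightness(img, 1.0 + alpha * (max_brightness - 1.0))
        img = TF.adjust_saturation(img, 1.0 + alpha * (max_saturation - 1.0))
        out.append(img)
    return out
```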

Specifically, the content transformation of a sequence of images can be written as \({\mathcal {T}}_I^C(\cdot ): \{I_t\}\longmapsto \{{\widetilde{I}}^{{\mathcal {C}}}_t\}\). In a relatively simple case, \({\mathcal {T}}_I^C(\cdot )\) neither changes the locations of pixels nor introduces a new occlusion. Therefore, no extra transform is required for the corresponding pseudo-label, and the self-supervised loss of the content variation training enhancer can be expressed as

$$\begin{aligned} \ell _{cve}(t) \sim {\textbf{E}}_{{\textbf{p}}}\Big ({\textbf{C}}\big ({F^*_t}({\textbf{p}}) , F_t^{{\mathcal {C}}}({\textbf{p}})\big ) \Big ) \end{aligned}$$
(19)

Qualitative illustrations of the three transformations \({\mathcal {T}}_I^O(\cdot )\), \({\mathcal {T}}_I^S(\cdot )\), and \({\mathcal {T}}_I^C(\cdot )\) are shown in Fig. 4.

Fig. 5

Complete training pipeline based on self-supervised learning. Three dynamic training enhancers expand the original sequence into scenes in which the content, spatial position, and dynamic occlusion vary. Four forward inferences and one backward inference are performed during each iteration: the first forward and backward inferences are used for basic unsupervised training on the original scenes and for computing the confidence masks; the final three forward inferences evaluate the extended scenes and are back-propagated using a self-supervised loss function. The entire process relies on long dynamic sequences, allowing the network to understand variations and occlusions in a dynamic environment. The execution order is marked from \({\textcircled {1}}\) to \({\textcircled {6}}\)

Self-supervised distillation

During self-supervised learning, a teacher model generates pseudo-labels to supervise the predictions of a student model. However, the process can be simplified to online learning based on a single model that acts as both teacher and student. We adopt the strategy of ARFlow [23], running an extra forward pass on each transformed scene and using the transformed pseudo-label as the self-supervision signal. We first expand the original scene sequence \(\{I_t\}\) using the previously defined transformations \(\{{\mathcal {T}}_I^O(\cdot ), {\mathcal {T}}_I^S(\cdot ), {\mathcal {T}}_I^C(\cdot )\}\) to obtain the three scenes \(\{\widetilde{I^{{\mathcal {O}}}_t}\}, \{\widetilde{I^{{\mathcal {S}}}_t}\}, \{\widetilde{I^{{\mathcal {C}}}_t}\}\). Training involves four forward inferences, the first of which generates the optical flow pseudo-labels \(\{F_t^*\}\) for the original scene; the subsequent three yield the optical flows of the dynamic occlusion, spatial variation, and content variation scenes, respectively, whose pseudo-labels undergo the flow transformations \(\{{\mathcal {T}}_F^O(\cdot ), {\mathcal {T}}_F^S(\cdot ),{\mathcal {T}}_F^C(\cdot )\}\) to remain consistent with the corresponding image transformations. By combining the loss functions of Eqs. 7, 13, 18, and 19, the final loss function for self-supervised distillation can be formalized as

$$\begin{aligned} {\mathcal {L}}_{self}\sim \frac{1}{N-1} \sum _{t=1}^{N-1} \underbrace{{\mathcal {L}}_t}_{\mathrm{\tiny 1st Infer}} + \varLambda \cdot \Bigg [\underbrace{\ell _{doe}(t), \ell _{sve}(t), \ell _{cve}(t)}_{\mathrm{\tiny 2nd, 3rd, 4th Infer}}\Bigg ]. \end{aligned}$$
(20)

where \(\varLambda = [\lambda _{3},\lambda _{4}, \lambda _{5}]^T\) is the vector of weights for the self-supervised losses. By back-propagating \({\mathcal {L}}_{self}\), we achieve both basic unsupervised optical flow learning and self-supervised distillation. The overall pipeline of self-supervised distillation training is illustrated in Fig. 5. An alternative approach would apply all three enhancement techniques simultaneously to the video sequence. However, we treat these scenarios separately because combining them would be overly intricate and difficult to control. For example, the CVE might distort the color appearance of a simulated moving occluder, introducing errors into the unsupervised loss function. Furthermore, while the DOE crops the image to a smaller size to simulate boundary occlusion, both the CVE and SVE retain the original image dimensions. Handling these techniques separately also facilitates augmenting multi-scale samples during training.
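The control flow of one distillation iteration (Fig. 5, Eq. 20) can be outlined as below; every helper name is a placeholder standing for the transformations and losses defined earlier, so this is an outline of the control flow under those assumptions rather than the exact implementation.

```python
def distillation_step(model, frames, lambdas):
    # 1st inference: basic unsupervised loss plus detached pseudo-labels (teacher role)
    base_loss, pseudo_flows = sequence_loss_and_flows(model, frames)
    pseudo_flows = [f.detach() for f in pseudo_flows]          # stop-gradient S(.)

    total = base_loss
    enhancers = [
        (transform_occlusion, transform_occlusion_flow, lambdas[0], doe_seq_loss),
        (transform_spatial,   transform_spatial_flow,   lambdas[1], sve_seq_loss),
        (transform_content,   transform_content_flow,   lambdas[2], cve_seq_loss),
    ]
    for t_img, t_flow, weight, loss_fn in enhancers:
        aug_frames = t_img(frames)                             # T_I(.)
        aug_labels = t_flow(pseudo_flows)                      # T_F(.) keeps labels consistent
        aug_flows = forward_flows(model, aug_frames)           # 2nd / 3rd / 4th inference
        total = total + weight * loss_fn(aug_flows, aug_labels, aug_frames)
    return total                                               # back-propagate L_self
```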

Experiments

We performed comprehensive experiments using several well-established datasets, including MPI-Sintel [4], KITTI 2012 [8], KITTI 2015 [32], and FlyingChairs [6]. MPI-Sintel and KITTI were used to compare the performance of our model with state-of-the-art methods. The FlyingChairs dataset was used for model pre-training.

Table 1 Comparison on the Sintel benchmark
Table 2 Comparison on the KITTI benchmark
Fig. 6

(A) Visualization of results on Sintel and KITTI test benchmark. We qualitatively compared the results to those of ARFlow-MV [23] and UFlow [20], where the red box indicates the region wherein our method dominates. Quantitative comparisons are shown in Tables 1 and 2. All results were provided by the official test server; (B) visualized analysis on Sintel val set. Baseline: ULDENet with basic unsupervised learning. DTE: dynamic training enhancer. MF: multi-frame inference. The figure demonstrates that DTE significantly improves performance in dynamic and occluded regions

Implementation details

Using Eqs. 7 and 20, we set the loss weights \(\{\lambda _1, \lambda _2, \lambda _3, \lambda _4, \lambda _5\}\) to \(\{50, 0.005, 0.3, 0.3, 0.3\}\) for the Sintel dataset and \(\{75, 0.001, 0.2, 0.2, 0.2\}\) for the KITTI dataset, following previous optical flow estimation studies [23]. We implemented our method in PyTorch on a workstation with four parallel RTX A6000 GPUs running CUDA 10.1. All models were trained with the Adam optimizer [21] with \(\beta _1 = 0.9\) and \( \beta _2 = 0.99\). We pre-trained the models on FlyingChairs with a batch size of eight and a learning rate of \(1\times 10^{-4}\), followed by formal training on the Sintel or KITTI dataset. In the first stage, we used the basic unsupervised loss function with a sequence length of six, a batch size of four, and a learning rate of \(1\times 10^{-4}\) over 100 epochs. In the second stage, we performed 50 epochs of intensive training using self-supervised distillation, with a sequence length of eight per iteration, a batch size of four, and a learning rate of \(5\times 10^{-5}\) for Sintel and \(8\times 10^{-5}\) for KITTI. Gradient clipping was applied throughout training to accelerate convergence.

In terms of preprocessing, images from the Sintel dataset were cropped to \(384\times 832\) for training, and inference was performed at the original size during validation; images from the KITTI dataset were resized to \(256\times 832\) for both training and validation. We used random horizontal flipping and sequence reversal to augment the datasets. During training, we mixed the Clean, Final, and Albedo scenes of the Sintel training set. We employed the multi-view extension of the KITTI dataset for sequence training, and the flow ground truths were used only for validation. The standard average endpoint error (EPE) and the percentage of erroneous pixels (F1) served as evaluation metrics for optical flow. During validation on the Sintel dataset, we dynamically fed all frames of each scene into the model and validated the results for each frame. During validation on the KITTI dataset, we fed frames 1–11 into the model and validated the result for frame 10 only.

Table 3 Main ablation study on different combinations of the proposed components

Comparison of the Sintel and KITTI benchmarks

We compared our method with both supervised and unsupervised approaches on the Sintel Clean and Final datasets. The test results were validated by the official Sintel website. As shown in Table 1, our method achieves the highest accuracy on the test datasets while maintaining a relatively low parameter overhead. Specifically, on the Sintel test set, our results for Clean scenes surpass those of the previous OIFlow [27], registering an 18% decrease in the EPE metric with only a 48% memory overhead. Compared with the latest semi-supervised method, ALFlow [52], our model exhibits better generalization with a comparable number of parameters and similar inference speed (ours: 2.50 M, 17 ms/frame; ALFlow: 2.24 M, 16 ms/frame).

Furthermore, our method was evaluated on both the KITTI 2012 and KITTI 2015 benchmarks; the quantitative results are presented in Table 2. Compared with other unsupervised learning methods, ULDENet displays superior EPE performance. On the KITTI 2015 dataset, our approach surpasses UPFlow [29], achieving a 9% EPE error reduction (from 2.45 to 2.23) while using only 71% of the UPFlow parameters. On KITTI 2015, we also reduce the F1-all value from UFlow's [20] 11.1% to 9.1%, an 18.0% error reduction. On the KITTI 2012 training set, ULDENet reduces UPFlow's EPE of 1.27 to 1.15, a 9% error reduction.

Table 4 Ablation study on model architectures. The AEPEs in specific regions of the scene and numbers of CNN parameters are shown

Qualitative comparisons for the Sintel and KITTI test benchmarks, corresponding to the table values, are displayed in Fig. 6A. Our model's gains are marked by the yellow dashed boxes, which typically highlight regions prone to occlusion and dynamic variation. Figure 6B delves deeper into this phenomenon by individually visualizing the results of different model configurations. The baseline produces inadequate results in dynamic and occluded zones, which are typical challenges for optical flow estimation. Our proposed dynamic training enhancer (DTE) significantly alleviates such errors because it simulates multiple dynamic variation scenarios and occlusion phenomena during self-supervised training, offering the model the correct responses for navigating these challenging scenes. Furthermore, multi-frame inference refines flow predictions by incorporating additional temporal context. The final performance of our model, as shown in Tables 1 and 2, results from the synergy of the various components we introduced. The following ablation study offers a more detailed analysis and discussion.

Ablation study

An extensive ablation study was conducted. The Sintel training set was subdivided into separate training and validation sets following [23]. AEPE errors for all pixels (all), non-occluded pixels (NOC), and occluded pixels (OCC) are reported.

Main ablation

All components that we used, including the multi-frame recurrent inference structure (MFI), TSR, and the dynamic training enhancer (DTE), were comprehensively verified on the Sintel dataset. As indicated in Table 3, ‘Base’ represents the baseline, i.e., the original lightweight PWCNet operating in double-frame inference mode. MFI enables multi-frame training and inference via the spatial-temporal dual recurrent block, which greatly improves performance but increases the number of parameters by 0.4 M. After adding TSR, the model becomes more robust in occluded regions: TSR deals with occluded regions more reliably by exploiting prior motion in adjacent frames, and the EPE of OCC regions on the Clean dataset falls from 19.36 to 18.74. DTE significantly improves network performance without additional memory or computational overhead: quantitatively, the EPE improves from 2.17 to 1.75, with a 21% decrease in the EPE of OCC regions. After integrating all strategies, we ultimately improved the EPE from 2.23 to 1.67 on the Clean dataset and from 3.23 to 2.76 on the Final dataset, reducing the EPE by an average of 20%.

Network structure

Table 4 concisely compares models with different structures. For a fair comparison, we used the same loss strategy as ARFlow for all models. Our network substantially outperforms PWCNet [41] while requiring fewer parameters. Compared to the multi-frame approach ARFlow-MV [23], which extracts five adjacent frames before and after the current time, we only input frames before the current time, allowing ULDENet to be fitted to real-time tasks. Moreover, our method handles sequences of arbitrary length, unlike most static multi-frame methods such as ARFlow-MV, SelFlow [26], and MFOccFlow [17]. Quantitatively, we significantly reduce the EPE at the cost of a slight increase in memory overhead.

Fig. 7

Network EPE performance with increasing numbers of inference frames. After multi-frame dynamic training, the model converges on sequences of arbitrary length. The ‘sequence’ entry in the table indicates that the network uses all frames of the scene during inference. Model performance gradually improves as the number of inference frames increases

Although our model is trained in a dynamic environment with multiple frames, it still generalizes well to two-frame inference after convergence. Specifically, we split each scene of the Sintel dataset into sub-sequences of different lengths to verify the accuracy with respect to the number of inference frames; the results are shown in Fig. 7. The EPE gradually decreases as the number of inference frames increases, and multi-frame inference yields larger gains on the Final dataset with its more complicated scenes, suggesting that ULDENet is well suited to complex and dynamic real-time environments.

We experimentally evaluated runtime performance via inference over 100 consecutive frames (512 \(\times \) 512). The real-time processing speed was only 17 ms per frame, more than two times faster than the well-known RAFT model (39 ms). During inference, our model recurrently iterates over new frames; as the number of frames increases, the total processing time grows linearly, but recursive inference also produces correspondingly more frames of optical flow, so the average processing time per frame remains constant. This makes our model appropriate for processing continuous video streams, such as the Sintel dataset, without additional time. With an inference speed of over 50 FPS, our model is well suited to dynamic real-time scenarios.

Temporal smoothness regularization

Table 5 Ablation study on the weight of temporal smoothness loss \(W_{tsm}\)
Table 6 Ablation study using different combinations of the three training enhancers

Using a grid search, we verified the effect of TSR on optical flow estimation under different weights, as summarized in Table 5. We employed the ULDENet structure with the basic unsupervised loss strategy (without DTE) as the baseline. Both the KITTI 2015 and Sintel datasets were included in the validation. A large TSR weight degrades network performance, so the weight should be kept small. Compared to NOC regions, TSR achieves more significant gains in OCC regions because the photometric loss already governs the NOC regions robustly, whereas OCC regions require both temporal and spatial smoothness regularization. Based on Table 5, the TSR weight was set to 0.01 for the KITTI dataset and 0.05 for the Sintel dataset.

Dynamic training enhancer

During self-supervised learning, the DTE improves generalization to the three specific scene types and thus greatly enhances model performance. We separately validated the performance of the DOE, SVE, and CVE, as well as combinations thereof.

Table 6 summarizes the DTE ablation studies on the Sintel Clean, Sintel Final, KITTI 2012, and KITTI 2015 datasets, where ‘Baseline’ denotes ULDENet with the basic unsupervised loss combination and TSR. Quantitatively, the DOE alone reduces the EPE from 2.17 to 1.86, providing the largest single improvement of 14.3%, while the SVE and CVE yield 10% and 11.5% performance gains, respectively. Combining the three reduces the EPE from 2.17 to 1.75, a 20% error reduction.

Evaluating the OCC and NOC regions separately shows that the DOE most effectively refines predictions in OCC zones, trimming the EPE from 19.36 to 16.25 on the Sintel Clean dataset. The CVE also slightly improves accuracy in OCC regions; this can be attributed to the fact that dynamic noise and regional blurring resemble the information loss caused by occlusion. In the regular NOC regions, both the CVE and SVE reduce prediction errors.

Dynamic variation is compared with static image augmentation in Table 7, where CT and ST apply identical content and spatial image transformations across all frames of a sequence, including image flipping and Gaussian blurring similar to standard data augmentation [23]. Stc Occ refers to image cropping and random-noise occlusion of image subregions with the original unsupervised losses [26]. All hyperparameters were consistent with the dynamic scenario. Dynamic variation and occlusion clearly achieve higher performance than static image augmentation; our dynamic occlusion simulation with the mixed occlusion loss (Dyc Occ) outperforms Stc Occ by 10%.

Table 7 Ablation study comparing dynamic scene variation to static data augmentation

Conclusion and limitations

In this research, we applied unsupervised optical flow estimation to long-term dynamic environments using multi-frame dynamic training strategies. Key contributions include a lightweight model with a spatial-temporal dual recurrent structure, whose effectiveness is demonstrated in ablation studies. In addition, we introduced three temporal dynamic training enhancers, the DOE, CVE, and SVE, which jointly enhance performance by about 20% without adding computational or memory overhead. Furthermore, TSR was introduced to address occlusion, providing reliable references for occluded areas from adjacent frames. Combining these techniques, our lightweight model rivals state-of-the-art results across multiple standard benchmarks, highlighting the efficiency of temporal dynamic modeling for unsupervised optical flow estimation.

This study has several limitations that also highlight promising future directions: (1) The dynamic training enhancer introduces online scene transformations during training, which increases data processing time; each training iteration requires five inferences, making the training phase roughly five times longer than the baseline. (2) The proposed TSR relies on motion invariance within a short time window, so its advantages may diminish in low-sampling-rate videos or fast-motion scenes. However, this opens research avenues: high-frame-rate cameras could shrink the sampling window, potentially boosting TSR's effectiveness, which is especially promising given that high-speed cameras are more accessible than optical flow ground truths in natural settings. (3) While we simulate the continuous movement of artificial occluders using the Markov property, these super-pixel-shaped image fragments do not fully capture natural objects. Future work could use instance segmentation networks to extract entire objects from scenes and simulate their motion more naturally.