Breaking Time Invariance: Assorted-Time Normalization for RNNs

Methods such as Layer Normalization (LN) and Batch Normalization (BN) have proven to be effective in improving the training of Recurrent Neural Networks (RNNs). However, existing methods normalize using only the instantaneous information at one particular time step, and the result of the normalization is a preactivation state with a time-independent distribution. This implementation fails to account for certain temporal differences inherent in the inputs and the architecture of RNNs. Since these networks share weights across time steps, it may also be desirable to account for the connections between time steps in the normalization scheme. In this paper, we propose a normalization method called Assorted-Time Normalization (ATN), which preserves information from multiple consecutive time steps and normalizes using them. This setup allows us to introduce longer time dependencies into the traditional normalization methods without introducing any new trainable parameters. We present theoretical derivations for the gradient propagation and prove the weight scaling invariance property. Our experiments applying ATN to LN demonstrate consistent improvement on various tasks, such as Adding, Copying, and Denoise Problems and Language Modeling Problems.


Introduction
The Recurrent Neural Network (RNN) [1,2], and variants such as Long Short Term Memory (LSTM) [3] or Gated Recurrent Unit (GRU) [4,5,6], are some of the core architectures used for modeling time-series data in Deep Learning today. While LSTMs and GRUs are effective in avoiding problems with vanishing gradients, all of these recurrent models are still subject to issues with exploding gradients, as well as over-fitting. One of the most successful ideas introduced over the years is the normalization of RNNs using methods such as Layer Normalization (LN) [7] and Batch Normalization (BN) [8,9]. These methods recenter and rescale the preactivation information using the statistics of that time step. This allows the norm of the model's states and gradients to be controlled, which speeds up training and prevents exploding gradients.
While these normalization methods have been successful, their applications to RNNs do not involve adaptations to some of the primary characteristics of this class of models, namely that the variation across time imparts usable information. For example, the LN or BN models are invariant to the scaling of the input at any time step and are therefore independent of the norm of the input vector at each time step. Depending on the application, this may have devastating consequences. Additionally, LN and BN produce a preactivation state with a distribution that is invariant across time. Such time invariance properties may impede the ability of the RNN architecture to fully exploit the temporal dependencies. Since RNNs share weights across time steps, it would be quite natural to introduce this dependency into the normalization method as well. An attempted version of this involving averaging statistics across time was mentioned in [9] but was unsuccessful and was presented without much detail. It appears that simply averaging over every time step is an overcorrection that makes the statistics susceptible to diluted averages and loses effectiveness further into the sequence. Instead, we argue that, by collecting the mean and variance across a smaller subsequence, one is able to gain the benefits of these time dependencies without overly weakening the impact of a single time step.
In this paper, we propose a normalization method called Assorted-Time Normalization (ATN), which preserves information from multiple consecutive time steps and normalizes using them. Our ATN method can be combined with other normalization methods such as LN and BN that normalize input information along some dimensions but not time. It maintains a short-term memory of the previous k time steps, which allows it to account for the temporal dependencies in a way that previous methods could not. We use that memory to calculate the statistics with respect to which we normalize, giving us an output that has a controlled mean and variance while still being capable of changing between time steps. By using just a limited subsequence at each point in time, we are able to avoid the problems that come from using all or none of the sequence and find the length best suited to the dataset. Since this process just adds a time component to the normalization method, it adapts the normalization without introducing any new learnable parameters.
We present theoretical derivations for the gradient propagation and prove the weight scaling invariance property. Our experiments demonstrate consistent improvement using our method on a variety of tasks, such as Adding, Copying, and Denoise Problems as well as Language Modeling Problems. Our code is available at https://github.com/vasily789/atn.

Related Work
One of the earliest attempts to use some sort of normalization technique throughout model layers was Batch Normalization (BN) [8]. It was proposed for Fully Connected (FC) and Convolutional (CNN) Neural Networks for normalization of network activations across the batch dimension. BN is often known to provide a more stable and accelerated training regimen while improving generalization. The Instance Normalization (IN) [10] method, contrary to BN, acts like contrast normalization and has primarily been used for image-containing datasets. The paper points out that the output stylized images should not rely on the contrast of the input image content, and hence normalizing the instances helps. The Group Normalization (GN) [11] method, which is primarily used for CNNs, normalizes a 3D feature in a convolutional layer by dividing its channels into groups and then normalizing the features in the group in all three dimensions.
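Consider the typical structure of an RNN, also known as an RNN cell:

$$a^{(t)} = W_x x^{(t)} + W_h h^{(t-1)} + b, \tag{1}$$
$$h^{(t)} = f\left(a^{(t)}\right), \tag{2}$$

where $f$ is a nonlinear activation function.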
The Recurrent Batch Normalization [9] method applies BN to the hidden-to-hidden and memory cell parts of the LSTM model, which aims to reduce the internal covariate shift between consecutive time steps. A Weight Normalization (WN) method was proposed in [12]. Its idea lies in decoupling the magnitude from the direction of the weight vector to change the parameters of the network, which helps with speeding up learning. Unfortunately, WN does not appear to be widely used in practice due to its limited stability compared to BN [13].
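Layer Normalization (LN) was proposed in [7] to normalize activations along the hidden dimension for both FC networks and RNNs and has since become very popular in RNNs. LN normalizes the preactivation state as follows:

$$a^{(t)} = \mathrm{LN}\left(W_x x^{(t)}\right) + \mathrm{LN}\left(W_h h^{(t-1)}\right) + b, \tag{3}$$
$$h^{(t)} = f\left(a^{(t)}\right), \tag{4}$$

where the LN operator is defined, for a vector $a \in \mathbb{R}^n$, by

$$\mu = \frac{1}{n}\sum_{i=1}^{n} a_i, \qquad \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(a_i - \mu\right)^2}, \tag{5}$$
$$\mathrm{LN}(a) = \gamma \odot \frac{a - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta. \tag{6}$$

Such a setup helps to get rid of the BN batch dependency and simplifies the application to RNNs.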
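As an illustration of the LN operator described above, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes scalar `gamma` and `beta` for brevity (in practice they are trainable vectors), and the function name `layer_norm` is chosen here for illustration only. It also shows the input-scaling invariance discussed in the Introduction: multiplying the input by a constant leaves the output essentially unchanged.

```python
import math

def layer_norm(a, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one preactivation vector using its own mean and variance."""
    n = len(a)
    mu = sum(a) / n
    var = sum((x - mu) ** 2 for x in a) / n
    # eps is added to the variance to avoid division by zero
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in a]

a = [0.5, -1.0, 2.0, 0.1]
y = layer_norm(a)
# Scaling the input at this time step by 10 gives (up to the effect of eps) the same output
y_scaled = layer_norm([10.0 * x for x in a])
```

Because the statistics are recomputed from the current vector alone, any information carried by the norm of `a` is removed at every time step.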
More recently, Adaptive Normalization (AdaNorm) [14] made a thorough analysis of LN, and concluded that the rescaling and recentering factors, γ and β in (6), are not as essential as the backward gradients of the mean and variance inside of the LN method. In addition, they proposed a new method, AdaNorm, which replaces weight and bias with some new transformation function.

Assorted Time Normalization
One undesirable property of the adaptation of LN to RNNs is that the statistics for the normalization are calculated at each time step, resulting in a post-normalization state which has mean and variance that are invariant across time. This prevents the model from effectively representing the shifting distributions across time that might be critical in modeling sequential data. For example, the normalization $\mathrm{LN}(W_x x^{(t)})$ in (3) is invariant to scaling in $x^{(t)}$, which restricts the model from learning the changing norm of $x^{(t)}$. This may be mitigated by including a bias in the linear term, which is often used in implementations; see section 5.1 for more discussion. Most of the above discussions also apply to BN.

We propose a new normalization method to break this time invariance. Consider a sequence $a = \{a^{(t)}\} \subset \mathbb{R}^n$ produced in an RNN, such as the preactivation state that we wish to normalize. At time step $t$ of the RNN, we maintain a memory of the previous $k$ entries, $a^{(t)}_k = \{a^{(t-k+1)}, \ldots, a^{(t-1)}, a^{(t)}\} \subset a$, in the normalization layer, and use this extended set to compute the mean and variance used for normalization. This can be combined with other normalization methods. Combining with Layer Normalization, for example, these statistics are calculated at time step $t$ as follows:

$$\mu_{t,k} = \frac{1}{kn}\sum_{j=t-k+1}^{t}\sum_{i=1}^{n} a_i^{(j)}, \qquad \sigma_{t,k} = \sqrt{\frac{1}{kn}\sum_{j=t-k+1}^{t}\sum_{i=1}^{n}\left(a_i^{(j)} - \mu_{t,k}\right)^2}.$$

For $t < k$, only the $t$ time steps available so far are used ($k$ is replaced by $t$ in the expressions above), as illustrated in Figure 1. Once these statistics are calculated, we then normalize only the current term $a^{(t)}$ and optionally recenter and rescale using $\gamma$ and $\beta$, two trainable parameters shared across time, while adding a small $\epsilon$ to the variance to prevent division by zero, similar to the LN method in (6):

$$\mathrm{ATN}\left(a^{(t)}\right) = \gamma \odot \frac{a^{(t)} - \mu_{t,k}}{\sqrt{\sigma_{t,k}^2 + \epsilon}} + \beta.$$
This differs from the process in (5) and (6) in that we include multiple time steps in our statistic calculations, giving us a double sum instead of the single one for LN. This definition of the statistics is more stable in time, at least for large k, changing modestly at each time step with only one term in the set being replaced. This results in a normalized output that is not expected to have a uniform mean and variance across time steps. We argue that this is desirable for sequential problems. Having this potential for variation allows the model to account for changing norms of the inputs across the sequence, providing additional information about the distribution that is lost with previous methods.
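The sliding-window statistics described above can be sketched as follows. This is a minimal illustration in plain Python rather than the authors' code: the class name `AssortedTimeNorm` is hypothetical, the trainable γ and β are omitted, and the window simply holds fewer than k vectors during the first k − 1 steps.

```python
import math
from collections import deque

class AssortedTimeNorm:
    """Normalize the current vector with statistics over the last k vectors."""

    def __init__(self, k, eps=1e-5):
        self.window = deque(maxlen=k)  # short-term memory of at most k steps
        self.eps = eps

    def __call__(self, a):
        self.window.append(list(a))
        # Double sum: over the stored time steps and over the hidden dimension
        values = [x for step in self.window for x in step]
        mu = sum(values) / len(values)
        var = sum((x - mu) ** 2 for x in values) / len(values)
        # Only the current vector is normalized
        return [(x - mu) / math.sqrt(var + self.eps) for x in a]

def mean(v):
    return sum(v) / len(v)

sequence = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.5, 0.5, 2.0], [3.0, 1.0, 5.0]]

atn = AssortedTimeNorm(k=3)
atn_means = [mean(atn(a)) for a in sequence]

ln = AssortedTimeNorm(k=1)  # k = 1 reduces to per-step LN statistics
ln_means = [mean(ln(a)) for a in sequence]
```

With k = 1 every output has zero mean, whereas with k = 3 the mean of the normalized vector changes from step to step, which is the time variation the method is designed to keep.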
We may consider using all previous terms in the sequence to compute the statistics, but they will have more variation in early time steps than in later ones.By keeping only k time steps and not the entire sequence, the statistics will vary gradually across time, and we are able to fix the memory and computational costs, which could be significant for long sequences.
Using the information from multiple time steps also effectively provides a larger set on which to calculate statistics. This allows a more accurate approximation to gain a clearer glimpse at the underlying distribution of the dataset. In other words, ATN uses statistics over a larger set that is more stable across time so that the normalized state can retain more variations in time. In contrast, the traditional normalization methods use high-frequency statistics at each time step to produce a normalized state that becomes time-invariant. In particular, the ATN network depends on the scaling of the input vector at a time step while LN and BN do not. However, ATN preserves the desirable weight scaling invariance property, which we show as follows. Let $H$ and $\tilde{H}$ be weight matrices for two sets of model parameters, $\theta$ and $\tilde{\theta}$ respectively, which differ by a scaling factor of $\delta$, i.e. $\tilde{H} = \delta H$. Then the outputs of ATN are the same:

$$\mathrm{ATN}\left(\tilde{a}^{(t)}\right) = \gamma \odot \frac{\tilde{a}^{(t)} - \tilde{\mu}_{t,k}}{\tilde{\sigma}_{t,k}} + \beta = \gamma \odot \frac{\delta a^{(t)} - \delta \mu_{t,k}}{\delta \sigma_{t,k}} + \beta = \mathrm{ATN}\left(a^{(t)}\right),$$

where $a^{(t)}$ and $\tilde{a}^{(t)} = \delta a^{(t)}$ denote the preactivations computed with $H$ and $\tilde{H}$ (with $\epsilon$ omitted), $\tilde{\sigma}_{t,k} = \delta \sigma_{t,k}$ and $\tilde{\mu}_{t,k} = \delta \mu_{t,k}$. This invariance property makes the ATN network independent of the norm of $H$, mitigating the exploding/vanishing gradient problems.
It is also easy to see that ATN is invariant to the rescaling of the whole input sequence but not invariant under the rescaling of an individual element in the sequence.
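These two statements can be checked numerically. The sketch below is an illustration under simplifying assumptions (ε = 0, no γ or β, a positive scaling factor); the helper names `atn_sequence` and `max_abs_diff` are introduced here and are not part of the original work.

```python
import math

def atn_sequence(seq, k, eps=0.0):
    """Normalize each vector in seq with statistics over the last k vectors."""
    out = []
    for t, a in enumerate(seq):
        window = seq[max(0, t - k + 1): t + 1]
        values = [x for step in window for x in step]
        mu = sum(values) / len(values)
        sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / len(values) + eps)
        out.append([(x - mu) / sigma for x in a])
    return out

def max_abs_diff(u, v):
    """Largest elementwise difference between two sequences of vectors."""
    return max(abs(x - y) for p, q in zip(u, v) for x, y in zip(p, q))

seq = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.5, 0.5, 2.0]]
delta = 3.0
whole = [[delta * x for x in a] for a in seq]               # rescale every step
single = seq[:-1] + [[delta * x for x in seq[-1]]]          # rescale the last step only

diff_whole_atn = max_abs_diff(atn_sequence(seq, 2), atn_sequence(whole, 2))
diff_single_atn = max_abs_diff(atn_sequence(seq, 2), atn_sequence(single, 2))
diff_single_ln = max_abs_diff(atn_sequence(seq, 1), atn_sequence(single, 1))
```

Here `diff_whole_atn` and `diff_single_ln` are zero up to rounding, while `diff_single_atn` is clearly nonzero: with k = 2 the normalized output changes when only one element of the sequence is rescaled.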
During training, we backpropagate the gradients with respect to the model parameters. With ATN, a key step is to propagate the gradient through the normalization layer, i.e., to compute the derivatives of the normalized output $y$ with respect to the entries of the input sequence $a$. The following proposition gives the formulas for computing these derivatives. The proof is provided in Appendix A.
Proposition 1. Consider ATN for a sequence $a = \{a^{(t)}\} \subset \mathbb{R}^n$ produced in an RNN and let $y^{(t)} = \mathrm{ATN}(a^{(t)})$ [...] we have: [...] where [...] Note that the computations of $\partial y$ [...]

In our experiments, we will use ATN on LSTM networks. Following [7] and [9], our ATN method for LSTM is as follows: [...] (15) where $\odot$ is the Hadamard product and $\sigma(\cdot)$ is the sigmoid function.

Experiments
We have performed a series of experiments which include the Copying [15], Adding [15], and Denoise problems [5,16] as well as Language Modeling on character level Penn Treebank dataset [17] and word level WikiText-2 dataset [18].
All experiments were run using Python 3.7.0, PyTorch 1.1.0, and CUDA 9.0 on a single NVIDIA Tesla V100 GPU.

Synthetic Tasks

Copying
The copying problem is a common synthetic task that is used to test RNNs, which was originally proposed in [15]. For this problem, a string of 10 digits sampled uniformly from the integers between 1 and 8 is fed into the RNN. This is followed by a sequence of T zeros and then a 9, which marks the start of a string of 9 zeros, for a total length of T + 20. The objective of the task is to output the initial string of 10 digits beginning at the marker's location, copying the initial string from the front to the back. Cross-entropy loss is used to evaluate this model, with a baseline expected cross-entropy of $\frac{10 \log(8)}{T + 20}$, which represents selecting digits 1-8 at random after the 9.
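The task layout and baseline described above can be sketched as follows. This is an illustrative generator written from the description in the text, not the authors' data pipeline; the function names and the zero-padding of the target before the marker position are assumptions.

```python
import math
import random

def copying_example(T, rng=random):
    """Build one (input, target) pair of length T + 20 for the copying task."""
    digits = [rng.randint(1, 8) for _ in range(10)]   # string to remember
    inputs = digits + [0] * T + [9] + [0] * 9          # 9 is the marker
    targets = [0] * (T + 10) + digits                  # copy starts at the marker
    return inputs, targets

def copying_baseline(T):
    """Expected cross-entropy of guessing digits 1-8 uniformly after the marker."""
    return 10 * math.log(8) / (T + 20)

inputs, targets = copying_example(100, random.Random(0))
baseline_100 = copying_baseline(100)   # approximately 0.173
baseline_200 = copying_baseline(200)   # approximately 0.095
```

A model whose loss stays near `copying_baseline(T)` has learned when to emit digits but not which digits to emit.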
Implementation Details: The models were trained with a batch size of 128, a single LSTM layer with a hidden size of 68, an RMSProp [19] optimizer with a learning rate of $10^{-4}$, and T values of 100 and 200. The ATN model is implemented with k = 45 for both T values.
Results: For each of the sequence lengths tested, the plain LSTM is incapable of achieving losses below the baseline. While the LN-LSTM is able to do so to some extent on the T = 100 version (see Figure 2a), it also gets stuck at the baseline loss on the T = 200 task (Figure 2b). For both of these tasks, our ATN-LSTM model demonstrates eventual losses below those reached by the LN-LSTM model (Figures 2a and 2b). We also note that the initial rate of convergence is at least as steep if not steeper than that of the LN-LSTM model, demonstrating that the ATN-LSTM has a positive contribution to training in both the short and long term.

Adding
The adding problem is another synthetic task for RNNs proposed in [15].Our implementation of this problem is a variation of the original problem.The RNN takes a 2-dimensional input of length T. The first dimension consists of a sequence of zeros except for two ones placed randomly in the first and second half of the sequence.The second dimension is a sequence of numbers selected uniformly from [0, 1).The goal of the task is to take the numbers from the second dimension in positions corresponding to the ones and to output their sum.
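A single training example for this variant of the adding problem could be generated as in the sketch below. It is written from the description above and is not the authors' code; the function name `adding_example` and the representation of each time step as a `(marker, value)` pair are assumptions.

```python
import random

def adding_example(T, rng=random):
    """Build one adding-problem example: a length-T sequence of (marker, value) pairs."""
    half = T // 2
    i = rng.randrange(0, half)    # position of the one in the first half
    j = rng.randrange(half, T)    # position of the one in the second half
    markers = [0.0] * T
    markers[i] = markers[j] = 1.0
    values = [rng.random() for _ in range(T)]  # uniform on [0, 1)
    inputs = list(zip(markers, values))
    target = values[i] + values[j]
    return inputs, target

inputs, target = adding_example(100, random.Random(0))
```

The model reads the pairs one step at a time and is trained to regress `target` from the final hidden state.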
Implementation Details: The models were trained with a batch size of 50, a single LSTM layer with a hidden size of 60, and an RMSprop [19] optimizer with a learning rate of $10^{-3}$. We use T values of 100 and 200. This task is evaluated with a mean-squared error. The ATN model is implemented with k values of 25 for T = 100 and 5 for T = 200.
Results: Our model shows consistent improvement over the LSTM and LN-LSTM models. For each example, the ATN shows a rapid initial convergence before settling into a slower rate which is roughly parallel to that of the LN-LSTM. In Figure 3a, this initial convergence almost manages to take the model to the same loss as is achieved by the LN-LSTM after the entirety of the training. In Figure 3b, the LN-LSTM is able to separate itself further from the LSTM than in Figure 3a but is still at a higher loss than the ATN for all but the very beginning of training.

Denoise Task
The Denoise Task [5,16] is another synthetic problem that requires filtering the noise out of a noisy sequence. This problem requires the forgetting ability of the network as well as learning long-term dependencies coming from the data [5]. The input sequence of length T contains 10 randomly located data points, and the other T − 10 points are considered noise data. These 10 points are selected from a dictionary $\{a_i\}_{i=0}^{n+1}$, where the first n elements are data points, and the other two are the "noise" and the "marker" respectively. The output data consists of the list of the data points from the input, and it should be outputted as soon as the network receives the "marker". The goal is to filter out the noise and output the random 10 data points chosen from the input.
Implementation Details: The models were trained using a batch size of 128, a single LSTM layer with a hidden size of 100, and an Adam [20] optimizer with a learning rate of $10^{-2}$. We use T values of 100 and 200. The ATN model is implemented with k values of 20 and 60 for T = 100 and T = 200 respectively.
Results: For both sequence lengths, our models outperform the LSTM and the LN-LSTM throughout training.While the LN-LSTM model can surpass the baseline set by the LSTM, it does so later than the ATN model, and its convergence curve flattens out at a higher loss than the ATN model.

Language Models
Language modeling is one of many natural language processing tasks. It is the development of probabilistic models that are capable of predicting the next word or character in a sequence using information that has preceded it. For both of the Language Modeling problems, we based our experiments on the AWD-LSTM model [21].

Character Level Penn Treebank
The models were tested on their suitability for language modeling tasks using the character level Penn Treebank dataset [17] also known as character-PTB or simply cPTB dataset.This dataset is a collection of English-language Wall Street Journal articles.The dataset consists of a vocabulary of 10,000 words with other words replaced as <unk>, resulting in approximately 6 million characters that are divided into 5.1 million, 400 thousand, and 450 thousand character sets for training, validation, and testing, respectively with a character alphabet size of 50.The goal of the character-level Language Modeling task is to predict the next character given the preceding sequence of characters.
Implementation Details: For this task, we partitioned the training sequence into 220 character length subsequences. The models were trained using a batch size of 32, a single LSTM layer with a hidden size of 1,000, an Adam [20] optimizer with a learning rate of 10 [...]
Results: Our model shows improvement over the LSTM and the LN-LSTM models; the comparison results are presented in Table 4.

WikiText-2
The WikiText-2 dataset was introduced in [18].It is approximately two times the size of the Penn Treebank dataset and contains preprocessed Wikipedia articles while maintaining the original structure, punctuation, and symbols.The WikiText-2 dataset consists of approximately 2.2 million words: 2 million for the training set and 200 thousand for the validation and test sets, with a vocabulary size of 33,278.This task is a word-level Language Modeling problem with the goal to predict the next word given the preceding sequence of words.
Implementation Details: We used a batch size of 32; three LSTM layers with embedding and hidden sizes of 400 and 1,150, respectively; a BPTT length of 70; gradient clipping at a norm of 0.25; and a learning rate of 30 with the Stochastic Gradient Descent (SGD) optimizer without any momentum or learning rate decay. Training switches to the ASGD [22] optimizer using the nonmono criterion from [21] with value 5 (in our experiments, the switch happened approximately between epochs 20 and 30 for all models: LSTM, LN, and ATN). The ATN model is implemented with a k value of 25.
Results: In this experiment, the ATN method shows improvement over the LSTM and LN methods in both training and validation perplexity (PPL), see Table 5.
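For reference, perplexity is the exponential of the average per-token negative log-likelihood. The sketch below is a generic illustration of that definition, assuming natural-log probabilities; the function name `perplexity` is introduced here and is not taken from the authors' code.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from the natural-log probabilities assigned to the true tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)  # mean cross-entropy in nats
    return math.exp(nll)

# A model that assigns probability 1/4 to every correct token has perplexity 4
uniform_ppl = perplexity([math.log(0.25)] * 8)
```

Lower perplexity therefore corresponds directly to lower cross-entropy loss on the same data.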

Input statistic invariance across time
In most implementations of LN-LSTM, including the one used in the experiments above, the inputs to the normalization method are the results of a linear layer, including both weight and bias.This differs slightly from the model proposed in [7] in that their version placed the LSTM bias outside of the normalization.Using that original architecture, we can clearly demonstrate the underlying problem with Layer Normalization that we aim to solve, the loss of input information, by setting the statistics to constant values across time.
To show this, we use the MNIST dataset [23] after applying Gaussian noise with variance 0.1, for the pixel-by-pixel task [24].This task takes the pixel values of a handwritten digit and inputs them as an unpermuted sequence of length 784 in order to predict the digit class.Due to the high probability of pixels having near zero values, we needed to use ε values of 1 in both normalization schemes.With this task, we can see in Figure 5 that the use of Layer Normalization renders the model completely incapable of training.Because LN takes the information from each pixel and normalizes it to the exact same distribution, it erases everything the model could use to learn, making it no better than guessing.The ATN method with k = 10 solves this problem by its use of multiple time steps in calculating the mean and variance, meaning that the normalized outputs will not all have identical statistics.This change allows ATN to perform quite well, even when Layer Normalization cannot.

Post Normalization Statistics
In Figure 6, we present the statistics of the post normalization components from a single iteration of training for the Adding Problem [15] described in Section 4.1.2 with T = 75. We present the statistics from four different models: an LN-LSTM and three ATN(k) models with k values of 5, 25, and 55. None of the models included trainable bias and gain parameters inside the normalization methods.
In Figure 6a, we show the mean and variance after normalization of the product of the hidden-to-hidden weight and the hidden state, $W_h h^{(t-1)}$. While Layer Normalization produces constant mean and variance, the ATN method allows for the statistics to vary at each time step, resulting in curves that do not differ too much from those for LN in terms of scale but do demonstrate the natural fluctuations in the hidden states. From this, we can see that we are achieving the combination of a controlled output that is still capable of reflecting the temporal changes of the network.
In Figure 6b, we show the statistics from the product of the input-to-hidden weight and the input, $W_x x^{(t)}$. The ATN model provides highly variable means and variances, showcasing the amount of information about the dataset which is lost when LN resets the statistics to these constant values.
In Figure 6c, we show the post normalization statistics of the memory cell, $c^{(t)}$. These statistics clearly demonstrate the effect of a shorter k value as opposed to a longer one in the mean. In the early iterations for the k = 5 model, the mean has a larger spike which flattens to a bit above zero by the end of the iteration. For the larger k values, this initially increased mean gets maintained throughout a larger portion of the iteration, causing the lower values further along to have less influence on the statistics.

Optimal k Value for ATN method
To highlight the importance of normalizing with respect to k time steps instead of just one or all of them, we present a study on various k values.In Figure 7, we present results on the Copying Problem [15] described in Section 4.1.1 with T = 100.For this experiment, we have trained LSTM, LN, and three ATN(k) models with values of k being 25, 45, and 65 under the same conditions.
All ATN models perform better than both LSTM and LN. The ATN(k = 45) model performs better than ATN(k = 25), which should not be a surprise since the larger k value means we are normalizing with respect to a larger set and getting better statistics for the mean and variance. However, ATN(k = 65) performs worse than ATN(k = 45) and even worse than ATN(k = 25), which suggests that too large a k may actually degrade the result. This may be due to numerical difficulties in propagating derivatives through k steps in ATN for a large k.

Conclusion
In this paper, we have introduced a method for adapting statistics-based normalization methods to recurrent neural networks to break the time invariance of the traditional normalization methods. We have presented theoretical results on the impact this method has on the model's gradients, as well as showing the preservation of invariance to the rescaling of the weight matrix. Our experiments demonstrate that our ATN-LSTM improves over LN for LSTM in both training and testing results. In light of the popularity of LN in practical applications, our method offers an important alternative for further improving RNN performance.

[...] re-scaling consist of changing every input example by multiplying or adding a constant. Single training case re-scaling is when the dataset adjustments are applied to just one example. Of particular interest is the invariance with respect to the scaling of an input at a single time point, which was referenced in Section 3. This is one of the invariance properties that LN has and its ATN adaptation does not, and we argue that this is one of the reasons that our method improves on LN.

Figure 1: Illustration of the ATN method combined with LN using k = 3 time steps. Consider the preactivation state tensor in $\mathbb{R}^{5 \times 3 \times 3}$. At t = 1, we use a standard LN; at t = 2, we normalize using information from time steps 1, 2; at t = 3, the ATN method uses information from time steps t = 1, 2, 3; at t = 4, we normalize with respect to time steps t = 2, 3, 4; and so on after that.

Figure 3: Results on the Adding problem for T = 100 and T = 200.

Figure 4: Results on the Denoise task for T = 100 and T = 200.

[...] at 3, and learning rate decay by a factor of 10 at epochs 80 and 90. The ATN model is implemented with a k value of 10.

Figure 6: Post Normalization Statistics for Adding Problem with T = 75

Figure 7: k value study in ATN method

Table 1: Copying Results: Attained minimum values. ↓ denotes that smaller is better.

Table 2: Adding Results: Attained minimum values. ↓ denotes that smaller is better.

Table 4: Character Level Penn Treebank Results: Attained minimum values. ↓ denotes that smaller is better.