Deep Distributional Sequence Embeddings Based on a Wasserstein Loss

Deep metric learning employs deep neural networks to embed instances into a metric space such that distances between instances of the same class are small and distances between instances from different classes are large. In most existing deep metric learning techniques, the embedding of an instance is given by a feature vector produced by a deep neural network and Euclidean distance or cosine similarity defines distances between these vectors. This paper studies deep distributional embeddings of sequences, where the embedding of a sequence is given by the distribution of learned deep features across the sequence. The motivation for this is to better capture statistical information about the distribution of patterns within the sequence in the embedding. When embeddings are distributions rather than vectors, measuring distances between embeddings involves comparing their respective distributions. The paper therefore proposes a distance metric based on Wasserstein distances between the distributions and a corresponding loss function for metric learning, which leads to a novel end-to-end trainable embedding model. We empirically observe that distributional embeddings outperform standard vector embeddings and that training with the proposed Wasserstein metric outperforms training with other distance functions.


Introduction
Metric learning is concerned with learning a representation or embedding in which distances between instances of the same class are small and distances between instances of different classes are large.Deep metric learning approaches, in which the learned embedding is given by a deep neural network, have achieved state-of-the-art results in many tasks, including face verification and recognition (Schroff et al, 2015), fine-grained image classification (Reed et al, 2016), zero-shot classification (Bucher et al, 2016), speech-to-text problems (Gibiansky et al, 2017), and speaker identification (Li et al, 2017).An advantage of metric learning is that the resulting representation directly generalizes to unseen classes, so the model does not need to be retrained every time a new class is introduced.This is, for example, a typical requirement in biometric applications, where it should be possible to register new subjects without retraining a model.Biometric systems also have to handle imposters, that is, subjects who are not registered in the database, which is not straightforward in standard classification settings.
In this paper, we study deep metric learning for sequence data, with a specific focus on biometric problems.Building on earlier work on quantile layers (Abdelwahab and Landwehr, 2019), we specifically study how the distribution of learned deep features across a sequence can be represented in the learned embedding.Quantile layers are statistical aggregation layers that characterize the distribution of patterns within a sequence by approximating the quantile function of the activations of the learned filters across the sequence.Characterizing this distribution has been shown to be advantageous for biometric identification based on eye movement patterns (Abdelwahab and Landwehr, 2019).The main contribution of this paper is to develop a deep metric learning approach for distributional embeddings based on quantile layers.Quantile layers return an estimate of the distribution of values for each learned filter across the sequence.Instead of a fixed-length vector representation of an instance, in our approach, the embedding of an instance is given by these sets of distributions.When embeddings are distributions rather than simple vectors, measuring distances between the embeddings involves comparing their respective distributions.We propose a distance metric in the embedding space that is based on Wasserstein distances between the respective distributions.Compared to other distance functions such as Kulback-Leibler or Jensen-Shannon divergence, the advantage of using Wasserstein distance is that it takes into account the metric on the space in which the random variable of interest is defined.In our case, this means that distributions in which similar magnitudes of filter activations receive similar amounts of probability mass will be considered close.We show how such embeddings can be trained end-to-end on labeled training data using metric learning techniques.
Empirically, we study the proposed approach in biometric identification problems involving eye movement, accelerometer, and EEG data.Empirical results show that the proposed distributional sequence embeddings outperform standard vector embeddings and that training with the Wasserstein metric outperforms training with other distance functions.
The rest of the paper is organized as follows.Section 2 discusses related work.In Section 3 we review quantile layers and develop a distributional embedding architecture based on these layers.Section 4 introduces a Wasserstein-based distance metric for the proposed embedding model and from this derives a novel loss function for metric learning.In Section 5 we empirically study the proposed method and baselines.

Related work
Our work is motivated by the goal of capturing information about the distribution of patterns within a sequence in its embedding, where the patterns are defined in terms of learned features of a deep neural network.It is related to other work in deep learning that aims to capture distributions of learned features using statistical aggregation layers.Wang et al (2016) proposed endto-end learnable histogram layers that approximate the distribution of learned features by a histogram.Their work uses linear approximations to smoothen the sharp edges in a traditional histogram function and enable gradient flow.Sedighi and Fridrich (2017) proposed a similar histogram-based aggregation layer, but use Gaussian kernels as a soft, differentiable approximation to histogram bins.Abdelwahab and Landwehr (2019) introduced quantile layers to capture the distribution of learned features based on an approximation of the quantile function, and empirically showed that this outperforms aggregation using histograms.The contribution of our paper is to exploit quantile layers in metric learning, by defining distributional embeddings based on approximations of quantile functions and deriving loss functions for metric learning based on comparing the resulting distributions.
There is a large body of work on deep metric learning that studies different network architectures and loss functions.For example, Hadsell et al (2006) introduced a loss for a siamese network architecture that is based on all possible pairs of instances in the training data, and its objective is to minimize distances between positive pairs (same class) while maximizing the distances between negative pairs (different classes).More recently, Schroff et al (2015) introduced the triplet loss, with links positive and negative pairs by an anchor instance.This idea has later been extended by Oh Song et al (2016) and Sohn (2016) by providing several negative pairs linked to one positive pair to the loss function.The loss function introduced by Sohn (2016) has shown superior performance in several studies (Sohn, 2016;Wu et al, 2017;Yuan et al, 2017).Our method builds on these established deep metric learning techniques, but extends them by replacing vector embeddings with distributional embeddings, which requires corresponding changes in distance calculations and the loss function.
Distributional embeddings have also been studied in natural language processing in the context of word embeddings.Traditional word embedding models such as word2vec represent words as vectors in a metric space such that semantically similar words are mapped to similar vectors (Mikolov et al, 2013).Vilnis and McCallum (2015) extend this idea by mapping each word to a Gaussian distribution (with diagonal covariance), which naturally characterizes uncertainty about the embedding.Athiwaratkun and Wilson (2017) further extend this model by replacing the Gaussian distribution with a mixture of Gaus-sians, where the multimodal mixture can capture multiple meanings of the same word.The motivation for these distributional embeddings is somewhat different from our motivation in this paper: while the distribution in our model results from the inner structure of the instance being mapped (distribution of patterns within a sequence), the distribution in the model by Vilnis and Mc-Callum (2015) captures remaining uncertainty and is inferred during training.Another difference in the work by Vilnis and McCallum (2015) is that their model is trained in an unsupervised fashion, while we study supervised metric learning.An approach similar to that of Vilnis and McCallum (2015) has also been taken by Bojchevski and Günnemann (2018) in order to map nodes of an attributed graph onto Gaussian distributions that function as an embedding representation.This is again an unsupervised approach, and specific to the task of node embedding.
More generally, deep metric learning models have been recently used in different application domains featuring sequential data, including natural language processing (Mueller and Thyagarajan, 2016;Neculoiu et al, 2016), computer vision (McLaughlin et al, 2016;Wu et al, 2018) and speaker identification (Li et al, 2017;Chung et al, 2018), but these approaches are based on vector embeddings rather than distributional embeddings.

Quantile Layers and Distributional Sequence Embeddings
This section reviews quantile layers as introduced by Abdelwahab and Landwehr (2019) and discusses how they can be used to define distributional embeddings of variable-length sequences.
In this paper, we focus on variable-length sequences and deep convolutional neural network architectures that produce embeddings of such sequences.Typically, network architectures for such sequences would employ stacked convolution layers to extract informative features from the sequence, and in the last layer use some form of global pooling to transform the remaining variablelength representation into a fixed-length vector representation.Global pooling achieves this transformation by performing a simple aggregate operation such as taking the maximum or average over the filter activations across the sequence.This has the potential disadvantage that most information about the distribution of the filter activations is lost, which might be informative for the task at hand.In contrast, quantile layers aim to preserve as much information as possible about the distribution of filter activations along the sequence by approximating the quantile function of this distribution.Earlier work has shown that this information can be informative for sequence classification, substantially increasing predictive accuracy (Abdelwahab and Landwehr, 2019).
In this paper, we use quantile layers for defining distributional embeddings of sequences.We assume that instances are given by variable-length sequences of the form s = (x 1 , ..., x T ) where x t ∈ R D is a vector of attributes that describes the sequence element at position t.We denote the space of all such sequences with D attributes by S D = ∞ T =1 R T ×D .When a sequence is pro-cessed by a convolutional deep neural network architecture Γ, which we take to be the network without any final global aggregation layers, the result is a variable-length representation of the instance over K filters.We denote this mapping by Γ : S D → S K .Details of the deep convolutional architectures we employ are given in Section 5.For s ∈ S D and k ∈ {1, ..., K} we will use Γ k (s) to denote the variable-length sequence of activations of filter k produced by the network for sequence s.
As in Abdelwahab and Landwehr (2019) we use quantile functions in order to characterize the distribution of filter activations across the sequence Γ k (s).Let x ∈ R be a real-valued random variable, let p(x) denote its density and F (x) its cumulative distribution function.The quantile function for x is defined by ≥ r} where inf denotes the infimum.If F is continuous and strictly monotonically increasing, Q is simply the inverse of F .
Let X = {x 1 , ..., x N } be a sample of the random variable x, that is, x n ∼ p(x) for n ∈ {1, ..., N }.The empirical quantile function QX : (0, 1] → R is a non-parametric estimator of the quantile function Q.It is defined by where is the empirical cumulative distribution function and I(x i ≤ x) ∈ {0, 1} is an indicator.QX (r) is a piecewise constant function that is essentially obtained by sorting the samples in X .More formally, let π be a permutation that sorts the x i , that is, , where x denotes the smallest integer larger or equal to x.The empirical quantile function QX faithfully approximates the quantile function Q in the sense that | QX (r) − Q(r)| converges almost surely to zero if N → ∞ and Q is continuous at r (Resnick, 2013).
To enable gradient flow in end-to-end learning, we will work with a piecewise linear interpolation of the piecewise constant function QX (r).For i ∈ {1, ..., N } and r ∈ [ n−1 N , n N ] we can define a linear approximation by where we define x π(N +1) = x π(N ) to handle the right interval border.Combining the linear approximations over the different n, we obtain for r ∈ [0, 1] the piecewise linear approximation where δ(r, n) is an indicator function that is defined as one if r ∈ [ n−1 N , n N ] and zero otherwise.The piecewise linear approximation QX (r) of the quantile function depends on the sample size N , because there are N linear segments.To arrive at an approximation of the quantile function that is independent of the number of samples, we define a further piecewise linear approximation of QX (r) using M sampling points σ(α 1 ), ..., σ(α M ), where σ(α) = (1 + exp(−α)) −1 is the sigmoid function and α i ∈ R are parameters with α i ≤ α i+1 .Formally, we define where )] and zero otherwise, and we have introduced α 0 = −∞ and α M +1 = ∞ to handle border cases.The function QX (r) provides a piecewise linear approximation of the quantile function using M +1 line segments, independently of the sample size N .The parameters α i are learnable model parameters in the deep neural network architectures that we study in Section 5.
We are now ready to define the distributional embedding for an instance, which is obtained by passing the instance through the neural network Γ and for each filter in the output of Γ approximating the quantile function of the filter activations by the piecewise linear function Q.
Definition 1 (Distributional embedding of sequence) Let s ∈ S D and let Γ denote a convolutional neural network structure.The distributional embedding of sequence s is given by the vector of piecewise linear functions where QΓ k (s) is defined by Equation 2 using X = Γ k (s).Here, we slightly generalize the notation by identifying the sequence of observations Γ k (s) with the corresponding set of observations.
We note that due to the piecewise linear approximations, gradients can flow through the entire embedding architecture, both to parameters α m and the weights in the deep neural network structure Γ.This includes the sorting operation, where gradients can be passed through by reordering the gradient backpropagated from the layer above according to the sorting indices π.

A Wasserstein Loss for Distributional Embeddings
For training the embedding model, we will use deep metric learning approaches which train model parameters such that instances of the same class are close and instances of different classes are far apart in the embedding space.In order to apply such approaches, a distance metric needs to be defined on the embedding space.

Distances Between Distributional Embeddings
As discussed in Section 3, in our setting embeddings of instances are given by distributions.Measuring the distance between two embeddings thus means comparing their respective distributions.Different approaches to measure distances between probability distributions have been discussed in the literature.One of the most widely used distance functions between distributions is the Kullback-Leibler divergence.However, this measure is asymmetric and can result in infinite distances, and is therefore not a metric.A metric based on the Kullback-Leibler divergence is the square root of the Jensen-Shannon divergence, which is symmetric, bounded between zero and log(2), and satisfies the triangle inequality.However, this metric does not yield useful gradients in case the distributions being compared have disjoint support, which in our case would occur if two sequences with non-overlapping ranges of filter values are compared.To illustrate, let q 1 and q 2 denote densities with disjoint support A 1 and A 2 , and let m(x) = q1(x)+q2(x)

2
. Then the Jensen-Shannon divergence J of q 1 and q 2 is independently of the distance between A 1 and A 2 , resulting in a gradient of zero.
A different class of distance functions which are increasingly being studied in machine learning (Frogner et al, 2015;Gao and Kleywegt, 2016;Arjovsky et al, 2017) are Wasserstein distances.Wasserstein distances are based on the idea of optimal transport plans.They do not suffer from the zero-gradient problem exhibited by the Jensen-Shannon divergence, because they take into account the metric of the underlying space.They also guarantee continuity under mild assumptions, which is not the case for the Jensen-Shannon divergence as illustrated by Arjovsky et al (2017).In the general case, the p-Wasserstein distance (for p ∈ N) between two probability measures ρ 1 and ρ 2 over a space M with metric d can be defined as where J (ρ 1 , ρ 2 ) denotes the set of all joint measures on M×M with marginals ρ 1 and ρ 2 .For the purpose of this paper, we are interested in the case of real-valued random variables.If q 1 (x 1 ) and q 2 (x 2 ) are two densities defining distributions over real-valued random variables, x i ∈ R, the p-Wasserstein distance between q 1 and q 2 is given by 1  1 2 3 4 5 6 7  1 2 3 4 5 6 7  2 3 4 5 6 7  q q 1 2 3 Fig. 1 According to the Wasserstein metric, distributions q 1 and q 2 are closer than q 1 and q 3 , while distances would be identical under the Jensen-Shannon measure.
where J (q 1 , q 2 ) defines the set of all joint distributions over x 1 , x 2 which have marginals q 1 and q 2 .A joint distribution q ∈ J (q 1 , q 2 ) can be seen as a transport plan, that is, a way of moving probability mass from density q 1 such that the resulting density is q 2 , in the sense that q(x 1 , x 2 ) indicates how much mass is moved from q 1 (x 1 ) to q 2 (x 2 ).The quantity |x 1 − x 2 | p q(x 1 , x 2 )dx 1 dx 2 is the cost of the transport plan, which depends on the amount of probability mass moved, q(x 1 , x 2 ), and the distance by which the mass has been moved, |x 1 − x 2 | p .The infimum over the set J (q 1 , q 2 ) means that the distance between the distributions is given by the optimal transport plan, which intuitively characterizes the minimum changes that need to be made to q 1 in order to transform it into q 2 .For p = 1 the distance is therefore also called the Earth Mover Distance.The advantage of this measure is that it takes into account the metric in the underlying space, as can be seen from Figure 1.Here, q 1 is closer to q 2 than it is to q 3 in the sense that the probability mass needs to be moved less far.Thus, W p (q 1 , q 2 ) < W p (q 1 , q 3 ), while the Jensen-Shannon distances between the two pairs of distributions would be identical.
Because Wasserstein distances are defined in terms of optimal transport plans, computing them in general requires solving non-trivial optimization problems.However, for the case of real-valued random variables x i ∈ R, there is a simple closed-form solution to the infimum in Equation 7. Let x 1 ∼ q 1 , x 2 ∼ q 2 with x i ∈ R. According to Cambanis et al (1976), the function K(x 1 , x 2 ) = |x 1 − x 2 | p for p ≥ 1 is quasi-antitone and therefore the infimum of the expectation of this function over the set of all joint distributions, inf q∈J (q1,q2) E[K(x 1 , x 2 )], is given by is the quantile function to the density q i .We can thus rewrite Equation 7 as We now define the distance between two embeddings Ψ Γ (s) and Ψ Γ (s ) as the Wasserstein distance between the approximate representation of the quantile functions in the embedding as defined by Definition 1, summed over the different filters k.Definition 2 Let s, s ∈ S D , let Γ denote a convolutional neural network architecture, and let Ψ Γ (s) and Ψ Γ (s ) denote the distributional embeddings of s, s as defined by Definition 1. Then we define the distance between the embeddings as The next proposition gives a closed-form result for computing d p (Ψ Γ (s), Ψ Γ (s )).
Proposition 1 Let s, s ∈ S D , let Γ denote a convolutional neural network architecture, let Ψ Γ (s) and Ψ Γ (s ) denote the distributional embeddings of s, s , and let d p (Ψ Γ (s), Ψ Γ (s )) denote their distance as defined by Definition 2. Then where a X ,i and b X ,i for X ∈ {Γ k (s), Γ k (s )} are defined by Equations 3 and 4, σ is the sigmoid function, and as above we have introduced α 0 = −∞ and α M +1 = ∞ to handle border cases.
Proof (Proposition 1) Starting from Definition 2 and plugging in QΓ k (s) as defined by Equation 2, we see that where in Equation 12we use the notation In Equation 11 we integrate over subintervals [σ(α i ), σ(α i+1 )] of the interval [0, 1], and can therefore remove the indicator function δ(r, i).In Equation 12we solve the integral, where we exploit that according to product and chain rules ∂ ∂r The claim directly follows from Equation 12.
An important note with respect to the distance function d p (Ψ Γ (s), Ψ Γ (s )) is that its closed-form computation given by Proposition 1 allows gradients to be propagated through distance computations (as well as through embedding computations as discussed in Section 3) to the parameters of the model Γ defining the embedding.Moreover, all computations can be expressed using standard building blocks available in common deep learning frameworks, such that all gradients are available through automatic differentiation.

Loss Function
Deep metric learning trains models with loss functions that drive the model towards minimizing distances between pairs of instances from the same class (positive pairs) while maximizing distances between pairs of instances from different classes (negative pairs).Existing approaches differ in the way negative and positive pairs are selected and the exact formulation of the loss.For example, triplet-based losses as introduced by Schroff et al (2015) compare the distance between an anchor instance and another instance from the same class (positive pair) to the distance between the anchor instance and an instance from a different class (negative pair).However, comparing a positive pair with only a single negative pair does not take into account the distance to other classes and can thereby lead to suboptimal gradients; more recent approaches therefore often consider several negative pairs for each positive pair (Oh Song et al, 2016;Sohn, 2016).Inspired by these approaches, we consider several negative pairs for each positive pair, leading to a loss function of the form (s 1 , s 2 , s 3 , s 4 ) where P ⊂ S D × S D is a set of positive pairs and N ⊂ S D × S D is a set of negative pairs of instances, and (s 1 , s 2 , s 3 , s 4 ) is a loss function that penalizes cases in which a negative pair (s 3 , s 4 ) has smaller distance than a positive pair (s 1 , s 2 ).A straightforward linear formulation of the loss would be (s However, only pairs of pairs that violate the distance criterion should contribute to the loss, leading to (s 1 , s 2 , s 3 , s 4 ) = max(0, d p (Ψ Γ (s 1 ), Ψ Γ (s 2 )) − d p (Ψ Γ (s 3 ), Ψ Γ (s 4 ))).We further replace this loss by a smooth upper bound using log-sum-exp, leading to our final Wasserstein-based loss function L = (s1,s2)∈P (s3,s4)∈N s3∈{s1,s2} log 1 + exp dp(ΨΓ(s1),ΨΓ(s2))−dp(ΨΓ(s3),ΨΓ(s4)) .(13) Equation 13 is of similar structure as other losses used in the literature, including the angular triplet loss (Wang et al, 2017), the lifted structured loss (Oh Song et al, 2016), and the N-pair loss (Sohn, 2016).It remains to specify how positive pairs P and negative pairs N are sampled for each stochastic gradient descent step.We use the approach of Sohn (2016) for generating P and N , which has been shown to give state-of-the-art performance in several studies (Sohn, 2016;Wu et al, 2017;Yuan et al, 2017), in particular outperforming triplet-based sampling (Schroff et al, 2015) and lifted structure sampling (Oh Song et al, 2016).The approach constructs a batch of size 2N (where N is an adjustable parameter) by sampling from the training data N pairs of instances P = {(s i , s + i )} N i=1 from N different classes, such that each pair (s i , s + i ) is a positive pair from a different class.From the sampled batch, a set of N (N − 1) negative pairs is constructed by setting . Note that Equation 13 can be computed by first computing the embeddings of the 2N instances in the batch, and then computing the overall loss.Thus, although the computation is quadratic in N , the number of evaluations of the deep neural network model Γ is linear in the batch size.

Empirical Study
We empirically study the proposed method in three biometric identification domains involving human eye movements, accelerometer-based observation of human gait, and EEG recordings.As an ablation study, we specifically evaluate which impact the different components of our proposed method -the metric learning approach, the use of quantile layers to fit the distribution of activations of filters across a sequence, and the Wasserstein-based distance function -have on overall performance.

Data Sets
We study biometric identification based eye movements, the gait, or the EEG signal of a subject.In all domains, the data consist of sequential observations of the corresponding low-level sensor signal -gaze position from an eye tracker, accelerometer measurements, or EEG measurements -for different subjects.
The task is to identify the subject based on the observed sensor measurements.
The Dynamic Images and Eye Movements (DIEM) dataset (Mital et al, 2011) contains eye movement data of 210 subjects each viewing a subset of 84 video clips.The video clips are of varying length with an average of 95 seconds and contain different visual content, such as excerpts from sport matches, documentary videos, movie trailers, or recordings of street scenes.The data contain the gaze position on the screen for the left and the right eye, as well as a measurement of the pupil dilation, at a temporal resolution of 30 Hz.The eye movement data of a particular individual on a particular video clip is thus given by a sequence of six-dimensional vectors (horizontal and vertical gaze coordinate for left and right eye plus left and right pupil dilation), that is, D = 6 in the notation of Section 3. The average sequence length is 2850 and there are 5381 sequences overall.
The gait data we use come from a study by Ihlen et al (2015) who collected the daily movement activity of 71 subjects for a period of 3 consecutive days.The recorded data consists of time series of 3D accelerometer measurements recorded at a sampling rate of 100Hz.For each point in time, the measurement is a D = 6 dimensional vector consisting of the acceleration and velocity in x, y, and z direction.In the original data set, a continuous measurement for 3 days has been carried out for each individual.These long measurements contain different activities, but also long idle periods (for example, during sleep).We concentrate on subsequences showing high activity, by dividing the entire recording for each subject into intervals of length one minute, and then selecting for each subject the 30 subsequences that had the largest standard deviation in the 6-dimensional observations.This resulted in 2130 sequences overall (30 for each of the 71 subjects), with a length of T = 6000 per sequence.
The EEG data we use come from a study by Zhang et al (1995) who conducted EEG recording sessions with 121 subjects, measuring the signal from 64 electrodes placed on the scalp at a temporal resolution of 256Hz of the subjects while viewing an image stimulus.The original aim of the study was to find a correlation between EEG observations and genetic predisposition to alcoholism, but as subject identifiers are available for all recordings the data can also be used in a biometric setting.Each subject completed between 40 and 120 trials with 1 second of recorded data per trial.The resulting data therefore consist of sequences of D = 64 dimensional vectors with a sequence length of 256 (one trial for one subject).

Problem Setting
As usual in metric learning, we study a setting in which there are distinct sets of subjects at training and test time.The embedding model is first trained on a set of training subjects.On a separate and disjoint set of test subjects, we then evaluate to what degree the learned embedding assigns small distances to pairs of test sequences from the same subject, and large distances to pairs of sequences from different subjects.This reflects an application setting in which new subjects are registered in a database without retraining the embedding model.It also naturally allows the identification of imposters, that is, subjects who have never been observed (neither during training nor in the database of registered subjects) and try to gain access to the system.In all three domains, we therefore first split the data into training and test data, such that there is no overlap in subjects between the two.For training the embedding model, we use data of 105 of the 210 subjects (eye movements), 36 of 71 subjects (gait data), or 61 of 121 subjects (EEG data).For the eye movement domain, we additionally ensure that there is no overlap in visual stimulus (video clips) between training and test data by splitting the set of all videos into training and test videos and only keeping the respective sequences in the training and test data.During training, each sequence constitutes an instance and the subject its class, and we train either embedding models using metric learning as discussed in Section 4 or, as a baseline, multiclass classification models (see Section 5.3 for details).We also set apart the data of 20% of the training individuals as validation data to tune model hyperparameters.
At test time, we simulate a biometric application setting by first sampling, for each test subject, a random subset of the sequences available for that subject as instances that are put in an enrollment database.We then simulate that we observe additional sequences from a subject which are compared to the sequences of all subjects in the enrollment database.An embedding is good if the distance between these additional sequences and the enrollment sequences of the same subject is low, compared to the distance to the enrollment sequences of other subjects.More precisely, for each subject we use all except five of the sequences available for that subject as enrollment sequences.We then study how well the subject can be identified based on observing n of the remaining sequences, for n ∈ {1, .., 5}.Given observed sequences s 1 , ..., s n (representing a subject that is unknown at test time), we compute distances to all subjects j as d j = 1 n n i=1 d(s i , s ij ) where s ij is the sequence of subject j in the enrollment database with minimal distance to s i .Here, the definition of the distance function d is method-specific (see below for details).
We first study a verification scenario.This is the binary problem of deciding if the observed sequences s 1 , ..., s n match a particular subject j, by comparing the computed distance d j to a threshold value.Varying the threshold trades of false-positive versus false-negative classifications, yielding a ROC curve and AUC score.Note that the verification scenario also covers the setting in which in imposter is trying to get access to a system as a particular user; the falsepositive rate is the rate at which such imposters would be accepted.
We then study a multiclass identification scenario, where we use the model to assign the observed sequences s 1 , ..., s n to a subject enrolled in the database (the subject j * = arg min j d j ).This constitutes a multiclass classification problem for which (multiclass) accuracy is measured.In this experiment, we also vary the number of subjects under study, by randomly sampling a subset of subjects which are enrolled in the database; the same subset of subjects is observed at test time.The identification problem becomes more difficult as the number of subjects increases.
We finally study the robustness of the model to imposters in the multiclass identification scenario, an experiment we denote as multiclass imposters.This reflects applications in which access to a system does not require a user name, because the system tries to automatically identify who is trying to gain access.
In this experiment, half of the test subjects play the role of imposters who are not registered in the enrollment database.As in the multiclass identification setting, observed sequences are matched to the enrolled subject with minimum distance.This minimum distance is then compared to a threshold value; if the threshold is exceeded, the match is rejected and the observed sequences are classified as belonging to an imposter.Varying the threshold trades off false-positives (match of imposter accepted) versus false-negatives (match of a subject enrolled in the database rejected), yielding a ROC curve and AUC.Correctly rejecting imposters is harder in this setting because it suffices for an imposter to successfully impersonate any enrolled subject.In this experiment we also vary the number of subjects enrolled in the database.
In all three scenarios, the split of sequences into enrollment and observed sequences is repeated 10 times to obtain standard deviations of results.Moreover, accuracies and AUCs will increase with increasing n, as identification becomes easier the more data of an unknown subject is available.
time, distance is given by negative cosine similarity.This baseline uses metric learning, but neither quantile layers nor the Wasserstein-based loss function.QP-CLS: This baseline uses the same network architecture and flattened quantile embedding as QP-NPL, but feeds the flattened embedding vector into a dense classification layer with softmax activation.The models is trained in a classification setting using multiclass crossentropy.Distance at test time is given by negative cosine similarity.This model is identical to the model presented in Abdelwahab and Landwehr (2019), except that we remove the final classification layer at test time to generate embeddings for novel subjects.
For all methods, training is carried out using the Adam optimizer with learning rate 0.0001 for 50000 iterations, and the regularizer of the PReLU activation function is tuned as a hyperparameter on the validation set as in (Abdelwahab and Landwehr, 2019).

Results
We present and discuss empirical results for the different domains in turn.

Eye Movements
Table 1, upper third, shows area under the ROC curve for all methods and varying number n of observed sequences in the eye movement domain.Comparing QP-WL and QP-NPL, we observe that the Wasserstein-based loss introduced in Section 4, which works on the distributional embedding given by the piecewise linear approximations of the quantile functions, clearly outperforms flattening the distributional embedding and using N -pair loss.Comparing MP-NPL with QP-NPL and QP-WL shows that using quantile layers improves accuracy compared to max-pooling even if the quantile embedding is flattened (and more so if Wasserstein-based loss is used).Classification training (QP-CLS) reduces accuracy compared to metric learning (QP-NPL).As expected, AUC increases with the number n of sequences observed at test time.Figure 2 shows ROC curves in the verification setting for n = 5.
Figure 3 (left) shows multiclass identification accuracy for n = 5 observed sequences as a function of the fraction of the 105 subjects who are enrolled.Relative results for the different methods are similar as in the verification setting.Accuracy decreases slightly when more subjects are enrolled, as the multiclass problem becomes more difficult.Figure 3 (right) shows the robustness of the model to multiclass imposters as a function of the fraction of the 105 subjects who are enrolled (up to 50%, as half of the subjects are imposters).We observe that QP-WL is much more robust to imposters than the baseline methods.
In the eye movement domain, we also compare against the state-of-the-art model by Jager et al ( 2019 compute from our raw data.We replicate the setting of Jager et al ( 2019) by training the model using multiclass classification and using the last layer before the classification layer as the embedding at test time.The Jager et al. ( 2019) architecture cannot deal with variable-lenght sequences, we therefore split the variable-length sequences in our data into shorter sequences of fixed length, namely the length of the shortest sequence (27 seconds).For a fair comparison, we also simplify the data for our model in this experiment: only the average gaze point rather than left and right gaze point separately, removing pupil dilation, and using the same fixed-length sequences.Figure 4 shows ROC curves for the verification scenario (left) and identification accuracy (center) as well as AUC in the imposter scenario for our model QP-WL on the simplified data and Jager et al. ( 2019).Comparing to Figure 2 and Figure 3 we observe that accuracies are reduced for our model by using the simplified data, but the model still outperforms Jager et al. ( 2019) by a wide margin.We that the model of Jager et al ( 2019) is focused on microsaccades, which are likely not detectable in our data due to the low temporal resolution (30Hz compared to 1000Hz in the study by Jager et al ( 2019)), which might explain the relatively poor performance of the model on our data.

Gait
Table 1, center third, shows area under the ROC curve for all methods and varying number n of observed sequences in the gait domain.We observe the ordering in terms of relative performance between the different methods as in the eye movements domain, with clear benefits when using the proposed loss function based on Wasserstein distance (QP-WL versus QP-NPL), when using quantile layers instead of max-pooling aggregation (QP-WL and QP-NPL versus MP-NPL), and when using metric learning rather than classificationbased training (QP-NPL versus QP-CLS). Figure 5 shows ROC curves for verification at n = 5 in the gait domain.but the difference between QP-WL and QP-NPL less pronounced.Figure 6 (right) shows robustness to multiclass imposters, with again a clear advantage of QP-WL over the baselines.

EEG
Table 1, bottom third, shows area under the ROC curve for all methods and varying number n of observed test sequences in the EEG domain.Relative performance of methods is generally similar as in the other two domains.QP-WL clearly outperforms the closest baseline, reducing 1-AUC by between 56% = 1) and 80% (n = 5).Figure 7 shows ROC curves in the verification setting.Figure 8 (left) and Figure 8 (right) show identification accuracy as a function of the fraction of subjects enrolled and robustness of the models to multiclass imposters.As in the gait domain, differences are more pronounced in the latter setting.

Conclusions
We developed a model for distributional embeddings of variable-length sequences using deep neural networks.Building on existing work on quantile layers, the model represents an instance by the distribution of the learned deep features across the sequence.We developed a distance function for these distributional embeddings based on the Wasserstein distance between the corresponding distributions, and from this distance function a loss function for performing metric learning with the proposed model.A key point about the model is end-to-end learnability: by using piecewise linear approximations of the quantile functions, and based on those providing a closed-form solution for the Wasserstein distance, gradients can be traced through the embedding and loss calculations.In our empirical study, distributional embeddings outperformed standard vector embeddings by a large margin on three data sets from different domains.

Fig. 2 Fig. 3
Fig. 2 ROC curves in the eye movement domain for all methods using n = 5 observed sequences.Shaded region in ROC curves indicates standard error.

Fig. 4
Fig. 4 Comparison between QP-WL and Jager et al. (2019) in the eye movement domain: area under ROC curve in verification scenario (left), identification accuracy in multiclass identification scenario (center), and robustness of model to multiclass imposters (right).In this experiment, the data is simplified for both methods to match the requirements of Jager et al. (2019), see text for details.Results of QP-WL therefore differ from results presented in Figure 2 and Figure 3. Error bars indicate the standard error.

Fig. 5 Fig. 6
Fig. 5 ROC curves in the gait domain for all methods using n = 5 observed sequences.Shaded region in ROC curves indicates standard error.

Fig. 7 Fig. 8
Fig. 7 ROC curves in the EEG domain for all methods using n = 5 observed sequences.Shaded region in ROC curves indicates standard error.