Learning similarity measures from data

Defining similarity measures is a requirement for some machine learning methods. One such method is case-based reasoning (CBR), where the similarity measure is used to retrieve the stored case or set of cases most similar to the query case. Describing a similarity measure analytically is challenging, even for domain experts working with CBR experts. However, datasets are typically gathered as part of constructing a CBR or machine learning system. These datasets are assumed to contain the features that correctly identify the solution from the problem features; thus they may also contain the knowledge needed to construct or learn such a similarity measure. The main motivation for this work is to automate the construction of similarity measures using machine learning while keeping training time as low as possible. Our objective is to investigate how to apply machine learning to effectively learn a similarity measure. Such a learned similarity measure could be used for CBR systems, but also for clustering data in semi-supervised learning or in one-shot learning tasks. Recent work has advanced towards this goal, but relies on either very long training times or manually modeling parts of the similarity measure. We created a framework to help us analyze current methods for learning similarity measures. This analysis resulted in two novel similarity measure designs: one design uses a pre-trained classifier as the basis for a similarity measure, and the second design uses as little modeling as possible while learning the similarity measure from data and keeping training time low. Both similarity measures were evaluated on 14 different datasets. The evaluation shows that using a classifier as the basis for a similarity measure gives state-of-the-art performance. Finally, the evaluation shows that our fully data-driven similarity measure design outperforms state-of-the-art methods while keeping training time low.


Introduction
Many artificial intelligence and machine learning (ML) methods, such as k-nearest neighbors (k-NN), rely on a similarity (or distance) measure [21] between data points. In case-based reasoning (CBR), a simple k-NN or a more complex similarity function is used to retrieve the stored cases that are most similar to the current query case. The similarity measure used in CBR systems for this purpose is typically built as a weighted Euclidean similarity measure (or as a weight matrix for discrete and symbolic variables). Such a similarity measure is designed with the assistance of domain experts by adjusting the weights for each attribute of the cases to represent how important they are (one example can be seen in [32]; the approach is described generally in chapter 4 of [3]). In many situations the design of such a function is non-trivial. Domain experts with an understanding of CBR or machine learning are not easily available. However, before or during most CBR projects, data is gathered that relates to the problem being solved by the CBR system. This data is used to construct cases for populating the case base. If the data is labeled according to the solution/class, it can be used to learn a similarity measure that is relevant to the task being solved by the system. Exploring efficient methods of learning similarity measures and improving on them is the main motivation of this work.
Fig. 1: Illustration of problem and solution spaces [19]. p_y and p_z are two problem descriptions with features describing a problem, each of which has a corresponding solution (s_y and s_z) in solution space. δ_p illustrates the distance between a new problem p_x and a stored problem p_y. Correspondingly, δ_s is the distance between the solution s_y and the solution s_x, which is the (unknown) ideal solution to p_x. A fundamental assumption in CBR is that if the similarity between p_x and p_y is high, then the similarity between the unknown solution s_x and s_y is high (δ_p ≈ δ_s): similar problems have similar solutions.
In the CBR literature, similarity measurement is often described in terms of problem and solution spaces. The problem space is where the features of a problem describe the problem; this is often called feature space in non-CBR ML literature. The solution space, also referred to as target space, is populated by points describing solutions to points in the problem space. The function that maps a point from the problem space to its corresponding point in the solution space is typically the goal of supervised machine learning. This is illustrated in Figure 1.
A similarity measure in the problem space represents an approximation of the similarity between two cases or data points in the solution space (i.e. whether these two cases have similar or dissimilar solutions). Such a similarity measure would be of great help in situations where lots of labeled data is available but domain knowledge is not, or when the modeling of such a similarity measure is too complex.
Learned similarity measures can also be used in other settings, such as clustering. Another relevant method type is semi-supervised learning, in which the labeled part of a dataset is used to cluster or label the unlabeled part.
How to automatically learn similarity measures has been an active area of research in CBR. For instance, Gabel et al. [10] train a similarity measure by creating a dataset of collated pairs of data points and their respective similarities. This dataset is then used to train a neural network to represent the similarity measure. In this method the network needs to extract the most important features in terms of similarity for both data points, then combine these features to output a similarity measure. Recent work (e.g. Martin et al. [22]) has used Siamese neural networks (SNNs) [5] to learn a similarity measure in CBR. SNNs have the advantage of sharing weights between two parts of the network, in this case the two parts that extract the useful information from the two data points being compared. All of these methods for learning similarity measures have in common that they are trained to compare two data points and return a similarity measurement. Our work on automatically learning similarity measures is also related to the work done by Hüllermeier et al. on preference-based CBR [15,14]. In that work the authors learn a preference of similarity between cases/data points, which represents a more continuous space between solutions than a typical similarity measure in CBR. This type of approach is similar to learning similarity measures with machine learning models, in that both can always return a list of possible solutions sorted by their similarity.
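The collated-pair construction described above can be sketched as follows. This is a minimal illustration, not Gabel et al.'s exact formulation: the function name and the binary 0/1 similarity target derived from class labels are our own assumptions.

```python
def pair_dataset(X, labels):
    """Build a training set of collated pairs: each example concatenates
    two data points, with a (hypothetical) target similarity of 1.0 for
    same-class pairs and 0.0 otherwise."""
    data = []
    n = len(X)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # skip trivial self-pairs
            data.append((tuple(X[i]) + tuple(X[j]),
                         1.0 if labels[i] == labels[j] else 0.0))
    return data

X = [(0.1, 0.2), (0.3, 0.1), (0.9, 0.8)]
labels = ["a", "a", "b"]
pairs = pair_dataset(X, labels)
print(len(pairs))  # 6 ordered pairs for N = 3
```

A network trained on such pairs must learn both the feature extraction and the comparison in one pass, since each input vector already contains both data points.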
In this work we have developed a framework to show the main differences between various types of similarity measures. Using this framework, we highlight the differences between existing approaches in Section 3. This analysis also reveals areas that have not received much attention in the research community so far. Based on this, we developed two novel designs for using machine learning to learn similarity measures from data. Both designs are continuous in their representation of the estimated solution space.
The novelty of our work is three-fold: First, we show that using a classifier as the basis for a similarity measure gives adequate performance. Second, we demonstrate that a similarity measure designed to use as little modeling as possible, while keeping training time low, outperforms state-of-the-art methods. Finally, to analyze the state of the art and compare it to our new similarity measure design, we introduce a simple mathematical framework and show how it is a useful tool for analyzing and categorizing similarity measures.
The remainder of this paper describes our method in more detail. Section 2 describes the novel framework for similarity measure learning, and Section 3 then summarizes previous relevant work in relation to this framework. In Section 4 we describe suggestions for new similarity measures and how we design the experimental evaluation. Subsequently, in Section 5 we show the results of this evaluation. Finally, in Section 6 we interpret and discuss the evaluation results and give some conclusions. We also present some of the limitations of our work as well as possible future paths of research.

A framework for similarity measures
We suggest a framework for analyzing different functions for similarity, with S as a similarity measure applied to pairs of data points (x, y):

S(x, y) = C(G(x), G(y)) = C(x̂, ŷ)    (1)

where G(x) = x̂ and G(y) = ŷ represent embedding or information extraction from the data points x and y, i.e. G(•) highlights the parts of the data points most useful for calculating the similarity between them. An illustration of this process can be seen in Figure 2. C(G(x), G(y)) = C(x̂, ŷ) models the distance between the two data points based on the embeddings x̂ and ŷ. The functions C and G can each be either manually modeled or learned from data. With respect to this, we will enumerate all of the different configurations of Equation 1 and describe their main properties and how they have been implemented in state-of-the-art research. Note that we will use S(•) to denote the similarity measurement and C(•) for the sub-part of the similarity measurement that calculates the distance between the two outputs of G(•).
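The decomposition in Equation 1 can be made concrete with a small sketch. All concrete choices here (a linear-plus-ReLU embedding for G, an L1-based squashing for C) are illustrative assumptions, not the paper's method:

```python
import numpy as np

def G(x, W):
    """Embedding function: here a simple linear map followed by ReLU.
    In Type 1 it would be hand-modeled; in Types 3/4 it is learned."""
    return np.maximum(W @ x, 0.0)

def C(x_hat, y_hat):
    """Binary operator comparing two embeddings; here an L1 distance
    squashed into (0, 1], so identical embeddings give similarity 1."""
    return 1.0 / (1.0 + np.abs(x_hat - y_hat).sum())

def S(x, y, W):
    """Similarity measure S(x, y) = C(G(x), G(y)) from Equation 1."""
    return C(G(x, W), G(y, W))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # stand-in for modeled or learned parameters
x = rng.normal(size=3)
print(S(x, x, W))             # identical points give similarity 1.0
```

Because C here depends only on |x̂ − ŷ|, this sketch is symmetric by construction, a property discussed below.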

Fig. 2: Illustrating how G(•) from Equation 1 adds another space, the embedding space, between the problem and the solution space [19] (see Figure 1). C(•) then combines the two embeddings of p_y and p_x (e_y and e_x respectively) and calculates the similarity δ_e between them. The main assumption is that distance in embedding space (δ_e) is close to the distance in solution space (δ_s): if the embedded points e_x and e_y are similar, then the true (unknown) solution s_x is similar to solution s_y.
The main contribution of G(•) is to create an embedding space such that the distance in embedding space (δ_e) is a better estimate of the distance in solution space (δ_s) than the distance in problem space (δ_p).
Type 1 A typical similarity measure in CBR systems would model both C(x̂, ŷ) and G(•) from domain knowledge. Such a similarity measure is typically modeled by experts with the relevant domain knowledge together with CBR experts, who know how to encode this domain knowledge into the similarity measure. Consider, for example, modeling the similarity measure of cars for sale, where the goal is to model the similarity of cars in terms of their final selling price. In this example, domain experts may model the embedding function G(•) so that the number of miles driven has a greater importance than the color of the car. C(x̂, ŷ) could be modeled such that a difference in miles driven is less important than a difference in the number of repairs done on the car. More details and examples can be found in [7].

Type 2 This type represents similarity measures that model G(•) and learn the function C(x̂, ŷ). In this context G(•) can be viewed as an embedding function. Since G(•) is not learned from the data, it is not interesting to analyze it as part of learning the similarity measure, as processing the data through G(•) could be done in batch before applying the data to S(x, y). Learning C(x̂, ŷ) needs to be done with a dataset consisting of triplets of the data points x̂ and ŷ, and s being the true similarity between x̂ and ŷ. A special case of Type 2 is when G(•) is set to be the identity function I(x) = G(x) = x, while C(x, y) is learned from data. Examples of this type are presented by Gabel et al. [10], where the similarity measure always looks at the two inputs together, never separately.

Type 3 In this type of similarity measure the embedding/feature extraction G(•) is learned and C(x̂, ŷ) is modeled. Typically the embedding function learned by G(•) resembles the function that is the goal during supervised machine learning. Within the similarity measurement, x̂ = G(x) is used as an embedding vector for calculating similarity, whereas in classification x̂ would be the softmax vector output. Using a pre-trained classification model as a starting point for G(x) = x̂ as input to e.g. C(x̂, ŷ) = ‖x̂ − ŷ‖₁ should give good results for similarity measurements if that model had high precision for classification within the same dataset. However, it is not given that the best embedding vector for calculating similarity is the same as the embedding vector produced by a G(x) trained to do classification.

Type 4 This measure is designed so that both G(•) and C(x̂, ŷ) are learned.
We will design, implement and evaluate similarity measures based on Type 1, Type 2, Type 3 and Type 4 in Section 4. The results will be shown in Section 5.
To allow S to be used as a similarity measurement for clustering, e.g. in k-nearest neighbors, a similarity measure should fulfill the following requirements:

Symmetry: S(x, y) = S(y, x). The similarity between x and y should be the same as the similarity between y and x.
Non-negativity: S(x, y) ≥ 0 for all x, y. The similarity between two data points cannot be negative.
Identity: S(x, y) = 1 ⇐⇒ x = y. The similarity between two data points should be 1 iff x is equal to y.
Some of these requirements are not satisfied by all types of similarity measures; e.g. symmetry is not a direct design consequence of Type 2, but it is of Type 3 if C(x̂, ŷ) is symmetric. Even if symmetry is not present in all similarity measures [30], it is important for reducing training time, as the training set size goes from N(N − 1) ordered pairs to N(N − 1)/2 unordered pairs. Symmetry also enables the similarity measure to be used for clustering.
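The halving of the training set under symmetry can be seen directly by counting pairs; this is a small illustrative sketch using the standard library:

```python
from itertools import combinations, permutations

def ordered_pairs(points):
    """All ordered pairs (x, y) with x != y: N(N-1) training examples,
    needed when the similarity measure is not symmetric."""
    return list(permutations(points, 2))

def unordered_pairs(points):
    """Unordered pairs: N(N-1)/2 examples suffice when S(x, y) = S(y, x)."""
    return list(combinations(points, 2))

points = list(range(10))             # N = 10
print(len(ordered_pairs(points)))    # 90 = N(N-1)
print(len(unordered_pairs(points)))  # 45 = N(N-1)/2
```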
In the next section, we will relate current state of the art to the framework in context of the different types.

Related work
To exemplify the framework presented in the previous section, we relate previous work to it and to the types of similarity measures that derive from it. This will also enable us to see possibilities for improvement and further research.
As stated in Section 1, our motivation is to automate the construction of similarity measures, and to do so while keeping training time as low as possible. Thus we will not focus on Type 1 similarity measures, as this type uses no learning. Both Type 2 and Type 4 require a different type of training dataset than a typical supervised machine learning dataset, as C(x, y) is typically dependent on the order of the data points (see Section 4). Given our initial motivation, Type 3 similarity measures therefore seem to be the most promising type to focus on. However, it is worth investigating similarity measures of Type 4, to see if the added benefit of learning C(x, y) outweighs the added training time, or if it is possible to make it symmetric (as defined in the previous section) so that the training time could become similar to Type 3.
Thus we will focus on summarizing related work from Type 3 similarity measures, but also add relevant work from Type 1, Type 2 and Type 4 for reference.
Type 1 is a type of similarity measure which is manually constructed. A general overview and examples of this type of similarity measure can be found in [7]. Nikpour et al. [23] present an alternative method which includes enrichment of the cases/data points via Bayesian networks.
Type 2 In Type 2 similarity measures only the binary operator C(x, y) of the similarity measure S(x, y) is learned, while G(•) is either modeled or left as the identity function (G(x) = I(x) = x). Stahl et al. have done a lot of work on learning Type 2 similarity measures from data or user feedback. In all of their work they formulate C(x, y) = Σ_i w_i · sim_i(x_i, y_i), where for each feature i, sim_i is the local similarity measure and w_i is the weight of that feature. In [27] Stahl et al. describe a method for learning the feature weights.
In [28] Stahl et al. introduce learning local similarity measures through an evolutionary algorithm (EA). First they learn attribute weights (w_i) by adopting the method previously described in [27]. Then they use an EA to learn the local similarity measures for each feature (sim_i(x, y)). In [29] Stahl and Gabel present work where they learn the weights of a modeled similarity measure, and the local similarity for each attribute, through an ANN. Reategui et al. [24] learn and represent parts of the similarity functions (C(x̂, ŷ)) through an ANN. Langseth et al. [18] learn similarity knowledge (C(x̂, ŷ)) from data using Bayesian networks, which still partially relies on modeling the Bayesian networks with domain knowledge.
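The weighted amalgamation C(x, y) = Σ_i w_i · sim_i(x_i, y_i) used in Stahl et al.'s formulation can be sketched as follows; the particular linear local similarity function and the feature ranges are our own illustrative assumptions:

```python
def weighted_global_similarity(x, y, weights, local_sims):
    """Type 2 amalgamation: C(x, y) = sum_i w_i * sim_i(x_i, y_i).
    `local_sims` holds one local similarity function per feature;
    the weights are assumed to be normalized so they sum to 1."""
    return sum(w * sim(xi, yi)
               for w, sim, xi, yi in zip(weights, local_sims, x, y))

def numeric_sim(value_range):
    """Hypothetical linear local similarity for a numeric feature."""
    return lambda a, b: max(0.0, 1.0 - abs(a - b) / value_range)

# Example: two features, the first weighted three times as heavily.
sims = [numeric_sim(100.0), numeric_sim(10.0)]
weights = [0.75, 0.25]
print(weighted_global_similarity((50, 5), (50, 5), weights, sims))  # 1.0
print(weighted_global_similarity((50, 5), (60, 5), weights, sims))  # 0.925
```

Learning then amounts to fitting the weights w_i (as in [27]) and/or the local functions sim_i themselves (as in [28]).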
Abdel-Aziz et al. [1] use the distribution of case attribute values to inform a polynomial local similarity function, which is better than guessing when domain knowledge is missing. This method thus extracts statistical properties from the dataset to parametrize C(x̂, ŷ).
Gabel and Godehardt [10] use a neural network to learn a similarity measure. Their work is done in the context of CBR, which uses the measure to retrieve similar cases. They concatenate the two data points into one input vector. Thus, in the context of our framework, G(•) is modeled as an identity function I(x) = x and C(x, y) is learned.
Maggini et al. [21] use SIMNNs, which they also see as a special case of Symmetry Networks [26] (SNs). In SIMNNs, C(x̂, ŷ) and G(•) are both functions of both data points x and y, so there is no distinct G(•). They also impose a specialized structure on their network to make sure the learned function is symmetric. A SIMNN is in essence an extended version of a Siamese neural network, but without the distinct distance layer usually present in SNN architectures. They focus on the specific properties of the network architecture and the application of such networks in semi-supervised settings such as k-means clustering. The pair of data points (x and y) is compared two times, the first time at the first hidden layer, then at the output layer. Since there are no learnable parameters before this comparison, all the learning is done in C(x̂, ŷ) and G(x) is the activation function of the input layer.
Type 3 One way of looking at a similarity measure is as an inverse distance measure, as similarity is the semantic opposite of distance. There has been much work on learning distance measures. Most of this work can be categorized as Type 3 similarity measures, as the learning task only aims to learn the embedding function G(•), then combines the output of this function with a static C(•) (e.g. an L2 norm function). The most well-known instance of a Type 3 learned distance measure is the Siamese neural network (SNN); it is highly related to the Type 2 similarity measure of Maggini et al.'s similarity neural networks (SIMNN) [21].
The main characteristic of SNNs is sharing the weights between the two identical neural networks. The data points we want to measure the similarity for are then input to these networks. This frees the learning algorithm from learning two sets of weights for the same task. This was first used in [5] (using C(x̂, ŷ) = cos(x̂, ŷ) and G(•) being learned from data) to measure similarity between signatures. Similar architectures are also discussed in [26].
Chopra et al. [6] use an SNN for face verification, and pose the problem as an energy-based model. The outputs of the SNN are combined through an L1 norm (absolute-value norm, C(x̂, ŷ) = ‖x̂ − ŷ‖₁) to calculate the similarity. They emphasize that using an L2 norm (Euclidean distance) as part of the loss function would make the gradient too small for effective gradient descent (i.e. create plateaus in the loss function). This work is closely related to Hadsell et al. [11], where they explain the contrastive loss function used for training the SNN (also used in [6,22]) by analogy to a spring system.
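The contrastive loss of Hadsell et al. [11] has a simple closed form over the embedding distance; this sketch uses a margin of 1.0 and a 1/2 scaling, which are common but illustrative choices:

```python
def contrastive_loss(d, genuine, margin=1.0):
    """Contrastive loss on embedding distance d: pulls genuine pairs
    together (loss grows with d^2) and pushes impostor pairs apart
    until they clear the margin (loss max(0, margin - d)^2)."""
    if genuine:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

print(contrastive_loss(0.2, genuine=True))   # small: similar pair, already close
print(contrastive_loss(0.2, genuine=False))  # large: dissimilar pair, too close
print(contrastive_loss(1.5, genuine=False))  # 0.0: dissimilar pair beyond margin
```

The spring-system analogy is visible here: genuine pairs behave like an attracting spring, impostor pairs like a repelling spring that goes slack beyond the margin.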
Relatedly, Vinyals et al. [31] use a similar type of setup for matching an input data point to a support set. It is framed as a discriminative task, where they use two neural networks to parametrize an attention mechanism. They use these two networks to embed the two data points into a feature space where the similarity between them is measured. However, in contrast to SNNs and SIMNNs, their two networks for embedding the data points are not identical, as one network is tailored to embed a data point from the support set and is also given the rest of the support set. Thus the embedding of the support set data point is also a function of the rest of the support set. With C(x̂, ŷ) modeled as a cosine softmax, this is similar to the examples of Type 3 similarity measures mentioned previously (e.g. [5,4]). However, a major difference is that the signal extraction functions are not equal: S(x, y) = C(f(x), g(y)) where f(•) is not required to equal g(•). Since f(•) and g(•) do not share weights, the architecture is variant (or asymmetric) with respect to the ordering of input pairs. Thus the architecture needs up to twice as much training to achieve the same performance as an SNN.
In much the same fashion as Chopra et al. did in [6], Berlemont et al. [4] use SNNs combined with an energy-based model to build a similarity measure between different gestures made with smartphones. However, they adapt the error estimation from using only separate positive and negative pairs to a training subset including a reference sample, a positive sample, and a negative sample for every other class. They train G(•) while keeping a static C(x̂, ŷ) = cos(x̂, ŷ). This training method of using triplets for training SNNs was also described by Lefebvre et al. [20]. A similar approach can be seen in Hoffer et al. [13]; however, they do not use a set of negative examples per reference point for each class as Berlemont et al. do. Instead they use triplets (x, x⁺, x⁻), x being the reference point, x⁺ being from the same class and x⁻ being from a different class.
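Triplet-based training as in [13,20] optimizes a margin between the anchor-positive and anchor-negative distances; the margin value here is an illustrative choice:

```python
def triplet_loss(d_pos, d_neg, margin=0.2):
    """Triplet loss over (x, x+, x-): the anchor x should be closer to
    the positive x+ than to the negative x- by at least `margin`."""
    return max(0.0, d_pos - d_neg + margin)

print(triplet_loss(d_pos=0.1, d_neg=0.9))  # 0.0: already well separated
print(triplet_loss(d_pos=0.5, d_neg=0.4))  # positive: violation, loss flows back
```

Compared to the contrastive loss, each training example here carries a relative constraint over three points rather than an absolute target for one pair.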
Koch et al. [16] use a convolutional Siamese network (CSN), with G(•) implemented as a CNN and C(x̂, ŷ) implemented as L1(x̂, ŷ). This is done in a semi-supervised fashion for one-shot learning within image recognition. They train this CSN to determine whether two pictures from the Omniglot [17] dataset are within the same class. The model can then be used to classify a data point representing an unseen class by comparing it to a repository of class representatives (support set).
CSNs are also used in the context of CBR by Martin et al. [22] to represent a similarity measure in a CBR system. The CSN is trained with pairs of cases and the output is their similarity. During training they have to label pairs of cases as 'genuine' (both cases belong to the same class) or 'impostor' (the cases belong to different classes). This requires that the user has a clear boundary for the classes. In relation to our framework, this similarity measure learns G(•), while C(x̂, ŷ) is static, with G(•) implemented as a convolutional neural network and C(x̂, ŷ) implemented as Euclidean distance (L2 norm).
In general, using SNNs for constructing similarity measures has a major advantage, as one can easily adopt pre-trained models for G(•) to embed/preprocess the data points. For example, to train a model for comparing two images one could use ResNet [12] for G(•), then use the L1 norm as C(x̂, ŷ). This would be very similar to the similarity measure used by Koch et al. [16] with S(x, y) = ‖G(x) − G(y)‖₁, the main difference being that G(•) is designed for bigger pictures.
Type 4 There are only very few examples of Type 4 similarity measures in the literature. In Zagoruyko and Komodakis's work [33] they investigate different types of architectures for learning image comparison using convolutional neural networks. In all of the architectures they evaluate, C(x̂, ŷ) is learned, but in some of these architectures G(•) is not symmetric, i.e. S(x, y) = C(G(x), H(y)) where G(x) ≠ H(x). Arandjelović and Zisserman [2] use a method very similar to many Type 3 similarity measures for calculating similarity. However, their input data is always a pair of two different data types, and is as such different from most of the other relevant work, leaving G(•) asymmetric just as in Zagoruyko and Komodakis [33] and Vinyals et al. [31]. In contrast to the Type 3 similarity measures including [31], Arandjelović et al. also learn C(x̂, ŷ), which they call a fusion layer.
All similarity measures of Type 3 we found in the literature use a loss function that includes feedback from the binary operator part of S (C(x̂, ŷ)). In the case of SNNs, even if C(x, y) is non-symmetric (C(x, y) ≠ C(y, x)), the loss for each network would be equal, as the two networks are identical and share weights. That means that the ordering of the two data points being compared during training has no effect, i.e. the training effect of (x, y) is equal to that of (y, x). This saves a lot of time during training, as the training dataset can be halved without any negative effect on performance.
However, the implementation of C(x̂, ŷ) then decides how much training one needs to adapt a pre-trained model from classifying single data points to measuring similarity between them. One can view the process of starting with a pre-trained model for the dataset, then training the model with loss coming from C(x̂, ŷ), as adapting the model from classification to similarity measurement.
One way of creating a Type 3 similarity measure using a minimal amount of training would be to pre-train a network on classifying individual data points, then apply that network as the G(•) that feeds into C(x̂, ŷ) = ‖x̂ − ŷ‖₁ in a similarity measurement. Evaluation of such a similarity measure has not been reported in the literature, and it will be explored in the next section.
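A minimal sketch of this idea follows, with a linear softmax classifier standing in for the pre-trained network; the 1 − ½‖·‖₁ mapping, which turns the L1 distance between two softmax outputs into a similarity in [0, 1], is our own illustrative choice:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classifier_embedding(x, W):
    """Hypothetical pre-trained classifier used as G(x): its softmax
    output vector, not the argmax class label, is the embedding."""
    return softmax(W @ x)

def similarity(x, y, W):
    """Type 3 measure: 1 - 0.5 * ||x_hat - y_hat||_1, which lies in
    [0, 1] because the L1 distance between two softmax vectors is at most 2."""
    x_hat, y_hat = classifier_embedding(x, W), classifier_embedding(y, W)
    return 1.0 - 0.5 * np.abs(x_hat - y_hat).sum()

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 8))  # stand-in for weights learned during classification
x = rng.normal(size=8)
print(similarity(x, x, W))   # identical inputs give similarity 1.0
```

No pair-based training is involved at all: the only training happens on the single-point classification task, which is exactly the property explored in the next section.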

Method
The framework presented in Section 2 and the subsequent analysis of previous relevant work presented in Section 3 show that there are unexplored opportunities within research on similarity measures.
Given the initial motivation, we seek methods that work well in domains where acquiring domain knowledge is very resource demanding. This requires that as much as possible of the similarity measure S(x, y) = C(G(x), G(y)) is learned from data rather than modeled from domain knowledge. There are some exceptions to this, such as applying general binary operations, e.g. norms (L1 or L2), to the two data points (x̂ and ŷ) preprocessed by G(•). In these cases there is little domain expertise involved in designing C(x̂, ŷ) other than the intuition that the similarity of two data points is closely related to the norm of their difference.
The most promising types of similarity measures from this perspective are Type 3 and Type 4, where G(•) is learned in Type 3 and both C(x̂, ŷ) and G(•) are learned in Type 4. However, to test any new design we need reference methods to compare against. For reference, we chose to implement one Type 1 similarity measure, two Type 2 similarity measures (including Gabel et al.'s), and Chopra et al.'s Type 3 similarity measure. The Type 1 similarity measure weights each feature uniformly. The Type 2 measure is identical to the Type 1 similarity measure except that it uses a local similarity function for each feature, parametrized by statistical properties of the values of that feature in the dataset.
One unexplored direction for creating similarity measures is to build an SNN similarity measure (Type 3) by training G(•) as a classifier on the dataset later used for measuring similarity, and then using that trained G(•) to construct an SNN similarity measure. This is in contrast to the usual way of training SNNs (as seen e.g. in [6,5]), where the loss function is a function of pairs of data points, not single data points. The motivation for exploring this type of design is that it shows the similarity measuring performance of using networks pre-trained to classify data points directly as part of an SNN similarity measure. This will be detailed in Subsection 4.2.
Finally, we will explore Type 4 similarity measures, which have seen little attention in research so far. To make our design as symmetric as possible we will use the same design as SNNs for G(•) and introduce a novel design to also make C(x̂, ŷ) symmetric. That way our design is fully symmetric (invariant to the ordering of the input pair) and thus training becomes much more efficient. All details of this design will be shown in Subsection 4.3. Both of our proposed similarity methods implement G(•) as neural networks. The Type 4 design implements C(x̂, ŷ) as a combination of a static binary function and a neural network.

Reference similarity measures
As a reference for our own similarity measure we implemented several reference similarity measures of Type 1, Type 2 and Type 3. First we implemented a standard uniformly weighted global similarity measure (t 1,1 ), which can be defined as:

S(x, y) = Σ i w i · sim i (x i , y i ) / Σ i w i , for i = 1, . . . , M

where sim i (x i , y i ) denotes the local similarity of the i-th of M attributes. In t 1,1 all weights and local similarity measures are uniform, and not parametrized by the data.
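As an illustration, the uniformly weighted global similarity above can be sketched as follows. The choice of local similarity sim_i(a, b) = 1 − |a − b| assumes features normalized to [0, 1] and is an illustrative assumption, not something t 1,1 itself prescribes:

```python
import numpy as np

def t11_similarity(x, y):
    """Uniformly weighted global similarity over M normalized features.

    Assumes the local similarity sim_i(a, b) = 1 - |a - b|, a common
    choice for features scaled to [0, 1]; with uniform weights the
    weighted sum reduces to the mean of the local similarities.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    local = 1.0 - np.abs(x - y)   # sim_i for each of the M attributes
    return float(np.mean(local))  # uniform weights w_i cancel out
```

For example, identical points yield similarity 1.0, and the similarity decreases linearly with the mean per-feature distance.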
We extended this with a Type 2 similarity measure t 2,1 , based on the work of Abdel-Aziz et al. [1], where the local similarity measures are parametrized by the data of the corresponding features.
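A minimal sketch of such data-parametrized local similarities, using the observed feature range as the statistical parameter. Abdel-Aziz et al. derive their local measures from the feature distributions, so the exact form below (range-scaled distance) is a simplifying assumption:

```python
import numpy as np

def fit_local_similarities(data):
    """Parametrize each local similarity by the observed feature range.

    This is a simple stand-in for the statistical parametrization of
    t_{2,1}; the exact distribution-based form of Abdel-Aziz et al. is
    replaced here by an illustrative range-scaled local similarity.
    """
    data = np.asarray(data, dtype=float)
    spans = data.max(axis=0) - data.min(axis=0)
    spans[spans == 0] = 1.0  # guard against constant features

    def sim(x, y):
        local = 1.0 - np.abs(np.asarray(x, dtype=float)
                             - np.asarray(y, dtype=float)) / spans
        return float(np.mean(np.clip(local, 0.0, 1.0)))

    return sim
```

The returned function has the same signature as t 1,1 above but its local measures are fitted to the dataset.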
Furthermore, we implemented the Type 2 similarity measure gabel as described by Gabel et al. [10]. The architecture of gabel can be seen in Figure 3.

Lastly, we implemented the Type 3 similarity measure chopra described by Chopra et al. [6]. We did not implement the extensions of the contrastive loss function seen in [4,20], as the resulting change in the training dataset would be too big. This change would make comparisons between the methods harder to justify. Also, none of these works showed any comparison with previous SNNs in terms of increased performance relative to regular contrastive loss.

Type 3 similarity measure
In this subsection we detail how we model the Type 3 similarity measure t 3,1 , which uses an embedding function G(•) trained as a classifier. This embedding function maps the input point x to an embedding space (see Figure 2) whose dimensions represent the probabilities of x belonging to each class. We then model the similarity between two points as a static function C(•) of their two respective embeddings.
For this we choose the L2 norm. Substituting the L2 norm for C(•) in Equation 1, C(x̂, ŷ) = ||x̂ − ŷ|| 2 , we can redefine Equation 1 to be:

S(x, y) = 1 − ||G(x) − G(y)|| 2

where G(•) outputs the modeled solution as an n-dimensional vector (the feature vector fed from the network to the softmax function for n classes) for the case, based on the problem attributes of data point x. This means that if G(•) evaluates the two cases as very similar in terms of classification, G(x) ≈ G(y) and ||G(x) − G(y)|| ≈ 0, then S(x, y) ≈ 1.0. This architecture is also illustrated in Figure 4.
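A sketch of t 3,1 in this spirit follows. The normalization by √2 (the largest possible L2 distance between two probability vectors) is our own assumption to keep the output in [0, 1], not necessarily the exact scaling used in the implementation:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1-D score vector
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))
    return e / e.sum()

def t31_similarity(G, x, y):
    """Type 3 similarity from a pre-trained classifier G.

    G maps a data point to its class-probability (softmax) vector;
    similarity is derived from the L2 norm between the two vectors.
    Dividing by sqrt(2), the maximal L2 distance between two
    probability vectors, is an assumed normalization.
    """
    return 1.0 - float(np.linalg.norm(G(x) - G(y))) / np.sqrt(2.0)
```

Any classifier exposing its softmax output can be plugged in as G; no pair-based training is needed.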


Fig. 4: Architecture of the t 3,1 similarity measure where G(•) is trained to output softmax vectors for classification and the similarity is calculated as a modeled L2 norm between these two vectors (Type 3).
Following the model for the t 3,1 similarity measure, we define the loss estimate as the log-loss between G(x) = x̂ and t, where t is the true classification softmax vector and x̂ is the class probability vector output by G(x). Notice that the error estimate of t 3,1 does not depend on the output of C(x̂, ŷ).
A dataset of size N would then be defined as:

D = {(x 1 , t 1 ), (x 2 , t 2 ), . . . , (x N , t N )}

where x N is the problem part of the N-th data point and t N is the solution/target part of the N-th data point.
If the relation between the problem part of the data point (x) and the solution part of the data point (t) is complex, the network architecture needs to be able to represent that complexity and any generalizations of patterns in it.

Type 4 similarity measure
As previously explained, Type 4 similarity measures are currently the most unexplored type of similarity measure. It is also the type that requires the least amount of modeling. In principle, a Type 4 similarity measure learns two things: G(•) learns a useful embedding, where the most useful parts of x and y are encoded into x̂ and ŷ, and C(x̂, ŷ) learns how to combine those embeddings to calculate the similarity of the original x and y.
In Type 4 similarity measures both C(x̂, ŷ) and G(•) are learned. In our Type 4 similarity method we use an ANN to represent both G(•) and C(x̂, ŷ). This has the advantage that learning S(x, y) is an end-to-end process: the loss computed after C(x̂, ŷ) can be used to compute gradients for both C(x̂, ŷ) and G(•). C(x̂, ŷ) will learn the binary combination best suited to calculate the similarity of the two embeddings, while G(•) will learn to embed the two data points optimally for calculating their similarity in C(x̂, ŷ). In principle any ML method could be used to learn G(•) and C(x̂, ŷ), but not all ML methods lend themselves naturally to back-propagating the error signal from C(x̂, ŷ) through G(•) and back to the input.
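The forward pass of such an end-to-end design can be sketched with plain NumPy. Layer sizes and the sigmoid output are illustrative assumptions, and the weights here are random (untrained); in practice both weight sets would be fitted by back-propagation through C and G:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared embedding network G (one hidden layer; the same weights serve
# both inputs) and combiner C whose first "layer" is the element-wise
# absolute difference. All sizes are illustrative assumptions.
W_g = rng.normal(size=(4, 8))
b_g = np.zeros(8)
W_c = rng.normal(size=8)
b_c = 0.0

def G(v):
    # embedding function, shared between the two branches
    return np.tanh(np.asarray(v, dtype=float) @ W_g + b_g)

def S(x, y):
    z = np.abs(G(x) - G(y))       # ABS(x_hat - y_hat), symmetric in x, y
    logit = float(z @ W_c + b_c)  # learned part of C operating on z
    return 1.0 / (1.0 + np.exp(-logit))  # similarity score in (0, 1)
```

Because the first layer of C is the absolute difference, S(x, y) = S(y, x) holds by construction, regardless of the learned weights.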
We define our Type 4 similarity method, Extended Siamese Neural Network (eSNN), as shown in Figure 5.
Given that this similarity method outputs similarity and the loss function is a function of the input, we get a new general loss function for similarity, defined per data point as the loss L s (x, y, s) between the predicted similarity S(x, y) and the true similarity s of cases x and y. Since this loss function depends on pairs of data points and the true similarity between them, we need to create a new dataset based on the original dataset. This new dataset consists of triplets of two different data points from the original dataset and the true similarity of these two data points:

D = {(x 1 , y 1 , s 1 ), (x 2 , y 2 , s 2 ), . . .}

where s i is 1 if x i and y i belong to the same class and 0 otherwise. It is worth mentioning that this dataset is of size N (N − 1) if the similarity measure is to train on all possible ordered combinations of the N data points. Certain similarity measure architectures (e.g. gabel from Gabel et al. [10] or Zagoruyko et al.'s similarity measures [33]) need to train on a dataset containing all possible combinations of data points (of size N (N − 1)), as training on the triplet (x, y, s) does not guarantee that the model learns that S(y, x) = s; thus the training dataset must also include the triplet (y, x, s). However, this may be largely avoided by using architectures (such as those seen in SNNs and SNs) that exploit symmetry and weight sharing. To achieve this we modeled C(x̂, ŷ) as an ANN where the first layer is an absolute difference operator on the two vectors: z = ABS(x̂ − ŷ), where z is the element-wise absolute difference between x̂ and ŷ. The rest of C(x̂, ŷ) consists of hidden ANN layers that operate on z. This way C(x̂, ŷ) becomes invariant to the ordering of the inputs to S(x, y). Consequently the model only needs to train on order-invariant unique pairs of data points, reducing the training set size from N (N − 1) to N (N − 1)/2. The resulting architecture of eSNN can be seen in Figure 5.
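Constructing the triplet dataset of order-invariant unique pairs can be sketched as follows (the helper name is our own):

```python
from itertools import combinations

def make_pair_dataset(X, labels):
    """Build the triplet training set {(x, y, s)} from a labeled dataset.

    Because an order-invariant measure such as eSNN satisfies
    S(x, y) = S(y, x) by construction, only the N*(N-1)/2 unordered
    pairs are needed; an order-sensitive model would instead require
    all N*(N-1) ordered pairs.
    """
    triplets = []
    for i, j in combinations(range(len(X)), 2):
        s = 1.0 if labels[i] == labels[j] else 0.0
        triplets.append((X[i], X[j], s))
    return triplets
```

For N = 4 data points this yields 6 triplets rather than the 12 ordered pairs an asymmetric architecture would need.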
In Subsection 4.2 we argue why G(•) trained to correctly classify its input is a good embedding function.

This also introduces an opportunity for exploring the relative importance of the embedding function G(•) and the binary similarity function C(•) for the performance of the similarity measure. This can be done by weighting the three different loss signals (x̂, ŷ and similarity, as shown in Figure 5) during training and measuring the effect of that weighting on performance. We define our weighted loss function as:

L(x, y, s) = α · L s (x, y, s) + (1 − α) · (L c (x̂, t x ) + L c (ŷ, t y ))

where L s (•) is defined in Equation 5, t x is the true label of data point x, t y is the true label of data point y, and L c (v 1 , v 2 ) is the categorical cross-entropy loss between two softmax vectors. We tested 100 different values of α in the range [0, 1] to find the weighting scheme best for performance. The results can be seen in Figure 6.
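Assuming the weighting enters as a convex combination, one plausible reading of the scheme above can be sketched as (the exact placement of α is an assumption):

```python
import numpy as np

def cross_entropy(p, t):
    # categorical cross-entropy between softmax vector p and one-hot t;
    # the small epsilon guards against log(0)
    return float(-np.sum(np.asarray(t, dtype=float)
                         * np.log(np.asarray(p, dtype=float) + 1e-12)))

def weighted_loss(alpha, sim_loss, x_hat, t_x, y_hat, t_y):
    """Alpha-weighted combination of the similarity loss and the two
    per-branch classification losses; an illustrative sketch, not the
    authors' exact formula."""
    return alpha * sim_loss + (1.0 - alpha) * (
        cross_entropy(x_hat, t_x) + cross_entropy(y_hat, t_y))
```

With alpha = 1 only the similarity signal trains the network; with alpha = 0 only the two classification signals do.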

Network parameters
For all similarity measures tested using ANNs, and for all datasets except MNIST, G(•) and C(•) were implemented with two hidden layers of 13 nodes. This was done to replicate the network parameters used by Gabel et al. to ensure comparable results. For the MNIST dataset both chopra and eSNN used three hidden layers of 128 nodes for G(•), and the same for C(•). Besides the network architecture, we also had to choose an optimizer for learning the ANN model. We chose RProp [25] to be more comparable with the results of Gabel et al., who also used the RProp optimizer. Our tests, seen in Figure 7, show that RProp outperforms the other optimizers tested (ADAM and RMSProp). This is consistent with the results reported by Florescu and Igel [9]. This should hold true until the run-time performance of RProp degrades with dataset size, as RProp uses full batches.

Evaluation protocol and implementation
The different similarity measures presented earlier in this section require different training datasets. The reference Type 1 similarity measure (t 1,1 ) requires no training, while t 2,1 and t 3,1 do not require a similarity training set consisting of triplets as described in Equation 6. All other similarity measures evaluated were trained using identical training datasets. As a result, all similarity measures were trained on a dataset consisting of all possible combinations of data points (as explained in Subsection 4.3), as this is required by the gabel similarity measure. However, results highlighting the differences in training performance when using the different training datasets can be seen in Figure 13. The results reported in the next section are all 5-fold stratified cross validation repeated 5 times for robustness. The performance reported is an evaluation of each similarity measure using the part of the dataset (validation partition) that was not used for training. Using the similarity measure being evaluated, we computed the similarity between every data point in the validation partition (V) and every data point in the training partition (T). For each validation data point (x v ∈ V) we find the data point in the training set T with the highest similarity (x t = arg max x i ∈T S(x v , x i )). If x t has the same class as x v , we scored it as 1.0; if not, we scored it as 0.0.
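The retrieval scoring loop described above can be sketched as follows, with S standing in for any of the evaluated similarity measures:

```python
def retrieval_score(S, train, validation):
    """Score a similarity measure by 1-NN retrieval: for each validation
    point, find the most similar training point and check whether the
    classes match.

    `train` and `validation` are lists of (features, label) pairs and
    `S` is the similarity function under evaluation.
    """
    hits = 0
    for xv, cv in validation:
        # retrieve the training case with the highest similarity to xv
        best_features, best_label = max(train, key=lambda item: S(xv, item[0]))
        hits += 1 if best_label == cv else 0
    return hits / len(validation)
```

The returned fraction of correct retrievals corresponds directly to the per-point 1.0/0.0 scoring described above.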
The implementation was done in Keras1 with TensorFlow as backend. The methods were measured on 14 different datasets available from the UCI machine learning repository [8]. Results were recorded after 200 epochs and 2000 epochs (the latter number to be consistent with Gabel et al. [10]) to reveal how fast the different methods achieved their performance.

Experimental evaluation
To calculate the performance of our similarity measure we chose the same evaluation method as Gabel et al. [10] to make the similarity metrics more easily comparable. In addition, this evaluation method, using publicly available datasets from the UCI machine learning repository [8], makes the results easy to reproduce. We selected a subset of the original 19 datasets, choosing not to use the regression datasets, resulting in a set of 14 classification datasets. The datasets' numerical features were all normalized, and categorical features were replaced by one-hot vectors.
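This preprocessing step can be sketched as follows for numeric columns plus one categorical column; the helper name and layout are our own:

```python
import numpy as np

def preprocess(numeric, categories):
    """Min-max normalize numeric columns and one-hot encode one
    categorical column, as described for the UCI datasets.

    `numeric` is a list of rows of numeric features; `categories` is
    the corresponding list of categorical values.
    """
    numeric = np.asarray(numeric, dtype=float)
    lo, hi = numeric.min(axis=0), numeric.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid /0 on constant columns
    scaled = (numeric - lo) / span
    values = sorted(set(categories))
    onehot = np.array([[1.0 if c == v else 0.0 for v in values]
                       for c in categories])
    return np.hstack([scaled, onehot])
```

Each row of the result is a fully numeric feature vector in [0, 1], suitable as input to any of the similarity measures above.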
The validation losses from evaluating the similarity measures on the 14 datasets are shown in Figures 8 and 9. Figure 8 shows the results after training for 200 epochs, while Figure 9 shows the results after 2000 epochs. This was done to illustrate how the differences between the similarity measures develop during training. Note that the 200 and 2000 epoch runs are independent runs (i.e. Figure 9 does not show the same models as Figure 8 after another 1800 epochs). The numbers behind these figures are also reported in Table 2 for 200 epochs and Table 3 for 2000 epochs. The tables are highlighted to show the best result per dataset. In some cases the difference between two methods on a dataset was smaller than the standard deviation, in which case more than one result is highlighted.
Finally, to illustrate that eSNN scales to larger datasets, we report results from the MNIST dataset in Figure 10. The MNIST results are not validation results, as calculating the similarity between all the data points in the test set and the training set (as per the evaluation protocol described in Subsection 4.5) was too resource demanding.
Table 2 shows the validation losses of the different similarity measures on the different datasets. Our proposed Type 4 similarity measure eSNN has 11% less validation loss than the second best (Type 3) similarity measure chopra (Chopra et al. [6]). The other learned similarity measures follow, with t 3,1 having 51% higher loss and gabel (Gabel et al. [10]) 52% higher loss. The Type 1 similarity measure had 61% higher loss, but managed to be the best similarity measure for the glass dataset. Lastly, the Type 2 similarity measure had 69% higher loss than eSNN on average.
The results when training for 2000 epochs are quite different from those at 200 epochs, as seen by how much closer the other similarity measures are in Figure 9 than in Figure 8. eSNN still outperforms all other similarity measures on average, but the second best similarity measure t 3,1 is much closer, with just 6.9% higher loss. To illustrate the difference in training efficiency between the different types of similarity measures, we show the validation loss for gabel, chopra and eSNN during training. Specifically, for each epoch we test the loss of each similarity measure by the same method as described in Subsection 4.5. Figure 11 and Figure 12 show the validation loss during training of eSNN, chopra and gabel on the UCI Mammographic mass and Iris datasets [8] respectively. This exemplifies the training performance of these methods in relation to the Mammographic mass and Iris results reported in the tables above. Note that during training on the Mammographic mass dataset, as seen in Fig. 11, chopra does not reach the performance of eSNN. In contrast, while training on the Iris dataset (as seen in Fig. 12), which is a less complex dataset than the Mammographic mass dataset, chopra quickly achieves the same performance as eSNN.
Finally, in Figures 14 and 15 we show how eSNN can be used for semi-supervised clustering. The figures show PCA and T-SNE clusterings of embeddings produced by untrained and trained eSNN networks respectively on the MNIST dataset. The embeddings are the vector outputs of G(•) for each of the data points in the test set, which was not used for training. The figures show that eSNN learns to correctly cluster data points that it has not trained on.
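Projecting the G(•) embeddings for such a visualization can be sketched with a plain PCA via SVD (T-SNE, as used for the figures, would require an external library):

```python
import numpy as np

def pca_2d(embeddings):
    """Project embedding vectors (e.g. the outputs of G for each test
    point) onto their first two principal components via SVD, the kind
    of 2-D view used for cluster visualization."""
    E = np.asarray(embeddings, dtype=float)
    E = E - E.mean(axis=0)            # center before computing components
    _, _, vt = np.linalg.svd(E, full_matrices=False)
    return E @ vt[:2].T               # coordinates in the top-2 PC basis
```

Each row of the result gives the 2-D coordinates of one embedding, which can then be scatter-plotted and colored by class label.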

Conclusions and future work
Section 5 shows that all of the learned similarity measures outperformed the classical similarity measure t 1,1 , and also t 2,1 , where the local (per feature) similarity measures were adapted to the statistical properties of the features [1]. In practice one should weight each feature according to how important it is for similarity measurement. In many situations the number of possible attributes to include in such a function can be overwhelming, and modeling them in the way we did in t 1,1 and t 2,1 also overlooks possible co-variations between the attributes. Both of these problems can be addressed by using the proposed methods to model the similarity using machine learning on a dataset that maps from case problem attributes to case solution attributes. However, one should note that all of the learned similarity measures are built on the assumption that similar data points have similar target values (δ s ≈ δ e ≈ δ p in Figure 2). If this assumption does not hold, learning the similarity measure might be much more difficult.
We have also presented a framework for analyzing and grouping different types of similarity measures. We have used this framework to analyze previous work and highlight strengths and weaknesses of the different types of similarity measures. The analysis also highlighted unexplored types of similarity measures, such as Type 4.
As a result we designed and evaluated a Type 3 similarity measure, t 3,1 , based on a classifier. The evaluations showed that using a classifier as a basis for a similarity measure achieves results comparable to state of the art methods, while using far fewer training evaluations to reach that performance. We then combined strengths of Type 3 and Type 4 similarity measures into a new Type 4 similarity measure, called Extended Siamese Neural Networks (eSNN), which:
- Learns an embedding of the data points using G(•) in the same way as Type 3 similarity measures, but with shared weights in the same way as SNNs to make the operation symmetric.
- Learns C(x̂, ŷ), thus enabling better performance than SNNs and other Type 3 similarity measurements.
- Restricts C(x̂, ŷ) to make it invariant to input ordering, thereby obtaining end-to-end symmetry through the similarity measure.
Keeping eSNN symmetric end-to-end enables the user of this similarity measure to train on much smaller datasets than required by other types of similarity measures. Type 3 measures based on SNNs also have this advantage, but our results show that the ability to learn C(x̂, ŷ) is important for performance on many of the 14 datasets we tested. Our results showed that eSNN outperformed state of the art methods on average over the 14 datasets by a large margin. We also demonstrated that eSNN achieved this performance much faster, given the same dataset, than the current state of the art. In addition, the symmetry of eSNN enables it to train on datasets that are orders of magnitude smaller. Our case study of clustering embeddings produced by eSNN shows that the eSNN model can be used for semi-supervised clustering. Finally, we demonstrated that the training of this similarity measure scales to large datasets like MNIST. Our main motivation for this work was to automate the construction of similarity measures while keeping training time as low as possible, and we have shown that eSNN is a step towards this: our evaluation shows that it can learn similarity measures across a wide variety of datasets, and that it scales well in comparison to similar methods.
The applications for eSNN extend beyond serving as the similarity measure in a CBR system. It can also be used for semi-supervised clustering: training eSNN on labeled data, then using the trained eSNN to cluster unlabeled data. In much the same fashion it could be used for one-shot learning, using eSNN as a matching network in the same fashion as the distance measure is applied in Vinyals et al. [31].
In continuation of this work we would like to explore what is actually encoded by learned similarity measures. This could be done by varying the different features of a query data point q in S(x, q) and discovering when that data point changes from one class to another (i.e. when the class of the closest other data point changes); this forms a multi-dimensional boundary for each class. This boundary could be explored to determine what the similarity measure actually encoded during the learning phase.
Another interesting avenue of research would be to apply recurrent neural networks to embed time series into the embedding space (see Figure 2), enabling the similarity measure to calculate similarity between time series, which is currently a non-trivial problem.
The architecture of similarity measures still requires more investigation, e.g. is the optimal embedding from G(•) different from the softmax classification vector used in normal supervised learning? If so, it is worth investigating why it differs.

Fig. 3: Architecture of an ANN similarity measure as used in Gabel [10] (Type 2), where G(•) is set to be the identity function G(x) = I(x) = x.

Fig. 5: Architecture of eSNN, combining the symmetry of SNNs with the ability to learn C(x̂, ŷ). C(x̂, ŷ) is expanded in this figure to highlight the ABS(x̂ − ŷ) operation performed as the first step of C(x̂, ŷ), which keeps C invariant to the ordering of its inputs. The figure also illustrates the two additional loss signals to G(·) that help train the similarity measure.
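The order-invariance described in the Fig. 5 caption can be illustrated with a minimal sketch: a shared embedding G is applied to both inputs, and the element-wise absolute difference is taken before the similarity head C, so the output cannot depend on input order. The stand-in functions `G` and `C` below are hypothetical toys, not the learned networks from the paper.

```python
import numpy as np

def esnn_similarity(x, y, G, C):
    """Symmetric combination sketched in Fig. 5: shared embedding G,
    then C applied to ABS(G(x) - G(y)), giving sim(x, y) == sim(y, x)
    by construction."""
    x_hat = G(x)
    y_hat = G(y)
    return C(np.abs(x_hat - y_hat))

# Toy stand-ins for the learned networks (hypothetical):
G = lambda v: v * 2.0                  # embedding function
C = lambda d: float(np.exp(-d.sum()))  # similarity head

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
assert esnn_similarity(a, b, G, C) == esnn_similarity(b, a, G, C)
```

Because the absolute difference is taken before C, symmetry holds regardless of what C learns during training.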

Fig. 6: The results seem to indicate that α = 0.15 is ideal for this dataset; we have used α = 0.15 throughout the experiments reported in Section 5.

Fig. 7: Comparison of the RProp algorithm with ADAM and RMSProp. Our proposed architecture performs best with the RProp algorithm (5-fold cross-validation, repeated 5 times).

Fig. 9: Performance of eSNN compared to reference similarity measures and state-of-the-art similarity methods over all test datasets, trained for 2000 epochs.

Figure 13 shows the validation loss during training when chopra and eSNN use a training dataset of size N while gabel uses a training dataset of size N(N − 1). This figure illustrates how many fewer evaluations an SNN similarity measure like chopra, or a symmetric Type 4 similarity measure such as eSNN, needs.

Fig. 12: Validation retrieval loss during training on the Iris UCI ML dataset [8]. Since chopra starts out with very low validation loss, it seems probable that the static L1 norm C(x̂, ŷ) used by chopra is close to optimal for correctly identifying whether two data points belong to the same class. The performance increase shown by chopra is a slight optimization of G(·). The performance increase during training for gabel and eSNN comes mainly from learning a C(x̂, ŷ) equivalent in function to that used by chopra, and secondarily from a slight optimization of G(·). eSNN catches up to chopra in performance after around 20 epochs, whereas gabel takes longer (5% validation loss at 2000 epochs), as shown in Table 3.

Fig. 13: Validation retrieval loss during training on the balance dataset, illustrating the difference in the number of evaluations needed to achieve acceptable performance. chopra achieves good performance very quickly, but is soon outperformed by eSNN. Both reach very good performance after evaluating fewer (N) data points than gabel needs for a single epoch (N(N − 1)).
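The gap in training cost discussed above comes down to simple pair counting: per epoch, chopra and eSNN evaluate on the order of N data points, while gabel evaluates all N(N − 1) ordered pairs. A small sketch (the function name is ours, for illustration only):

```python
def evaluations_per_epoch(n, method):
    """Approximate evaluations per training epoch for a dataset of
    size n: chopra/eSNN scale with N, gabel with N(N - 1)."""
    if method in ("esnn", "chopra"):
        return n
    if method == "gabel":
        return n * (n - 1)
    raise ValueError(f"unknown method: {method}")

# For a modest dataset of 625 training points the difference is stark:
assert evaluations_per_epoch(625, "esnn") == 625
assert evaluations_per_epoch(625, "gabel") == 390_000
```

This quadratic-versus-linear growth is why chopra and eSNN can reach good performance before gabel has finished even one epoch.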

Fig. 14: PCA clustering showing the first two principal components (PCA1 and PCA2) of the embeddings produced by eSNN from MNIST input before (14a) and after (14b) training.
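A projection like the one in Fig. 14 can be computed by centering the embedding matrix and projecting onto its first two principal directions. A minimal NumPy sketch (the random embeddings stand in for the actual G(x) outputs, which we do not reproduce here):

```python
import numpy as np

def first_two_pcs(embeddings):
    """Project embeddings onto their first two principal components
    via SVD of the centered data matrix."""
    X = embeddings - embeddings.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by variance.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 8))   # stand-in for eSNN embeddings
pcs = first_two_pcs(emb)
assert pcs.shape == (100, 2)
```

Plotting the two columns of `pcs`, colored by class label, gives the kind of before/after visualization shown in panels 14a and 14b.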

Table 1: Different types of similarity measures in our proposed framework.