A new training strategy for spatial transform networks (STN’s)

Spatial transform networks (STNs) are widely used because they can transform images captured from different viewpoints to obtain a target image. These networks use an image captured from any viewpoint as input and the desired image as a label. Usually, these images are segmented, but this can lead to convergence problems if the percentage of overlap between the segmented images is low. In this paper, we propose a new training method that facilitates the convergence of an STN in these cases, even when there is no overlap between the object’s projections in the two images. The new strategy incorporates distance-transformed images into the training, thus increasing the useful image information that provides gradients in the loss function. It has been applied to a real case, with images of Caenorhabditis elegans, and to a simulated case, which uses artificial images to ensure that there is no overlap between the images used for the assays. In the assays carried out with these datasets, we show that training convergence is strengthened, reaching an IoU of 0.862 and 0.984, respectively, while, for the real case, the computational cost is maintained compared with the assay with segmented images.


Introduction
Since Jaderberg et al. [1] developed the spatial transformer network (STN), it has been employed to solve a multitude of tasks. In most cases, it is used as an intermediate phase that adapts the input images so that the problem can be tackled more efficiently. The most frequent problems in which this type of network is used include classification [1][2][3][4], object detection [5][6][7][8] and correspondence [9,10].
When they introduced STNs, Jaderberg et al. [1] presented their use for the classification task, in which the objective of the STN is to transform the input image to make the element under study more easily identifiable. They used datasets such as MNIST and Street View House Numbers (SVHN) to identify numbers, and the CUB-200-2011 birds dataset to identify the parts of birds. As with the first two datasets, STNs were also used in [2] to classify numbers printed on football players' jerseys. Beyond the classification of numbers, STNs have been used in many other areas, such as the classification of traffic signs in [3] or, in the field of medicine, the recognition of tumor cells in [4].
For the object detection task, the purpose of the STN is to transform the image so that the characteristics of the object to be identified are easier to recognize. For example, in [5] thermal images of power stations were transformed to detect their different elements, and in [6] people were detected in images captured with fisheye cameras. STNs have also been used in [7] to detect a section of an image that has some special characteristic, and in [8] for crowd counting, although in this case the STN was applied to a video dataset instead of an image dataset. To do so, the authors divide the video frames into blocks and seek to predict the trajectory of the people in each block. They then apply the corresponding transformation and compare the new activation map with the ground truth for the next frame; if it differs, the crowd has changed.
Finally, in the correspondence task, which is the application studied in this paper, the STN modifies the appearance of an image to transform it into another appearance used as ground truth. This application is explained in [9], which seeks to give a more realistic appearance to synthetic images of fruit flies, the zebrafish D. rerio and the worm C. elegans. To do so, segmented synthetic images are used as input and segmented real images as target, so that the STN calculates the transformation required to give the synthetic images the appearance of real images. The same problem is found in [10], where researchers try to obtain a 3D model of a face from images of the same person taken from different viewpoints.
The problem we study here is similar to that posed in [9], in that the STN is used to transform images in which the object of interest has been segmented. Unlike [9], however, we seek to transform images in which the overlap of the object's projections between the input image and the target image is not ensured.
Images of C. elegans have been used as a dataset. These were obtained using two micro-cameras with different viewpoints. One of them, referred to hereafter as Micro 1, captures images in which the worm is centered, while the second, which we call Micro 2, captures images in which the worm is displaced slightly toward the top right corner.
The principal objective of the STN is to give the images obtained by Micro 2 an appearance similar to those of Micro 1, in order to solve the correspondence problem and to be able to use these images in future works.
In our case, as seen in Fig. 1, the worm occupies only a small portion of the image, so, depending on its position, there may be no overlap between the projections of the two worms.
To address this, we first tried training with images of the distances to the edges of the object. The use of this type of image, which is novel in itself, increases the useful image information and allows convergence in these cases, but the precision is reduced.
For this reason, several new strategies have been proposed to favor convergence in these cases while avoiding this loss of precision. The one that gives the best result is a hybrid strategy that begins by training alternately with distance-transformed images and segmented images, and then switches to segmented images only, to achieve better precision once convergence is assured.

Network architecture
The structure of the STN is divided into two parts: a localization neural network, and a grid generator that uses the output of the former to obtain the grid and apply it to the corresponding images to obtain the desired image, as can be seen in Fig. 2.
The network employed for this study is the same as in [9]. Its first part is responsible for locating the pixels of the object of interest within the image. In turn, this localization network can be divided into two stages.
The first stage is made up of a sequence of convolutional layers, each followed by a maxpooling layer and a SeLU activation layer, except the last one, which is followed only by a SeLU activation layer. This stage therefore has a total of five convolutional layers, five SeLU activation layers and four maxpooling layers, in the order described. Table 1 summarizes the main characteristics of each layer.
The second stage then adapts the information obtained in the previous stage to define the parameters of the transformation to be performed. This stage is made up of two fully connected layers, between which a SeLU activation layer is placed. The result of the last layer is a vector of 12 elements, whose values are reorganized into a matrix of 3 rows and 4 columns that represents the transformation to be carried out (Eq. 1). This matrix serves as input for the grid generator.

Fig. 1 Images obtained from Micro 1 camera (left) and Micro 2 camera (right). These images are not exactly the images captured by the cameras, but have had a filter applied to increase the contrast

Finally, the second part of the STN uses the information provided by the localization neural network to obtain the grid to be applied to the input image in order to make it resemble the target image. In this case, two PyTorch functions have been used: “affine_grid” and “grid_sample”.
The former, “affine_grid”, is in charge of obtaining the grid. As inputs, this function takes the 3 × 4 matrix obtained in the localization network and the size of the grid to be generated, which must be the same as the size of the target images.
The latter, “grid_sample”, obtains the final transformed image from the grid and the input image through bilinear interpolation.
These two functions allow the calculation of two types of transformations: affine transformations and projective transformations. In our case, a projective transformation has been used, since it is the most appropriate for the images employed, which is why the “affine_grid” input has size 3 × 4; if an affine transformation is used, this matrix must have size 2 × 3.
Also, in the “grid_sample” function, the “padding_mode” parameter has been set to “border”, so that the colors of the edge pixels are extended to the pixels that fall outside the image instead of turning them black.
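To make the role of the grid generator more concrete, the following is a minimal pure-Python sketch of the two steps described above: building a sampling grid from a transformation matrix and sampling the input image bilinearly while replicating the edge pixels. It is not PyTorch's implementation and not the code used in the assays; for brevity it assumes a 2 × 3 affine matrix acting on normalized coordinates in [-1, 1], a single-channel image stored as nested lists, and a pixel-centre coordinate convention. The function names only mirror the PyTorch ones for readability.

```python
import math


def affine_grid(theta, height, width):
    """For each output pixel, return the normalized (x, y) input coordinate.

    theta is a 2x3 affine matrix acting on normalized coordinates in [-1, 1]
    (an assumption made for this sketch).
    """
    grid = []
    for i in range(height):
        y = (2 * i + 1) / height - 1  # pixel-centre convention
        row = []
        for j in range(width):
            x = (2 * j + 1) / width - 1
            row.append((theta[0][0] * x + theta[0][1] * y + theta[0][2],
                        theta[1][0] * x + theta[1][1] * y + theta[1][2]))
        grid.append(row)
    return grid


def grid_sample(image, grid):
    """Bilinear sampling; samples outside the image reuse the nearest edge pixel."""
    h, w = len(image), len(image[0])

    def pixel(r, c):
        # Clamp the indices so that out-of-image samples replicate the border.
        return image[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]

    out = []
    for row in grid:
        out_row = []
        for x, y in row:
            px = ((x + 1) * w - 1) / 2  # back to pixel coordinates
            py = ((y + 1) * h - 1) / 2
            c0, r0 = math.floor(px), math.floor(py)
            ax, ay = px - c0, py - r0
            top = (1 - ax) * pixel(r0, c0) + ax * pixel(r0, c0 + 1)
            bottom = (1 - ax) * pixel(r0 + 1, c0) + ax * pixel(r0 + 1, c0 + 1)
            out_row.append((1 - ay) * top + ay * bottom)
        out.append(out_row)
    return out
```

With the identity matrix the output reproduces the input, and a horizontal offset of 0.5 in normalized units on a 4-pixel-wide image shifts the content by one pixel, with the last column filled by replicating the edge.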

Simulator
A virtual camera simulator has been developed to generate synthetic images, in order to control the characteristics of the network input images. It should be noted that the main function of the simulator is not to expand the dataset of real images, but to generate an alternative dataset with which to analyze specific cases.
The simulator projects the points that define the object of interest by means of the projection matrix derived from the camera features (focal length, height, inclination...), so that the object appears in the resulting image as that camera would capture it.
First, the morphology of the object to be captured must be defined through the points of its silhouette. As the shape of the object is not a critical aspect, a rectangle was used to simplify its definition, since it can be defined by its four vertices. In addition to its shape, its size must also be defined; it was set to a number of pixels similar to that of C. elegans in the real images, i.e., approximately 80 × 300 pixels.
Subsequently, to give the images greater variability, certain parameters were established to randomly change the characteristics of the rectangles. First, their size can vary from 50% to 150% of the base size in each direction independently. Second, the rectangle, which is originally oriented vertically, is randomly rotated around its center. Finally, since in the real images the worm in Micro 1 is not fully centered, a small random displacement has also been added along each of the two axes.
Once the object represented by the simulator has been defined, the next step is to determine the projection matrices of the two cameras. To do this, the cameras were calibrated, ensuring that the images produced have an appearance similar to the real images.
Consequently, the simulator is capable of generating images of rectangles with a random position and shape captured from the perspective of the two cameras. A sample of these images is shown in Fig. 3.
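The steps described in this section can be sketched as follows. This is only an illustration under stated assumptions, not the simulator used in the assays: the rectangle is assumed to lie on the plane z = 0, the displacement range `max_shift` and the full-turn rotation range are hypothetical values, and the 3 × 4 projection matrix `P` passed to `project` stands in for the calibrated camera matrices, which are not given in the text.

```python
import math
import random


def random_rectangle(base_w=80.0, base_h=300.0, max_shift=20.0, rng=random):
    """Return the four vertices (x, y) of a randomly scaled, rotated and shifted rectangle."""
    w = base_w * rng.uniform(0.5, 1.5)        # 50%..150% of the base size,
    h = base_h * rng.uniform(0.5, 1.5)        # independently in each direction
    angle = rng.uniform(0.0, 2.0 * math.pi)   # random rotation about the centre (assumed range)
    dx = rng.uniform(-max_shift, max_shift)   # small off-centre displacement (assumed range)
    dy = rng.uniform(-max_shift, max_shift)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    corners = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cos_a * x - sin_a * y + dx, sin_a * x + cos_a * y + dy)
            for x, y in corners]


def project(P, point_xy):
    """Project a point of the plane z = 0 with a 3x4 projection matrix P."""
    x, y = point_xy
    world = (x, y, 0.0, 1.0)  # homogeneous world coordinates
    u, v, s = (sum(P[r][k] * world[k] for k in range(4)) for r in range(3))
    return (u / s, v / s)     # perspective division
```

Projecting the same four vertices with the matrix of each camera would give the pair of polygons to be filled in the Micro 1 and Micro 2 images.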

Dataset
Two datasets are used: the dataset of real images and the dataset of simulated images mentioned in the previous section. Both are analyzed separately.

Dataset of C. elegans
The C. elegans dataset was obtained in the laboratory with the image capture system described in [11]. Two cameras (Micro 1 and Micro 2) were used to capture the images, both with a resolution of 1944 × 2592 pixels.
The images used as network inputs (Micro 2) maintain their original capture size. However, to reduce memory use and ensure that all the information present in the target image is found in the input image, the Micro 1 images have been cropped to half their size in each dimension, so the target image size is 972 × 1296 pixels.
Given the costly capture process, the C. elegans dataset has a total of 975 pairs of images. These image pairs, similar in appearance to those shown in Fig. 1, were processed with the OpenCV library to obtain the segmented and distance-transformed images.
Depending on the assay, the dataset may be formed by pairs of segmented C. elegans images (Fig. 4a, b), by pairs of distance-to-the-edge transformed C. elegans images (Fig. 4c, d), or by both types.

Synthetic dataset
The synthetic dataset has 1000 pairs of images, approximately the same number as the C. elegans dataset. As in that dataset, there are two types of images: segmented images and distance-transformed images.
By default, the simulator returns the segmented images of the rectangles. To obtain the distance-transformed images, an additional processing stage was required. A sample of the resulting images is shown in Fig. 5a-d. These images have been grouped in the same way as in the C. elegans dataset, depending on the case studied.

Metrics
Different metrics were selected to estimate the performance of the assays. All of them are computed on segmented images, since the main objective is to transform the Micro 2 image so that the segmentations of the worms or the rectangles coincide.
First, because of the small size of the object of interest, the number of non-coincident pixels between the two images was counted (Eq. 2), and the mean, maximum and minimum values over all pairs of images were recorded. In this way, the error was better characterized.
$$\text{Non-coincident pixels}(X, Y) = |X \cup Y| - |X \cap Y| \tag{2}$$
where “X” and “Y” are the resulting segmented image and the target segmented image, respectively.
Second, we decided to use the IoU (Intersection over Union) metric, since it reflects exactly what we set out to achieve. This metric calculates the ratio between the intersection and the union of the elements of the two segmented images, as reflected in Eq. 3.
$$\mathrm{IoU}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \tag{3}$$
where “X” is the segmented worm of the transformed image and “Y” is the segmented worm of the target image. This ratio gives the fraction of overlap between the worms in the two images.
Lastly, in the assays with C. elegans images the training time has also been monitored to account for the computational cost, since an alternative that is good in terms of precision could be unfeasible if this cost is too high.
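As an illustration of the two metrics described above, the following sketch computes them on binary masks stored as nested lists. The function names are hypothetical, and the convention of returning 1.0 when both masks are empty is an assumption of this sketch rather than something stated in the text.

```python
def non_coincident_pixels(x, y):
    """Count the pixels whose label differs between two binary masks."""
    return sum(bool(a) != bool(b)
               for row_x, row_y in zip(x, y)
               for a, b in zip(row_x, row_y))


def iou(x, y):
    """Intersection over Union of two binary masks of the same size."""
    pairs = [(bool(a), bool(b))
             for row_x, row_y in zip(x, y)
             for a, b in zip(row_x, row_y)]
    intersection = sum(a and b for a, b in pairs)
    union = sum(a or b for a, b in pairs)
    return intersection / union if union else 1.0  # empty masks: assumed convention


transformed = [[0, 1, 1],
               [0, 1, 0]]
target = [[0, 0, 1],
          [0, 1, 1]]
print(non_coincident_pixels(transformed, target), iou(transformed, target))
```

For binary masks the first quantity equals the size of the union minus the size of the intersection, so both metrics can be derived from the same two counts.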

Hyper-parameter tuning
Hyper-parameter tuning is a fundamental task to ensure the convergence of a network. In our case, the hyper-parameters were defined taking into account the limitations of the network and the GPU.
The first hyper-parameter to be fixed was the batch size, which was set to eight: given the large size of the images, this was the maximum value that fit within the memory of the GPU used.
Regarding the optimizer, Adam (Eq. 4) was chosen to perform the optimization; its learning rate was defined with one of the assays carried out.
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,\nabla_\theta L(\theta_{t-1}), \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\big(\nabla_\theta L(\theta_{t-1})\big)^2$$
$$\theta_t = \theta_{t-1} - \eta\,\frac{m_t/(1-\beta_1^t)}{\sqrt{v_t/(1-\beta_2^t)} + \epsilon} \tag{4}$$
where “β1” and “β2” are the attenuation coefficients of the first- and second-order moments “m” and “v”, respectively (values of 0.9 and 0.99 are used), “ε” is a very small constant that avoids division by zero, “θ” are the parameters of the network, “t” indicates the current iteration, “η” is the learning rate and “L” represents the loss function.
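For reference, a single Adam update for one scalar parameter can be sketched as below. This follows the standard Adam formulation with bias correction; the attenuation coefficients 0.9 and 0.99 come from the text, whereas the value of `eps` is an assumption, since the text only says that it is very small.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.99, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based iteration index."""
    m = beta1 * m + (1 - beta1) * grad        # first-order moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-order moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

On the first iteration the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by approximately the learning rate regardless of the gradient magnitude.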
To define the stopping criterion, two different methods were followed depending on the dataset used. If the assay used the synthetic dataset, the training stopped after 150 epochs. If the assay worked with the C. elegans dataset, it stopped either when a given level of convergence was reached, corresponding to an IoU of 0.86 (a value very close to the upper limit of precision), or when the training was unable to improve further, that is, when it failed to improve during the last 20 epochs.
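The stopping rule for the C. elegans assays can be sketched as a small helper. The thresholds (IoU of 0.86, 20 epochs without improvement) are taken from the text; that the monitored quantity is the per-epoch IoU, and the function name, are assumptions of this sketch.

```python
def should_stop(iou_history, target_iou=0.86, patience=20):
    """Decide whether to stop, given the IoU recorded after each epoch (oldest first)."""
    if not iou_history:
        return False
    if iou_history[-1] >= target_iou:
        return True  # required level of convergence reached
    if len(iou_history) > patience:
        best_before = max(iou_history[:-patience])
        # Stop when none of the last `patience` epochs improved on the earlier best.
        return max(iou_history[-patience:]) <= best_before
    return False
```

For the synthetic dataset no such rule is needed, since training simply runs for a fixed 150 epochs.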
The last hyper-parameter to be fixed was the loss function. To train the STN, the Mean Square Error (MSE) loss was used, given by Eq. 5.
$$\mathrm{MSE}(X, Y) = \frac{1}{N\,M\,C}\sum_{n=1}^{N}\sum_{m=1}^{M}\sum_{c=1}^{C}\big(X(n,m,c) - Y(n,m,c)\big)^2 \tag{5}$$
where “n” and “m” index the pixels along each of the image axes (of sizes “N” and “M”), “c” is the channel of the images (out of “C”) and “X(n, m, c)” and “Y(n, m, c)” are the pixel values of the transformed image and the target image, respectively. This function calculates the mean of the squared differences between the values of all the pixels of the two images, thus evaluating how similar they are.
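The MSE loss described above can be written directly from its definition. The sketch below assumes images stored as nested lists indexed as [row][column][channel]; in the actual training a tensor library would compute the same quantity.

```python
def mse_loss(x, y):
    """Mean of the squared pixel differences between two images of equal shape."""
    total, count = 0.0, 0
    for row_x, row_y in zip(x, y):
        for pixel_x, pixel_y in zip(row_x, row_y):
            for a, b in zip(pixel_x, pixel_y):
                total += (a - b) ** 2
                count += 1
    return total / count


transformed = [[[0.0], [1.0]]]  # 1 x 2 image with one channel
target = [[[1.0], [1.0]]]
loss = mse_loss(transformed, target)
```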

Training methods, experiments and results
This section presents the methods developed, together with the assays that have been performed and the results obtained. First, the assays in which only one type of image (segmented or distance-transformed) is used are compared to determine which produces better results; then, some alternatives are considered to improve the training performance.
Segmented image method compared to distance-transformed image method

Synthetic dataset
These two cases have been studied using the synthetic dataset to analyze some specific cases. For these assays, the computational cost was not calculated, as the aim was to find the method that provides the best performance in terms of precision and robustness. Algorithm 1 shows the procedure of a training step for these assays.
Apart from comparing the type of image used (segmented or distance-transformed), these assays have also served to adjust the learning rate of the optimizer. For this reason, assays have been carried out varying this value for both strategies.
Specifically, three different values were chosen for this parameter: 5e-5, 1e-4 and 5e-4. All these assays were repeated three times to estimate the mean of the results, which are shown in Table 2.
Table 2 clearly shows that working with segmented images yields better precision; moreover, this precision is better than that obtained in any of the assays with distance-transformed images. This large difference may be mostly due to the fact that the distance-transformed images do not fully represent the real distance, but rather an approximation with interpolations.
Regarding the learning rate, it is remarkable that with the highest value the network is unable to converge when working with segmented images, while it does converge when working with distance-transformed images.
For the remaining values, although the differences are not as large as those between the types of images, the learning rate of 1e-4 gave better results in the segmented-image assays. In the assays using distance-transformed images, the change in learning rate made practically no difference. Therefore, the learning rate selected was 1e-4.
In addition to these assays, and to demonstrate which method provides greater robustness, an assay was performed with rectangles that do not overlap in any of the image pairs from the Micro 1 and Micro 2 cameras. To carry out this assay, the simulator configuration was maintained and only the size of the rectangles was reduced until overlap was clearly avoided, as shown in Fig. 6.
As shown in Table 3, when there is no overlap the assay carried out with segmented images is unable to converge, while the assay with distance-transformed images can converge thanks to the extra information provided by the distances to the edges of the objects of interest.
These examples show empirically that using distance-transformed images is more robust and can help the network to converge. By contrast, if convergence is achieved, the segmented-image dataset provides greater accuracy.
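The following one-dimensional toy example illustrates why distance information helps when there is no overlap. It is a simplification under stated assumptions and not the processing used in the assays: the "images" are 1-D binary masks, and the distance map is taken as the distance from each position to the nearest object pixel, which is only one possible form of a distance-transformed image.

```python
def mask(length, start, width):
    """1-D binary mask with an object of the given width starting at `start`."""
    return [1.0 if start <= i < start + width else 0.0 for i in range(length)]


def distance_map(m):
    """Distance from every position to the nearest object pixel (0 inside the object)."""
    obj = [i for i, value in enumerate(m) if value]
    return [float(min(abs(i - j) for j in obj)) for i in range(len(m))]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def run_demo():
    target = mask(40, 30, 4)
    rows = []
    for start in (0, 5, 10):  # the moved object never overlaps the target
        moved = mask(40, start, 4)
        rows.append((start,
                     mse(moved, target),
                     mse(distance_map(moved), distance_map(target))))
    return rows


rows = run_demo()
for start, segmented_loss, distance_loss in rows:
    print(start, segmented_loss, distance_loss)
```

While the objects do not overlap, the loss on the binary masks stays constant as the object moves, so it gives no indication of the direction in which to move it; the loss on the distance maps decreases as the object approaches the target.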

Dataset of C. elegans
With the conclusions obtained in the assays with synthetic images, the first assay carried out with the C. elegans dataset verified whether the network can converge using segmented images. As with the synthetic dataset (Tables 2, 3), these assays were repeated to obtain more reliable results, which can be seen in Table 4.
As shown in Table 4, this method can achieve convergence, reaching an IoU of 0.863. Despite this, it was also observed that in some assays the network was unable to converge; this may be due to the low initial overlap between the two images and to the small batch size. Figure 7a, b shows the loss curves of an assay that achieved convergence and of another that did not, respectively.

New training methods
For this reason, to take advantage of the robustness of the distance-transformed images without giving up the precision of the segmented images, we decided to design a new mixed training strategy in which both types of images are used.
The first new strategy consists of using distance-transformed images in the first training epochs and, when a certain level of precision is reached, switching to the segmented images to achieve greater precision (Fig. 8). This training method is shown in Algorithm 2. In this way, we attempt to avoid the convergence problems. The change of dataset was made when the loss function returned a value below 0.017, which corresponds to an IoU of around 0.6.
In the mixed method, apart from the change of dataset, the optimizer was also changed since, although the same loss function was used in both cases, the optimization problem changes with the type of image. We therefore used two optimizers, both of type Adam and with a learning rate of 1e-4.
After seeing the results of the first method, a second new strategy, a modified mixed strategy, was proposed to improve convergence. As in the mixed strategy, the first epochs are treated differently from the rest. In this case, however, the network does not train solely with distance-transformed images during these early epochs; instead, epochs with distance-transformed images and epochs with segmented images are intercalated (Fig. 10). This new training method is shown in Algorithm 3.
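The choice of image type per epoch in the two strategies can be sketched as follows. This is not a transcription of Algorithms 2 and 3: the switching threshold of 0.017 is taken from the text, while the class and function names, the choice of starting the intercalated phase with a distance-transformed epoch, and the length of that phase (`intercalated_epochs=10`) are assumptions made for illustration. The returned label would also select which of the two Adam optimizers to use.

```python
class MixedSchedule:
    """Mixed method: distance-transformed images until the loss drops below a
    threshold, then segmented images for the rest of the training."""

    def __init__(self, switch_loss=0.017):
        self.switch_loss = switch_loss
        self.switched = False

    def image_type(self, last_loss):
        # Once the switch has happened it is never undone, even if the loss rises.
        if last_loss is not None and last_loss < self.switch_loss:
            self.switched = True
        return "segmented" if self.switched else "distance"


def modified_mixed_image_type(epoch, intercalated_epochs=10):
    """Modified mixed method: alternate both image types during the first epochs
    (0-based epoch index), then train with segmented images only."""
    if epoch < intercalated_epochs:
        return "distance" if epoch % 2 == 0 else "segmented"
    return "segmented"


schedule = MixedSchedule()
first_epochs = [modified_mixed_image_type(e) for e in range(12)]
```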

Experiments and results
For each of these methods, a set of experiments has been carried out with the C. elegans dataset, as shown in Tables 5 and 6, respectively.
As shown in Table 5, the mixed method achieves precision values similar to those of the segmented-image assays, but the computational cost is higher. Although robustness increases with this methodology, when the switch occurs there is an increase in the loss value, which sometimes reaches values even higher than those at the beginning (Fig. 9). As a result, the assay may fail to converge or may remain in a local minimum, as happened with the segmented images.
In the modified mixed method, the weights of the network obtained in the first phase are better adapted to the characteristics of the segmented images, and this excessive increase in the error when making the change is avoided, as can be seen in the graph in Fig. 11.
The data from the assays with this second method (Table 6), as in the previous case, reflect levels of precision similar to those obtained with segmented images. In this case, the computational cost is also somewhat lower, reaching values of the order of those of the segmented-image assays. This is mainly because the peak at the moment of the change is avoided, which prevents backsliding in learning.
Therefore, this strategy achieves greater robustness without excessively affecting the level of precision or the computational cost. Finally, this methodology was used in a last assay with the non-overlapping synthetic dataset, to show that it can converge and to verify the increase in robustness.
At first, the assays with this dataset were unable to converge, since after the first change from distance-transformed to segmented images the level of overlap obtained was insufficient. For this reason, we decided to apply a warm-up to the learning rate of the optimizer that works with segmented images and, in this way, prevent the first segmented epoch from moving the network too far from the solution obtained with the distance-transformed images. The initial learning rate of this optimizer was 2e-5, and each time an epoch was carried out with segmented images its value increased by 2e-5 until reaching 1e-4. In this way, the final value was reached after the first 5 epochs in which this type of image was used, corresponding to the intercalated training phase.
Table 7 shows that this last training method is able to converge with this dataset, reaching an IoU of 0.984, which represents an increase of 7% with respect to the results with distance-transformed images. This demonstrates the increased robustness of the method while preserving the level of precision typical of segmented images.
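The warm-up described above amounts to a linear ramp of the learning rate, which can be sketched as follows. The start value, increment and final value come from the text; the function name and the convention that the argument counts the segmented epochs already completed are assumptions of this sketch.

```python
def warmup_lr(segmented_epochs_done, start=2e-5, step=2e-5, final=1e-4):
    """Learning rate of the segmented-image optimizer during the warm-up."""
    return min(start + step * segmented_epochs_done, final)


schedule = [warmup_lr(k) for k in range(7)]
```

Under this convention the first segmented epoch uses 2e-5 and the fifth uses 1e-4, after which the value stays constant.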

Conclusions
In this paper, a new training method is proposed for those cases in which the percentage of overlap between the two perspectives of the object of interest is small or even null. It has been demonstrated that incorporating distance-transformed images into the training, intercalating them with epochs that use segmented images during the first phase, solves the correspondence task with greater robustness and prevents convergence problems from occurring during these epochs.
Moreover, all this has been achieved while maintaining the level of precision of segmented images and without excessively increasing the computational cost.

Fig. 2 Scheme of a spatial transform network (STN)

Fig. 3 Images obtained by the simulator. a Image Micro 1. b Image Micro 2

Fig. 4 Images of the C. elegans dataset. a Cropped segmented image Micro 1. b Segmented image Micro 2. c Cropped distance-transformed image Micro 1. d Distance-transformed image Micro 2

Fig. 6 Synthetic dataset of images without overlap. a Cropped segmented Micro 1 image. b Segmented Micro 2 image. c Cropped distance-transformed Micro 1 image. d Distance-transformed Micro 2 image

Fig. 7 Loss curves of assays with C. elegans segmented images. a Assay that achieved convergence. b Assay that did not achieve convergence

Fig. 9 Loss curve for the assay with the mixed training method

Fig. 10 Schema of the modified mixed training method

Table 1
Summary of layer features

Table 4
Assays with C. elegans

Table 5
Assays of mixed training method and C. elegans images

Table 6
Assays with the new modified mixed training method and C. elegans images

Table 7
Assay with new modified mixed training method and no overlap synthetic dataset