1 Introduction

Animal biometrics, especially image-based individual re-identification, has recently gained extensive attention due to both its importance for ecology and conservation and the availability of large volumes of wildlife image data gathered via automatic game cameras and participatory science projects. The benefits of automated re-identification methods are evident: they provide valuable data for conservation efforts, for example, accurate population size estimates and novel information about animal migration and behavior patterns (McCoy et al., 2018; Araujo et al., 2020). Compared to traditional methods such as tagging, which may cause stress and change the behavior of the animal, image-based re-identification offers a non-invasive technique for monitoring endangered species (Norouzzadeh et al., 2018).

Fig. 1

Visualization of the proposed re-identification method called Aggregated Local Features for Re-Identification (ALFRE-ID). The input pictures are on the left and the results are on the right. The animal is segmented (orange outline), and the matching regions of the fur pattern are highlighted and connected with lines. The intensity of the highlights corresponds to the similarity of the matched regions

A fundamental challenge for animal identification is the problem of small labeled datasets. This arises in several variations. Firstly, there is an overall lack of images labeled with known individual IDs. Generating ground truth animal IDs for algorithm training requires a combination of (a) expertise, (b) good heuristics about appearance and location, (c) extensive searching, and (d) effective software tools (Kulits et al., 2021), making the generation of ground truth expensive, time-consuming, and focused on only the most charismatic species. Secondly, there is generally a long-tailed distribution in the number of sightings per individual animal, with many individuals seen just once or a few times, and fewer individuals seen frequently (see Fig. 10). This problem arises in part because of the just-enumerated difficulties in generating ground truth labels, and in part due to the inherent difficulty of obtaining the original data: some individuals are rarely in locations where images are acquired. Thirdly, animal identification is generally an open-set problem: except in special circumstances (Christin, 2015), it is rare that an entire population is represented by the photos in the database. Hence, any set of images added to the database may show new individuals. Effectively addressing these concerns will significantly broaden the utility of animal identification.

A variety of methods for image-based identification exist that utilize distinct characteristics in fur, feather, and skin patterns (Crall et al., 2013; Berger-Wolf et al., 2015; Moskvyak et al., 2021a; Li et al., 2020) or adapt techniques developed for human face re-identification (Deb et al., 2018; Crouse et al., 2017; Agarwal et al., 2019). Traditional methods require the least prior information and are therefore still used extensively in practice (Berger-Wolf et al., 2017), but they are significantly limited in how they exploit any available training data. Methods that learn without identity labels require manually selecting the features—such as ear, fin, and fluke contours (Weideman et al., 2020)—and are limited by both the need for manual generation of feature training data and the ability to select these features in the first place. Finally, deep learning methods, which offer the most power and flexibility, are data-hungry and therefore greatly challenged by the limited-data scenarios that can occur in animal re-identification.

In this paper, we propose a pipeline that combines the best of these approaches. This is obtained by utilizing deep CNN-based local features and feature aggregation. We call the pipeline Aggregated Local Features for Re-Identification (ALFRE-ID). By aggregating learnable local features, it is possible to obtain representative pattern feature embeddings that provide high re-identification accuracy similar to deep metric learning-based methods. At the same time, the possibility of using pretrained local feature descriptors allows us to apply the method to small datasets much more accurately than end-to-end deep learning methods. The proposed pipeline is inspired by content-based image retrieval (CBIR) methods and builds on earlier work (Nepovinnykh et al., 2020) where Siamese networks were utilized to learn a similarity metric for local patches of pelage patterns. We further develop this approach by utilizing affine invariant local CNN features and aggregating them into a fixed-size embedding vector describing global features. The full re-identification pipeline consists of tone mapping, animal segmentation, feature extraction, computation of aggregated pattern feature embeddings, selection of potential matches by finding the most similar embeddings in the database of known individuals, and geometric verification and final match ranking by analyzing the spatial consistency of the pattern similarities (see Fig. 1). The pipeline follows a modular approach where individual techniques such as local feature extractors can be changed to address differences between animal species.

Fig. 2

Example images of the main identifiable features from publicly available re-identification datasets: a Plains zebra (Equus quagga) (Parham et al., 2017): stripe fur pattern; b Masai giraffe (Giraffa tippelskirchi) (Parham et al., 2017): spot fur pattern; c Amur tiger (Panthera tigris) (Li et al., 2020): stripe fur pattern; d African elephant (Loxodonta africana) (Korschens & Denzler, 2019): head shape; e Saimaa ringed seal (Pusa hispida saimensis) (Nepovinnykh et al., 2022c): ringed fur pattern; f Humpback whale (Megaptera novaeangliae) (Cheeseman et al., 2017): fluke shape; g Whale shark (Rhincodon typus) (Holmberg et al., 2009): skin spot pattern; h Chimpanzee (Pan troglodytes) (Freytag et al., 2016): face

Our contributions are summarized as follows:

  1.

    We propose a CBIR-motivated pipeline for individual animal identification called ALFRE-ID that includes interchangeable learned local features, feature aggregation, and feature embeddings to address the limitations of current methods, especially on small labeled datasets.

  2.

    We experimentally demonstrate the advantages and tradeoffs of our pipeline in comparison to widely-used traditional methods based on non-learned, hand-crafted features (Hotspotter) and end-to-end deep learning methods.

  3.

    We evaluate the pipeline’s performance on two very different and challenging animal species showing trade-offs between various component options, demonstrating that a flexible pipeline of components is crucial for performance on small training datasets.

2 Related Work

2.1 Animal Re-identification

Animal re-identification is a broad term referring to the process of identifying an individual animal based on its features. The features are based on biological traits, and they can be captured in a number of ways, for example, acoustically (Hartwig, 2005; Pruchova et al., 2017) or visually in the form of images (Vidal et al., 2021) or videos (Zuerl et al., 2023). Currently, image-based methods are the most widely utilized approach due to the relative ease of data acquisition and manual analysis (Schneider et al., 2019).

Various animal species can be re-identified by different types of visually unique biological traits such as fur pattern, face, or fin shape. Examples of such traits are presented in Fig. 2. Re-identification methods can be divided into three categories: (1) traditional, non-learning methods that depend on hand-crafted local features, (2) methods that learn feature descriptions for manually selected biological traits, and (3) end-to-end deep learning methods. The first category consists of methods that extract and match hand-crafted local features such as SIFT (Lowe, 1999) between images and perform the re-identification typically by quantifying the similarity of the matching regions or the geometric consistency of the matched point pairs. For example, HotSpotter (Crall et al., 2013) is a SIFT-based re-identification algorithm that uses viewpoint invariant descriptors and a scoring mechanism that emphasizes the most distinctive key points, called “hot spots,” on an animal pattern. A similar approach was proposed in Pedersen et al. (2023), where multiple local feature descriptors including SIFT, SURF, and SuperPoint were compared on giant sunfish re-identification. Lalonde et al. (2022) proposed to use transformer-based local features (Sun et al., 2021), instead of traditional hand-crafted features, and a simple point-correspondence confidence-based matching criterion for blue whale re-identification. Algorithms in this category are species-agnostic and can be applied to a wide variety of biological traits. The HotSpotter algorithm has been successfully used for the re-identification of zebras (Equus quagga) (Crall et al., 2013), giraffes (Giraffa tippelskirchi) (Parham et al., 2017), jaguars (Panthera onca) (Crall et al., 2013), ocelots (Leopardus pardalis) (Nipko et al., 2020), and leopards (Panthera pardus) (Suessle et al., 2023).

The second category of methods utilizes species-specific traits such as ear [e.g., Asian elephant (De Silva et al., 2022)], fin [great white sharks (Hughes & Burghardt, 2017)], and fluke contours [e.g., humpback whale (Weideman et al., 2017, 2020)]. Both traditional feature-engineering based approaches and deep learning methods have been proposed to compute the feature (e.g., shape) representation for the selected traits. Examples of efficient algorithms for deep learning edge-based re-identification include CurvRank (Weideman et al., 2017), finFindR (Thompson et al., 2019, 2022), OC/WDTW (Bogucki et al., 2019), and the ArcFace-based method by Cheeseman et al. (2022). These methods have been applied to marine mammals such as bottlenose dolphins (Tursiops truncatus) (Tyson Moore et al., 2022; Thompson et al., 2019, 2022; Patton et al., 2023), humpback whales (Megaptera novaeangliae) (Webber et al., 2023; Patton et al., 2023), and right whales (Eubalaena glacialis) (Khan et al., 2022; Patton et al., 2023), and they use the unique shape of the tail or fins to identify the animals. Similar deep learning methods have also been used to learn feature descriptors for cattle muzzles (Kumar et al., 2018) and primate faces (Deb et al., 2018; Brust et al., 2017). Since the methods in this category operate by quantifying the specific visual traits that distinguish individuals of the species of interest, they can often be trained without identity labels. However, this also makes the methods species-specific, which limits their wider usability.

The third category consists of methods that utilize deep learning techniques such as convolutional neural networks (CNNs) to learn feature embeddings or to perform re-identification in an end-to-end manner, without the need to manually select the biological traits used for the re-identification. These methods can be divided into classification and metric-based approaches (Vidal et al., 2021). The classification-based approaches (see, e.g., de Silva et al. 2022) assume that the database of individuals is known and fixed, allowing the final algorithm to identify only individuals from that database. The metric-based methods (see, e.g., Schneider et al. 2022), on the other hand, aim to learn a similarity metric between the input images. The re-identification is then performed by clustering or matching based on the similarity, which means that metric-based approaches are not limited by the initial database and can be applied to new individuals without retraining. Metric-based methods are generally preferred since obtaining a full dataset containing all individuals is practically impossible for any wildlife application. However, it should be noted that it is possible to extend the classification-based methods to tackle the open-set problem. For example, Kim et al. (2022) proposed to use a CNN-based classifier with an OpenMax layer to address the missing individuals in the training set.

Most recent methods for animal re-identification utilize deep learning, particularly CNNs (Schneider et al., 2019, 2020). CNNs have been successfully applied for re-identification of Amur tigers (Panthera tigris) (Li et al., 2020; Liu et al., 2019a, b), zebras (Equus quagga) and giraffes (Giraffa tippelskirchi) (Badreldeen Bdawy, 2021), undulate skate (Raja undulata) (Gómez-Vargas et al., 2023), and bumblebees (Bombus terrestris) (Borlinghaus et al., 2023). In order to improve re-identification accuracy, pose estimation and key point alignment have been proposed (Yeleshetty et al., 2020; Yu et al., 2021; Moskvyak et al., 2021b).

PIE (Moskvyak et al., 2021a) is a deep learning-based method for matching individuals that is invariant to pose. The method learns separate shape and pose embeddings and normalizes the shape so that an individual can be matched regardless of its specific pose. PIE was originally developed for manta rays (Moskvyak et al., 2021a), but it has also been used for humpback whale flukes, orcas, and right whales. Apart from CNNs, vision transformers have also been proposed for animal re-identification (Zheng et al., 2022). While end-to-end deep learning methods have been shown to produce state-of-the-art performance when the amount of training data is large, their data-hungry nature limits their applicability to species for which large-scale databases are not available.

A number of re-identification methods specific to Saimaa ringed seals—one of our target species—have been proposed (Zhelezniakov et al., 2015; Chehrsimin et al., 2018; Nepovinnykh et al., 2018, 2020; Chelak et al., 2021; Nepovinnykh et al., 2022a, b, 2023; Immonen et al., 2023). Saimaa ringed seals are an especially challenging species for re-identification due to several issues: (i) a large variation in possible poses, exacerbated by the deformable nature of the seals, (ii) non-uniform pelage patterns, limiting the size of the regions that can be used for the re-identification task, (iii) low contrast between the ring pattern and the rest of the pelage, and (iv) extreme dataset bias, since the collected dataset contains disproportionately more images of certain individuals and the variety of backgrounds is extremely small due to the limited number of camera trap locations. These challenges have been addressed by various approaches to preprocess the images and to encode the pattern features (Zhelezniakov et al., 2015; Chelak et al., 2021; Nepovinnykh et al., 2020, 2022a). The most successful methods employ a pattern extraction step (Nepovinnykh et al., 2020, 2022a) to construct a binary representation of the pelage pattern, combined with metric learning-based pattern encoding.

Individual whale sharks can be identified based on the spot pattern on their skin. Arzoumanian et al. (2005) applied blob detection to find the individual spots and used a pattern-matching algorithm (Groth, 1986), originally developed for astronomical images (star patterns), to compare the patterns. Kholiavchenko (2022) utilized a U-Net-based model for spot detection and a metric learning-based approach to generate pattern embeddings for the re-identification of individuals.

Fig. 3

ALFRE-ID re-identification pipeline

2.2 Content Based Image Retrieval

The task of visual animal re-identification can be formulated as the task of finding the most similar image in a database to a given query image. This formulation matches the definition of content-based image retrieval (CBIR) (Smeulders et al., 2000) and motivates the study of the suitability of CBIR methods for animal re-identification. CBIR methods have already been applied to the task of animal re-identification (Nepovinnykh et al., 2022a).

CBIR methods usually consist of two main steps: feature extraction and feature aggregation. The feature extraction problem can be solved using standard hand-crafted features, such as the Scale Invariant Feature Transform (SIFT) (Lowe, 2004; Arandjelović & Zisserman, 2012), or extraction by convolutional neural networks (see, e.g., Mishchuk et al., 2017). Then, feature aggregation creates a descriptor for each image that can be used to find the most similar image in the database. Traditional methods such as Bag of Words (BOW) (Sivic, 2003), the Vector of Locally Aggregated Descriptors (VLAD) (Jégou et al., 2010), and the Fisher Vector (Perronnin & Dance, 2007; Perronnin et al., 2010; Hutchison et al., 2010) perform the aggregation using a specially constructed codebook. The codebook is usually created by an unsupervised clustering algorithm. For example, k-means (MacQueen et al., 1967) is used for VLAD, and a Gaussian Mixture Model (GMM) (McLachlan & Basford, 1988) is used for the Fisher Vector. Finally, fixed-size descriptors are created for each image based on the vocabulary and the extracted features. The distance between these descriptors is inversely proportional to the visual similarity.

Due to the availability of data and the convenience of end-to-end approaches, deep learning-based methods for CBIR are becoming increasingly popular, for example, NetVLAD (Arandjelovic et al., 2016), where a generalized VLAD layer is used to aggregate CNN-extracted features.

Visual localization (Sarlin et al., 2019) also shares similarities with CBIR and animal re-identification. In visual localization, the task is to find the location in an environment that corresponds to a given image. While the formulation of CBIR is more closely related to animal re-identification, visual localization utilizes similar steps, including pose estimation, feature aggregation, database search, and geometric verification.

3 Pipeline

The proposed ALFRE-ID pipeline is inspired by CBIR techniques and consists of seven steps (see Fig. 3): (1) image preprocessing, (2) instance segmentation, (3) pelage pattern extraction, (4) feature extraction, (5) feature aggregation, (6) individual re-identification, and (7) geometric verification. Some of these steps involve choices between different methods depending on the species.

3.1 Image Preprocessing

Depending on the illumination conditions, the variation in image contrast can be rather high. This can lead to a loss of detail in the region of interest, i.e., the animal and its fur pattern. To rectify this issue, we employ a tone-mapping approach to equalize the contrast in dark and bright image regions. The algorithm proposed by Mantiuk et al. (2006) is used due to its ability to produce realistic tone-mapped images without introducing visual artifacts. The method considers contrast at multiple spatial frequencies and uses gradient methods with additional extensions to ensure that global brightness levels are not reversed and low-frequency details are properly reconstructed. Examples of images before and after preprocessing are presented in Fig. 4.
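As an illustration, the snippet below is a minimal sketch of this preprocessing step using OpenCV's implementation of the Mantiuk et al. operator; the gamma, scale, and saturation values and the file name are illustrative assumptions, not the settings used in our experiments.

```python
import cv2
import numpy as np

def tone_map(image_bgr: np.ndarray) -> np.ndarray:
    """Equalize the contrast of an 8-bit camera-trap image with Mantiuk tone mapping."""
    # OpenCV tone-mapping operators expect a 32-bit float image.
    hdr = image_bgr.astype(np.float32) / 255.0
    # Gamma, scale, and saturation are illustrative values.
    tonemapper = cv2.createTonemapMantiuk(gamma=1.0, scale=0.85, saturation=1.0)
    ldr = tonemapper.process(hdr)
    # The result is roughly in [0, 1]; clip and convert back to 8-bit for later steps.
    return np.clip(ldr * 255.0, 0, 255).astype(np.uint8)

preprocessed = tone_map(cv2.imread("camera_trap_image.jpg"))
```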

Fig. 4

Examples of the image processing of camera trap images. The images on the left are the originals. The right column demonstrates the result of the tone-mapping

3.2 Instance Segmentation

The instance segmentation step is important in the common scenario where datasets are collected using static camera traps. This, together with the fact that individual animals tend to use the same sites or areas inter-annually, means that one individual is very often captured by the same camera (against the same background). This increases the risk that a supervised re-identification algorithm learns to identify the background instead of the actual animal if the full image, or a bounding box around the animal, is used. Consequently, the method may be unable to identify the animal in a new environment.

The model selection depends on the species. Animals captured in groups require instance segmentation methods such as Mask R-CNN (He et al., 2017), while for solitary animals, the segmentation can be solved with simpler semantic segmentation models. For various common animal species, pretrained models are already available (Bello et al., 2021; Chen & Belbachir, 2023; Dai & Liu, 1966). If this is not the case, transfer learning can be utilized. Recent promptable segmentation methods, inspired by foundation models in natural language processing, such as the Segment Anything Model (SAM) (Kirillov et al., 2023), provide flexible segmentation for new target species via zero-shot transfer. Details on how the instance segmentation was implemented for the target species are given in Sect. 4.1.
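The sketch below illustrates what such a segmentation step could look like, using torchvision's Mask R-CNN pretrained on COCO; the confidence and mask thresholds are illustrative assumptions, and in practice the model would be fine-tuned or replaced by a species-specific model as described above.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Load a Mask R-CNN pretrained on COCO; for a new species this would typically be
# fine-tuned (transfer learning) on a small annotated set.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

@torch.no_grad()
def segment_animals(image_rgb, score_threshold=0.7, mask_threshold=0.5):
    """Return binary instance masks for detections above a confidence threshold."""
    prediction = model([to_tensor(image_rgb)])[0]
    keep = prediction["scores"] > score_threshold
    # Masks are soft (H, W) probability maps; binarize them for the downstream steps.
    masks = (prediction["masks"][keep, 0] > mask_threshold).cpu().numpy()
    return masks
```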

3.3 Pelage Pattern Extraction

The main identifying feature of many species is their fur, feather, or skin pattern. Often the pattern is both permanent and unique to each individual, and therefore, quantifying the pattern can form the basis of individual re-identification. Depending on the species it can be beneficial to focus the attention on the pattern and discard irrelevant information causing database bias—such as illumination and other visual factors—by extracting the pattern from the images (Nepovinnykh et al., 2022a). The pattern extraction can be formulated as an image binarization problem and solved using encoder-decoder networks. The result of the pattern extraction step is a binary image containing only the pattern.

Due to differences in fur patterns between species, the pattern extraction step can be unnecessary or may require a custom model. Detailed descriptions of the pattern extraction step for the target species are provided in Sect. 4.1. Since the animal is first segmented and the pattern colors often follow a bimodal distribution (dark pattern on a light background or vice versa), reasonably good pattern extraction accuracy can be obtained with segmentation models pretrained on other species if the necessary training data is not available for the target species. For example, Immonen et al. (2023) successfully applied a pattern extraction model trained on Saimaa ringed seals to whale sharks.

3.4 Feature Extraction

Local feature extraction and description have been shown to be efficient tools for animal re-identification (Berger-Wolf et al., 2017; Nepovinnykh et al., 2022a). However, traditional hand-crafted local features such as SIFT are significantly limited in how they exploit any available training data. Modern learning-based local features, on the other hand, leverage the benefits of deep learning and CNNs to obtain representative feature descriptors, making them an attractive alternative for animal re-identification.

Wild animals can be found in a variety of poses resulting in distorted and warped patterns on images. While the pattern as a whole is transformed in a non-linear way, it can be argued that small local regions experience close to affine transformations, making an affine invariant feature extractor suitable for the task. Modern CNN-based local feature extraction approaches allow learning affine invariant feature descriptors using general-purpose datasets. This makes the feature extraction step flexible in a sense that different feature detector and descriptor combinations can be used for different species without the need for additional training.

HesAffNet (Mishkin et al., 2018) is a modification of the classical Hessian Affine region detector (Mikolajczyk & Schmid, 2002, 2004) in which the shape estimation step is performed by the AffNet CNN. The detector is based on the Harris cornerness measure (Harris & Stephens, 1988), which uses the second moment matrix to find regions of interest by estimating the most prominent gradient directions. This is combined with the multiscale approach of Lindeberg (1998), which uses the Laplacian of Gaussian to find extrema in scale space. The same concept can be further extended to all affine transformations, not just scale. However, the number of degrees of freedom is much higher for affine transformations, which complicates the process and requires a special shape adaptation algorithm. The original Hessian Affine detector used the Baumberg iteration (Baumberg, 2000), which is replaced by the AffNet CNN in HesAffNet.

AffNet and HardNet are closely related, sharing the same architecture and using similar training procedures. During the training of HardNet (Mishchuk et al., 2017), batches of matching patch pairs are chosen, each containing an anchor \(a_i\) and a positive match \(p_i\). Each patch is encoded by the network, and a matrix of pairwise distances between all anchors and positive matches is computed. For each pair, the closest non-matching descriptor in the batch is chosen, and the final hard negative margin loss is computed as

$$\begin{aligned} \begin{aligned} L =&\frac{1}{n} \sum _{i=1}^n \max (0, 1+d(a_i, p_i) \\&- \min (d(a_i, p_{j\min }), d(a_{j\min }, p_i))), \end{aligned} \end{aligned}$$
(1)

where \(d(\cdot , \cdot )\) is the distance function, \(p_{j\min }\) is the closest non-matching positive to \(a_i\), and \(a_{j\min }\) is the closest non-matching anchor to \(p_i\).
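For clarity, the following is a compact PyTorch sketch of the loss in Eq. (1) with in-batch hard negative mining; it assumes L2-normalized descriptors and is not the authors' original training code.

```python
import torch

def hard_negative_margin_loss(anchors: torch.Tensor, positives: torch.Tensor) -> torch.Tensor:
    """Hard negative margin loss of Eq. (1) for a batch of matching descriptor pairs.

    anchors, positives: (n, d) L2-normalized descriptors; row i of each is a matching pair.
    """
    n = anchors.shape[0]
    # Pairwise Euclidean distances between all anchors and all positives.
    dist = torch.cdist(anchors, positives)                   # (n, n)
    d_pos = dist.diagonal()                                  # d(a_i, p_i)
    # Mask out the matching pairs so they cannot be selected as negatives.
    off_diag = dist + torch.eye(n, device=dist.device) * 1e6
    d_neg_p = off_diag.min(dim=1).values                     # d(a_i, p_{j_min})
    d_neg_a = off_diag.min(dim=0).values                     # d(a_{j_min}, p_i)
    hardest_neg = torch.minimum(d_neg_p, d_neg_a)
    return torch.clamp(1.0 + d_pos - hardest_neg, min=0.0).mean()
```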

AffNet utilizes a slightly different training procedure, the main difference being that the derivative of the negative term in the loss is set to 0. This loss is called the hard negative constant loss and helps avoid situations where positive samples cannot be moved closer together because a negative sample lies between them in the metric space. The training procedure for AffNet is also more complicated, since it learns affine shapes and not just a distance metric. Therefore, spatial transformers are used to transform the input patches according to the predicted shapes; the transformed patches are fed into a descriptor network, e.g., HardNet, and only then is the loss calculated and backpropagated through both networks. An example of applying HesAffNet to a preprocessed image is visualized in Fig. 5.

Fig. 5

Visualization of Hessian Affine patch extraction: a segmented image; b HesAffNet-based patch extraction. Note that while the original images are used for visualization purposes, the features are extracted from pattern images. Extracted regions are highlighted in green (Color figure online)

Going a step further, the Hessian detector can be replaced by a learned CNN-based method. Key.Net (Barroso-Laguna & Mikolajczyk, 2022) uses a combination of handcrafted and learned filters along with a multi-scale pyramid. This feature detector can be used in conjunction with AffNet and HardNet to form a full feature detection and encoding pipeline.

Alternatively, DISK (Tyszkiewicz et al., 2020) provides an end-to-end framework for both feature extraction and encoding. DISK uses a U-net as the backbone and is trained with reinforcement learning. The network is trained to maximize the number of correct matches using a policy gradient method, keeping training and inference very close to each other. The network outputs dense descriptors and a keypoint heatmap, which can be combined to obtain discriminative sparse keypoints.
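Both alternatives are available as pretrained modules, for example in the kornia library. The sketch below assumes a recent kornia release (exact class names and signatures may differ between versions), and the number of keypoints and image size are illustrative.

```python
import kornia.feature as KF
import torch

# Grayscale pattern image as a (1, 1, H, W) float tensor in [0, 1] (placeholder data here).
img = torch.rand(1, 1, 512, 512)

# Option 1: Key.Net detector + AffNet shape estimation + HardNet descriptors.
keynet_hardnet = KF.KeyNetAffNetHardNet(num_features=2000).eval()
with torch.no_grad():
    lafs, responses, descriptors = keynet_hardnet(img)       # descriptors: (1, N, 128)

# Option 2: DISK end-to-end detector/descriptor (expects a 3-channel input).
disk = KF.DISK.from_pretrained("depth").eval()
with torch.no_grad():
    features = disk(img.repeat(1, 3, 1, 1), n=2000)[0]
    keypoints, disk_descriptors = features.keypoints, features.descriptors
```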

3.5 Feature Aggregation

Features are aggregated using the Fisher Vector (Perronnin & Dance, 2007; Perronnin et al., 2010; Hutchison et al., 2010). First, Principal Component Analysis (PCA) is applied to the feature embeddings to decorrelate the features and reduce their dimensionality. This is important for Fisher Vectors, which are known to produce large descriptors. The images in the database of known individuals are used to learn the principal components. Next, a visual vocabulary (codebook) is constructed by fitting a Gaussian Mixture Model (GMM) to the features from the database. Then, a Fisher Vector is created for each image by computing the partial derivatives of the log-likelihood function with respect to the GMM parameters and concatenating them.

3.5.1 Fisher Vector

Let \(X = \{x_t, t=1, \ldots , T\}\) be a sample of T observations and \(u_\lambda \) be a probability density function modeling the distribution of the data, where \(\lambda \) is a vector of its parameters. The score is defined as the gradient of the log-likelihood of the data under the model:

$$\begin{aligned} G^X_{\lambda } = \nabla _{\lambda }\log u_{\lambda }(X). \end{aligned}$$
(2)

This score function can be used to define the Fisher Information Matrix (FIM) (Amari & Nagaoka, 2000):

$$\begin{aligned} F_{\lambda } = E_{x\sim u_{\lambda }}[G^X_{\lambda }{G^X_{\lambda }}'], \end{aligned}$$
(3)

which acts as a local metric for a parametric family of distributions. This metric can also be used to measure the similarity between two samples using the Fisher Kernel (FK) (Jaakkola & Haussler, 1999):

$$\begin{aligned} \begin{aligned} K_{FK}(X,Y)&= {G^X_{\lambda }}' F^{-1}_{\lambda } G^Y_{\lambda } \\ {}&= {G^X_{\lambda }}' {L_{\lambda }}' L_{\lambda } G^Y_{\lambda } \\ {}&= {\mathscr {G}^X_{\lambda }}' \mathscr {G}^Y_{\lambda }, \end{aligned} \end{aligned}$$
(4)

where \(L_\lambda 'L_\lambda \) is the Cholesky decomposition of \(F^{-1}_{\lambda }\), and \(\mathscr {G}^X_{\lambda }\) and \(\mathscr {G}^Y_{\lambda }\) are the Fisher Vectors of samples X and Y, respectively. By using Fisher Vectors, it is possible to calculate the kernel as a simple dot product, which can be efficiently utilized by linear classifiers. When constructing a Fisher Vector for an image, the set of local features is assumed to be independent, meaning that the final descriptor can be constructed as a sum of the Fisher Vectors of the individual local features, i.e.,

$$\begin{aligned} \mathscr {G}^X_{\lambda } = \sum _{t=1}^T L_\lambda \nabla _{\lambda }\log u_{\lambda }(x_t). \end{aligned}$$
(5)

Usually, a Gaussian Mixture Model (GMM) is used as \(u_\lambda \), since it can approximate any continuous distribution with arbitrary precision (Titterington et al., 1985). The gradients with respect to the GMM parameters are concatenated into a vector of size 2DK, where D is the dimensionality of the samples and K is the number of components in the GMM. It has been shown (Hutchison et al., 2010) that L2 and power normalization generally improve the performance of the method. Therefore, it is common to apply power and L2 normalization to the Fisher Vector to obtain the final descriptor.
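The following sketch illustrates the aggregation step with scikit-learn: a PCA and a diagonal-covariance GMM are fitted on descriptors pooled from the database images, and a power- and L2-normalized Fisher Vector (mean and variance gradients only) is computed for a query image. The descriptor arrays are placeholders, and the PCA and GMM sizes are illustrative rather than the values used in our experiments.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(features: np.ndarray, gmm: GaussianMixture) -> np.ndarray:
    """Fisher Vector (mean and variance gradients) for one image.

    features: (T, D) PCA-reduced local descriptors; gmm: fitted diagonal-covariance GMM.
    """
    T, D = features.shape
    gamma = gmm.predict_proba(features)                       # (T, K) posteriors
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_   # diag covariances: (K, D)
    sigma = np.sqrt(var)
    diff = (features[:, None, :] - mu[None, :, :]) / sigma[None, :, :]          # (T, K, D)
    g_mu = (gamma[:, :, None] * diff).sum(axis=0) / (T * np.sqrt(w)[:, None])
    g_var = (gamma[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])        # length 2*D*K
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                    # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                  # L2 normalization

# Placeholder descriptor arrays; in the pipeline these come from the feature extraction step.
database_descriptors = np.random.rand(20000, 128)
query_descriptors = np.random.rand(1500, 128)

# Codebook: PCA + GMM fitted on descriptors pooled from the database images.
pca = PCA(n_components=64).fit(database_descriptors)
gmm = GaussianMixture(n_components=16, covariance_type="diag").fit(pca.transform(database_descriptors))
embedding = fisher_vector(pca.transform(query_descriptors), gmm)
```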

3.6 Individual Re-identification

Re-identification is done by calculating the cosine distance from the query image descriptor to each image descriptor in the database of known individuals as

$$\begin{aligned} d_L = 1-\frac{\Phi _q \cdot \Phi _{db}}{||\Phi _q ||_2 ||\Phi _{db} ||_2}, \end{aligned}$$
(6)

where \(\Phi _q\) is the Fisher vector of the query image and \(\Phi _{db}\) is the Fisher vector of a database image. This distance quantifies the dissimilarity of the aggregated local pattern appearances between the images. The individuals in the database are ranked based on the distances, with the first-ranked being the most likely match.
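A minimal sketch of this ranking step, assuming the query and database Fisher vectors are stored as NumPy arrays:

```python
import numpy as np

def rank_database(query_fv: np.ndarray, db_fvs: np.ndarray):
    """Rank database images by cosine distance (Eq. 6) to the query Fisher vector."""
    q = query_fv / np.linalg.norm(query_fv)
    db = db_fvs / np.linalg.norm(db_fvs, axis=1, keepdims=True)
    distances = 1.0 - db @ q          # cosine distance d_L for every database image
    order = np.argsort(distances)     # index 0 = most likely match
    return order, distances[order]
```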

3.7 Geometric Verification

Aggregated local pattern appearance does not take into account the global spatial structure of the pattern. To further incorporate this information into the pattern matching, the geometric consistency of the local similarities is analyzed. This is done using a method similar to the spatial reranking step of the HotSpotter algorithm (Crall et al., 2013) and the object retrieval method proposed in Philbin et al. (2007). Local interest points extracted from each image are matched to find feature correspondences between the query and database images. The matching is done by computing cosine distances between the embeddings of individual feature pairs.

The image coordinates of the feature correspondences are then normalized to have zero mean and a maximum distance of 1 from the origin. Outliers (and inliers) are detected by estimating the parameters of a homography between the query image and the database image using RANSAC. The assumption is that if the patterns do not match, the inconsistency in the global arrangement of feature correspondences results in a low number of inliers. Therefore, the number of inliers, n, is a good metric for the geometric similarity of the patterns. It should be noted that due to the large pose variation of animals, it is recommended to use a high inlier threshold to ensure successful outlier detection in the case of matching patterns.

The final re-identification of the animal individual in the query image is performed by searching for the most similar pattern in the database of known individuals. To compute the dissimilarity (distance), a novel combination of the dissimilarity of the aggregated local pattern appearance and the geometric dissimilarity of the patterns is used. We use the following reranking rule:

$$\begin{aligned} d_C = (d_L) ^ n, \end{aligned}$$
(7)

where \(d_L\) is the cosine distance between the Fisher vectors (aggregated local pattern appearance) and n is the number of inliers. The geometric consistency, defined as the number of inliers n, has an exponential influence on the cosine distance (\(d_L \le 1\)). If the number of individuals in the database is large, re-identification can be made more efficient by using the aggregated Fisher vector for a quick database search and applying the geometric similarity only as a reranking or verification step.
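The sketch below illustrates the verification and reranking steps with OpenCV's RANSAC homography estimation; the reprojection threshold is an illustrative value for the normalized coordinates, not the setting used in our experiments.

```python
import cv2
import numpy as np

def rerank_score(d_L: float, query_pts: np.ndarray, db_pts: np.ndarray,
                 reproj_threshold: float = 0.1) -> float:
    """Combine appearance distance d_L with geometric consistency as d_C = d_L ** n (Eq. 7).

    query_pts, db_pts: (M, 2) normalized coordinates of matched keypoints
    (zero mean, maximum distance 1 from the origin).
    """
    if len(query_pts) < 4:            # a homography needs at least 4 correspondences
        return d_L
    _, inlier_mask = cv2.findHomography(query_pts.astype(np.float32),
                                        db_pts.astype(np.float32),
                                        cv2.RANSAC, reproj_threshold)
    n = int(inlier_mask.sum()) if inlier_mask is not None else 0
    return d_L ** n                   # more inliers -> smaller combined distance (d_L <= 1)
```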

4 Experiments and Results

Our experiments focus on two key issues: (i) the impact of modern pretrained feature extraction algorithms on content-based retrieval approaches to individual animal identification and (ii) the impact of training data size on the relative efficacy of local feature-based methods compared to end-to-end deep learning-based methods.

4.1 Data

We consider two very differently patterned animals: Saimaa ringed seals and whale sharks. Saimaa ringed seal patterns consist of local arrangements of ring-like shapes. The regions enabling the re-identification often constitute a rather small portion of the whole pattern. This, together with the low contrast between the pattern and the rest of the body and the varying appearance of the pattern, makes this a challenging dataset. Whale shark patterns, on the other hand, consist of small spots with similar appearance, and the main trait allowing re-identification is the geometric arrangement of the spots. Small differences between individuals and a large variation in image quality due to underwater imaging further complicate the re-identification task.

4.1.1 Saimaa Ringed Seals

The re-identification dataset consists of 57 individual seals with a total of 2080 images. The dataset is divided into two subsets: the database subset (430 images) and the query subset (1650 images). The database subset contains a minimal number of high-quality unique images that are enough to cover the full body pattern of each seal. The query subset contains the remaining images of the same individuals as in the database. It should be noted that the high-quality images were prioritized when constructing the database and, therefore, images in the query subset often have lower quality. Examples of images from both subsets are presented in Fig.  6. The dataset has been made publicly available. For further description of the dataset, see Nepovinnykh et al. (2022c).

Fig. 6

Examples from the database and query datasets. Every row contains images of an individual seal. For every image from the query dataset (left) there is a corresponding subset of images from the database (right)

Images were segmented using Mask R-CNN (He et al., 2017). A segmentation model trained for Ladoga ringed seals in Nepovinnykh et al. (2022b) was utilized. This is possible because the two subspecies are visually almost indistinguishable. Ladoga ringed seals are more numerous than Saimaa ringed seals and are often captured in large groups, which makes it easier to collect and annotate large training data for the segmentation. For more details about the instance segmentation model and training procedure, see Nepovinnykh et al. (2022b). After the segmentation masks were obtained, morphological closing and opening operations were applied to fill holes and smooth the mask borders. Examples of segmentation results are presented in Fig. 7.

Fig. 7

Examples of the segmentation masks. The images on the left are the originals. The mask is highlighted in blue and the background is highlighted in red on the middle images. The last column shows the result of the segmentation (Color figure online)

The Saimaa ringed seal pattern was extracted using the U-net encoder-decoder architecture (Ronneberger et al., 2015). The pattern image was further post-processed with unsharp masking and morphological opening to remove small noise. Finally, all images were resized so that the mean width of the pattern lines was the same for all images, bringing them to the same scale. The line width of each image can be approximated as the ratio of the number of white pattern pixels to the number of pixels in the morphological skeleton of the pattern. This operation is necessary since the images were obtained from a variety of sources and have a large variation in image resolution. Example results of the pattern extraction are shown in Fig. 8. For a more detailed explanation of the seal pattern extraction step, as well as a comparison to other methods, see Zavialkin (2020).
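A sketch of the scale normalization described above, using the skeleton-based line-width estimate; the target line width is an illustrative value.

```python
import cv2
import numpy as np
from skimage.morphology import skeletonize

def normalize_line_width(pattern: np.ndarray, target_width: float = 4.0) -> np.ndarray:
    """Rescale a binary pattern image so that its mean stroke width equals target_width.

    pattern: image with pattern pixels > 0; target_width is an illustrative value in pixels.
    """
    binary = pattern > 0
    skeleton = skeletonize(binary)
    # Mean line width ~ ratio of pattern pixels to skeleton pixels.
    width = binary.sum() / max(skeleton.sum(), 1)
    scale = target_width / width
    new_size = (int(pattern.shape[1] * scale), int(pattern.shape[0] * scale))
    return cv2.resize(binary.astype(np.uint8) * 255, new_size,
                      interpolation=cv2.INTER_NEAREST)
```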

Fig. 8

Visualization of the pattern extraction step for the Saimaa ringed seals: (a) and (b), and for the whale sharks: (c)–(e)

4.1.2 Whale Sharks

To study the re-identification, the whale shark identification dataset provided by Wild Me (Holmberg et al., 2009; Blount et al., 2022) was used. The database of whale sharks was curated using the semi-automatic Modified Groth algorithm (Arzoumanian et al., 2005; Holmberg et al., 2009) to suggest matches, which were then verified by whale shark experts. Each image in the dataset is accompanied by a bounding box delineating the torso of the whale shark's body, an individual identification tag, and the viewpoint of the animal (right or left). Examples of whale shark images cropped according to the bounding boxes are presented in Fig. 9. The dataset is divided into training and test subsets for training neural network-based methods. The training subset comprises a total of 5409 annotated sightings of 235 distinct whale shark viewpoints (unique combinations of an individual and a viewpoint). The test subset consists of 1543 sightings belonging to 412 unique viewpoints. No individuals present in the training set are included in the test set. The distribution of images across classes for both subsets can be seen in Fig. 10. Since a query/database split is not provided for this dataset, a leave-one-out strategy is used to assess the re-identification accuracy, i.e., each image is compared to all other images.

Fig. 9

Sample images from the whale shark dataset

Fig. 10

Image distribution across the whale shark dataset, displaying the number of classes (individual + viewpoint) and the corresponding number of images along the x-axis: a training subset; b test subset. For example, in the test subset, there are around 150 classes with fewer than two images per class

Whale sharks exhibit a pattern characterized by an array of spots covering their massive bodies. These spots, varying in size and spacing, create a unique mosaic-like arrangement that serves as a natural identifier for individual whale sharks. To accurately extract the pattern, we adopt a specialized approach that centers on segmenting these white spots. The segmentation is performed by a neural network classifying the image at the pixel level, precisely identifying and delineating each individual white spot on the whale shark's body. In our work, we adopt the U-net architecture (Ronneberger et al., 2015) with the SEResNet34 (Hu et al., 2018) backbone, which has been successfully applied to similar problems such as blood vessel segmentation in medical images. The U-net architecture consists of encoder and decoder parts. The encoder hierarchically encodes the input image into a latent representation, effectively capturing the essential features. The decoder employs up-sampling layers to expand the latent representation back to the original image dimensions. Skip connections pass information from the encoder layers to the corresponding decoder layers, which helps transfer the classification context to the localization part. The network's objective is to recognize the presence of white spots and outline their boundaries down to the finest details. The result is a set of binary masks in which every non-zero pixel corresponds to the location of a white spot on the whale shark's body.
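A minimal training sketch with the segmentation_models_pytorch library is given below; a plain ResNet34 encoder stands in for the SE-ResNet34 backbone (availability of specific SE encoders depends on the library version), and the Dice loss and learning rate are illustrative choices rather than the exact training configuration.

```python
import torch
import segmentation_models_pytorch as smp

# U-net for binary spot segmentation. A plain ResNet34 encoder is used here as a
# stand-in for the SE-ResNet34 backbone described in the text.
model = smp.Unet(encoder_name="resnet34", encoder_weights="imagenet",
                 in_channels=3, classes=1)
loss_fn = smp.losses.DiceLoss(mode="binary")     # illustrative choice for thin/small structures
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(images: torch.Tensor, spot_masks: torch.Tensor) -> float:
    """One optimization step on a batch of (B, 3, H, W) images and (B, 1, H, W) masks."""
    optimizer.zero_grad()
    logits = model(images)
    loss = loss_fn(logits, spot_masks)
    loss.backward()
    optimizer.step()
    return loss.item()

# At inference, threshold the sigmoid output to obtain the binary spot pattern:
# spots = torch.sigmoid(model(image_batch)) > 0.5
```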

4.2 Comparison of Feature Descriptors

In order to select the most suitable feature descriptor for each dataset, re-identification for each dataset has been performed using three different deep local feature approaches: (1) HesAffNet (Mishkin et al., 2018) for feature detection and HardNet (Mishchuk et al., 2017) for feature description, (2) Key.Net (Barroso-Laguna & Mikolajczyk, 2022) + AffNet (Mishkin et al., 2018) for feature detection and HardNet (Mishchuk et al., 2017) for description (referred to as Key.Net + HardNet for brevity), and (3) the DISK (Tyszkiewicz et al., 2020) feature detector and descriptor.

4.2.1 Saimaa Ringed Seals

Results for the SealID dataset are presented in Table 1. It is clear that the choice of feature extractor and descriptor greatly affects the final accuracy. The difference between DISK and HesAffNet + HardNet is around 30%, with DISK showing the worst results and HesAffNet + HardNet the best. The results also indicate that the pattern extraction step is integral to re-identification on the SealID dataset, increasing the accuracy by about 20%.

Table 1 Experiments with different descriptors on the SealID dataset

4.2.2 Whale Sharks

For the whale shark dataset, only the training subset was used to create the codebooks for PCA and the GMM. It should be noted that since we use pretrained local feature detectors and descriptors, no method training is needed. Codebook generation does not require identity labels, which makes it possible to test two realistic scenarios: (1) the codebook is generated from the same set of images to which the re-identification is applied (fine-tuned codebook) and (2) the codebook is generated and tested with different sets of images (pre-generated codebook). The first corresponds to a scenario where re-identification is applied to a fixed set of images collected earlier. In this case, the re-identification process starts with the generation of a codebook and proceeds to re-identify the animal in each image. In the second scenario, the codebook is generated beforehand (offline) and the re-identification happens online as new images are collected. Note that since the subset used to generate the codebook (training set) and the subset used to test the re-identification accuracy do not contain the same individuals, this is even more challenging than a typical scenario, where at least some individuals have been captured earlier and can be used for generating the codebook.

Both the fine-tuned and pre-generated codebooks were tested using a leave-one-out strategy. The results for the dataset without the pattern extraction step are presented in Table 2. The DISK approach achieves the highest re-identification accuracy, in contrast to the SealID dataset, where it performed the worst. The results for the pre-generated codebook are consistently worse than for the fine-tuned codebook, which is the expected consequence of how the codebooks were created. Results for different feature extractors and descriptors on the whale shark dataset with the pattern extraction step are presented in Table 3. DISK applied to the original images outperforms all other feature extractors and descriptors both with and without the pattern extraction step. With the addition of the pattern extraction step, both Key.Net + HardNet and DISK perform comparably well, achieving higher accuracy than HesAffNet + HardNet, with DISK producing slightly higher accuracy than Key.Net + HardNet when using the pre-generated codebook. Surprisingly, while pattern extraction significantly increases the accuracy of the HesAffNet + HardNet and Key.Net + HardNet approaches, the DISK method produces better results on the original images.

Table 2 Experiments with different descriptors on the whale shark dataset without pattern extraction
Table 3 Experiments with different descriptors on the whale shark dataset with pattern extraction

4.3 Comparison to PIE

PIE (Moskvyak et al., 2021a) is an end-to-end deep learning method for re-identification. The main problem with PIE and similar methods is their need for a large amount of labeled training data, which is often not available for wildlife applications. Acquiring and labeling large datasets of animal individuals is a difficult and tedious task requiring expertise, time, and effort. With that in mind, one of the main advantages of the proposed ALFRE-ID pipeline is that the core of the algorithm does not require training on the target dataset, as the feature extractors and descriptors are pretrained and the codebook generation does not require labeled data. In order to simulate a real-world scenario where fully labeled data is scarce, we compared the ALFRE-ID pipeline to PIE with different training set sizes: 100%, 50%, and 25% of the original training sets. For ALFRE-ID, the training set was only used to generate the PCA and GMM codebooks. The reduction is done on a per-individual basis, i.e., for each individual only the corresponding fraction of the available images from the full training set is used for the training or codebook generation. The 100% training set corresponds to the standard split.

In order to compare PIE to ALFRE-ID on the SealID dataset, a special train-test split was used. The whole dataset, i.e., the union of the query and database subsets, was divided in the following manner: if an individual has more than 6 samples, it is assigned to the training set, and otherwise to the test set. Thus, the training and test sets contain different sets of individuals, similarly to the whale shark dataset. The final scores are reported for leave-one-out re-identification on the test set. The results are presented in Table 4.

As expected, the size of the set used to generate the codebook does not have a large influence on the re-identification accuracy of the ALFRE-ID method. The difference in accuracy between the full and the half training set is between 1% and 7% for the different datasets. Further reducing the size of the training set to 25% of its original size does not have a negative effect on the accuracy. Contrary to ALFRE-ID, the accuracy of PIE drops significantly when the size of the training set is reduced. The accuracy on the whale shark dataset drops from 86% to 51% when the size of the training set is reduced to a quarter of its original size. When the pre-generated codebook is used for ALFRE-ID, PIE shows higher accuracy only if 50% or more of the available images are used for training. On SealID, a similarly large drop can be observed. Results with the fine-tuned codebook are again considerably better than with the pre-generated codebook. However, it should be noted that the accuracies with the fine-tuned codebook are not fully comparable with those of PIE as the test set is different.

Table 4 Experiments with different sizes of the training set

4.4 Comparison to HotSpotter

HotSpotter (Crall et al., 2013) is another popular species-agnostic re-identification algorithm; it uses hand-crafted local features (SIFT) for the re-identification. The comparison between ALFRE-ID and HotSpotter on both datasets is presented in Table 5. ALFRE-ID outperforms HotSpotter on both datasets, with a lead of about 20%. Moreover, only small differences between the Top-1 and Top-5 scores can be observed for HotSpotter, while the increase in accuracy for the ALFRE-ID method is clear. This means that ALFRE-ID would provide more benefit in a semi-automatic re-identification scenario where a set of best matches is provided to an expert for final verification. The results indicate that modern CNN-based local features together with feature aggregation significantly increase the re-identification accuracy compared to traditional local feature-based methods.

Table 5 Comparison with HotSpotter. The HesAffNet + HardNet feature descriptor is used for the SealID dataset when testing ALFRE-ID. The DISK feature descriptor without pattern extraction is used for the whale shark dataset

5 Conclusion

In this paper, a novel pipeline for patterned animal re-identification called Aggregated Local Features for Re-Identification (ALFRE-ID) was proposed. The pipeline utilizes modern deep learning-based local features and feature aggregation inspired by content-based image retrieval techniques. The full re-identification pipeline consists of image enhancement, animal instance segmentation, optional fur pattern extraction, feature extraction, feature aggregation, individual re-identification by database search, and geometric verification steps. The pipeline follows a modular approach where individual techniques can be changed to address differences between animal species. The main benefit of the proposed approach is that, by utilizing pretrained local feature descriptors, no labeled training data is needed to deploy the re-identification model to a new species. At the same time, powerful feature representations are obtained via feature aggregation, enabling re-identification accuracy comparable to end-to-end deep learning models that require a significantly larger amount of training data. This makes it possible to apply the pipeline to new animal species for which large-scale labeled databases are not available. We evaluated the method against other state-of-the-art data-driven and hand-crafted animal re-identification methods on two challenging datasets of Saimaa ringed seals and whale sharks. Our method clearly outperformed the competing methods under limited training data scenarios. As future work, we plan to apply and test our method on more animal species.