1 Introduction

Phytoplankton are key players in aquatic systems, where they mediate biogeochemical cycles and form the base of multiple food webs [1]. The dynamics of phytoplankton populations result from the interplay between resource availability and mortality losses [2]. While some loss mechanisms such as grazing are well known, the contribution of loss mechanisms like parasitism remains poorly considered and largely understudied in many aquatic systems. Phytoplankton are susceptible to a wide variety of parasites, such as viruses, bacteria, protists, and fungi. Such parasites can cause mortality of certain phytoplankton species, thereby altering the phytoplankton bloom dynamics and changing the cycling of matter and flow of energy in aquatic ecosystems [3,4,5].

Zoosporic or nanoflagellate parasites that infect phytoplankton comprise a highly diverse functional group of eukaryotic protist and fungal species [6]. They have in common the production of free-living motile stages as their infective propagules, which attach to a phytoplankton host cell and develop either inside (endobiotic) or outside (epibiotic) the host cell using host resources for their growth and reproduction. Due to their inconspicuous nature, phytoplankton parasites are difficult to identify, and objects that are difficult to identify typically tend to be overlooked or neglected. Consequently, although the presence and potential importance of these phytoplankton parasites are increasingly recognized, quantitative data of their occurrence in nature are extremely scarce.

An additional challenge in the study of phytoplankton parasites is the need to capture rapid infection dynamics on relevant temporal and spatial scales (e.g., days). Obtaining quantitative information about parasite infections using traditional methods is labor-intensive and time-consuming, which limits the spatial and/or temporal coverage of many studies investigating phytoplankton–parasite interactions [7].

Recent technological advances in imaging instruments have made it possible to collect large volumes of plankton image data for study of plankton populations, thus opening new research possibilities [8]. The possibility of high-frequency sampling enabled by imaging instruments can potentially result in better understanding of phytoplankton dynamics and their potential interactions with parasites [7]. However, while methods for automatic recognition of phytoplankton classes have been widely developed, methods for automatic recognition of phytoplankton parasite infections remain underdeveloped. The absence of an effective approach for parasitic infection recognition is likely associated with challenges related to obtaining sufficient volumes of image data of plankton parasites, which requires screening of huge amounts of raw image data. Such tasks are best addressed with automated solutions.

The scarcity of plankton parasite images is a major challenge for the development of deep learning-based computer vision methods for parasite detection. While object detection methods such as Faster R-CNN [9] and YOLO [10] have been shown to achieve high accuracy on various detection tasks, including parasite detection (see, e.g., [11]), they struggle when the amount of training data is limited. Therefore, a more promising approach is to formulate parasite detection as an anomaly detection task. Here, the idea is to train the model with images of healthy plankton and detect images that deviate from the data on which the model was trained. Due to the availability of large amounts of plankton image data without parasites for training and the relatively small intra-class variation among healthy samples, images that deviate notably from the training data can be expected to contain potential parasites.

This work investigates automated image-based phytoplankton parasite detection. The problem is formulated as an anomaly detection problem and solved using an autoencoder. The proposed method consists of a vector-quantized variational autoencoder (VQVAE) [12] that encodes the input image into a compressed latent representation and uses the compressed representation to reconstruct the original image. The rationale is that when the autoencoder is trained only on images of healthy phytoplankton, the autoencoder fails to reconstruct the parasites, which allows them to be detected from the difference image (see Fig. 1). The proposed method further employs the HardNet [13] feature extractor and Local Outlier Factor [14] to distinguish between healthy plankton and plankton with parasites.

Fig. 1 Anomalous sample of the Centrales plankton species: a original, b encoded space, c reconstruction, and d difference image

In the experimental part of the work, an extensive set of different backbone convolutional neural networks (CNNs), autoencoder architectures, feature extractors, and classifiers are systematically evaluated on challenging phytoplankton image data to find the best combination and to demonstrate the performance of the proposed method. In addition, we compare the autoencoder-based anomaly detection method to a Faster R-CNN-based object detector. The results show that the proposed method achieves comparable accuracy to the state-of-the-art Faster R-CNN object detector while requiring no images with parasites for training. Consequently, the autoencoder-based method can be considered a promising approach for utilization in plankton image analysis where the collection of large training data of plankton with parasites is infeasible.

The main contributions of our paper are the development of a novel anomaly detection framework and its application to phytoplankton parasite detection. High-frequency imaging data coupled with automatic presorting of potentially infected plankton make it possible to capture and quantify infection dynamics on relevant temporal and spatial scales. This is an essential step toward understanding the role of parasites in shaping phytoplankton community dynamics and ecosystem processes. The proposed framework is general and can be applied to other anomaly detection tasks such as industrial fault control.

2 Related work

Anomaly detection is a data classification technique in which a detector models the representation of samples within a specification (OK) and classifies all samples that deviate from the specification as anomalous (NOK). This problem can be challenging because of potentially high diversity within the NOK samples, imbalance between the number of samples in the OK and NOK groups, and irregularity of the NOK class. A comprehensive overview describing anomaly detection problems, techniques, and categorization is presented in [15].

First introduced for image data in [16], autoencoder (AE) models are now widely used in computer vision. The idea of exploiting the limited ability of an AE model to generalize to out-of-training data in order to detect anomalies in synthetic and real-world data (telemetry data in the case studied) was first demonstrated in [17]. The results showed that such AEs can be used to detect previously unseen anomalous samples. The concept was further enhanced and used on image data in, for example, [18] and [19]. A comprehensive overview of AE techniques can be found in [20].

Plankton anomaly detection has been previously studied in the context of open-set recognition, i.e., image classification with the presence of previously unseen classes (plankton species). In [21], the authors presented an unsupervised approach to classify a plankton sample and detect potential significant differences (i.e., anomalies) with respect to the detected class. Image features were extracted using classical computer vision methods utilizing geometrical, moment-based, and other traditional features.

Fig. 2 Processing pipeline of the proposed autoencoder-based anomaly detection method

Fig. 3 Schemes of the modified autoencoder cores: a BAE1 core, b VAE2 core

In [22], a CNN trained on OK samples and artificial NOK samples derived from the OK data by common data augmentation techniques such as blurring and noise addition was used as the feature extractor. An anomaly score was then computed from these features and used together with the trained feature extractor to distinguish between the OK plankton samples and anomalies. In the work, air bubbles and non-plankton water particles were considered as anomalies.

In [23], the authors used a parallel network of custom statistical classifiers called TailDeTect (TDT) to discover previously unseen plankton species. Each of the TDT classifiers was trained on one particular species, and a sample was considered as unknown if none of the classifiers was able to detect it. Unknown samples were collected and validated by experts. Feature extraction and the concept itself were based on work presented in [21].

In [24], open-set plankton recognition was addressed using a similarity learning approach. Metric learning with angular margin loss was applied to obtain image embedding vectors that model the similarity between images. The anomalies (images from previously unseen classes) were detected by setting threshold values for the similarity.

Faster R-CNN [9] is a popular deep learning (DL) algorithm that has been successfully applied to various domains and tasks, including object detection and anomaly detection [25]. Anomaly detection using Faster R-CNN involves training the model on abnormal images to learn the features of abnormal instances. Then, during classification, the model is used to detect abnormal samples that deviate from the expected outcome. For instance, in industrial manufacturing, abnormal behavior can include machine malfunctions, while in medical diagnosis, it can take the form of unusual patterns in medical images.

An example of anomaly detection is presented in [25], where an improved Faster R-CNN was used to detect defects in steel plates. The algorithm was trained on a dataset of abnormal regions on steel plate images and was able to accurately detect anomalies such as cracks and holes in test images. Using a similar approach, a subtle modification of Faster R-CNN for detection of anomalies in CT images of lungs was considered in [26].

Object detection methods have also been successfully used for parasite detection. For example, in [11], a YOLOv5 object detector is used to detect a parasitic mite on the body of a honey bee. An overview of other object detection techniques and commonly used datasets can be found, for example, in [27].

In plankton research, Faster R-CNN has been widely adopted for segmentation and object detection. Several object detection approaches, including Faster R-CNN, were utilized in [28] to evaluate a synthetically augmented dataset. Similar work is presented in [29], where a plankton dataset from a darkfield microscope was compiled and then tested with various object detection methods, including YOLOv3 [30], R-CNN [31], and SSD [32].

3 Proposed methods for phytoplankton anomaly detection

In this work investigating detection of phytoplankton samples with anomalies, we primarily employ an unsupervised autoencoder-based approach, followed by the use of different feature extractors and one-class classifiers. Supervised object detection based on the Faster R-CNN [9] is utilized to compare the results of our proposed method with a state-of-the-art approach.

3.1 Autoencoder-based approach

The proposed method for detecting anomalous plankton samples is built on the framework available in [33]. This implementation allows testing various combinations of AE architectures considered independently of the convolutional layers (i.e., fully connected AE, variational AE, and others), termed AE cores, together with convolutional layer architectures, feature extractors, and one-class classifiers. In the approach used in this paper, we combined five AE cores, six pairs of convolutional encoders and decoders, six feature extractors, and four classifiers (720 combinations in total). The processing pipeline is shown in Fig. 2 and described in more detail in the sections below.

The anomaly detection is based on comparison between the original data and the autoencoder-reconstructed data, followed by feature extraction and one-class classification.
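To make the size of the search space concrete, the following minimal sketch enumerates the evaluated combinations. The component names are taken from the text; the identifier spellings (for example `ErrorMetrics`) and the ordering of the tuple are illustrative assumptions, not the naming used by the framework in [33].

```python
from itertools import product

# Component names as they appear in the paper; spellings are illustrative.
AE_CORES = ["BAE1", "BAE2", "VAE1", "VAE2", "VQVAE1"]
CONV_PAIRS = [f"ConvM{i}" for i in range(1, 7)]
FEATURE_EXTRACTORS = ["ErrorMetrics", "SIFT",
                      "HardNet1", "HardNet2", "HardNet3", "HardNet4"]
CLASSIFIERS = ["RC", "OC-SVM", "IF", "LOF"]


def enumerate_combinations():
    """Return every (conv pair, AE core, feature extractor, classifier) tuple."""
    return list(product(CONV_PAIRS, AE_CORES, FEATURE_EXTRACTORS, CLASSIFIERS))


combinations = enumerate_combinations()
print(len(combinations))  # 5 cores * 6 conv pairs * 6 extractors * 4 classifiers
```

Each tuple corresponds to one trained and evaluated pipeline, which is why the results are reported per component in separate tables rather than exhaustively.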

3.1.1 Autoencoder architectures and convolutional layers

As the first step of anomaly detection, we use AE models trained only on OK data to reconstruct unknown input samples of both OK and NOK classes. On account of the non-optimal generalization of the AE models and training only on the OK class of data, we hypothesize that data from the NOK class will be reconstructed worse than data from the OK class, as described in [17].

To better understand the effect of the AE architecture’s core and the complexity of the convolutional encoding and decoding layers, we decided to build our implementation such that the core of the model could be combined with the selected convolutional pairs of the encoders and the decoders. The proposed structure allows us to analyze the contributions of the selected architecture and convolutional layers separately.

We evaluated five different options for the AE cores. First, we used implementations of the basic convolutional AE [34] as the BAE1 core, the convolutional variational AE [35] as the VAE1 core, and the vector-quantized AE [12] as the VQVAE1 core. In addition to these cores, we tried to further reduce the features extracted by the encoder by inserting fully connected layers into the basic convolutional AE as the BAE2 core [36] and into the variational AE as the VAE2 core. These modifications to the autoencoder cores are shown in Fig. 3.

We expect the basic convolutional AE to be surpassed in performance by both the variational and the vector-quantized cores because its non-probabilistic encoding space allows more of the anomalous parts of the input image to leak into the encoded space and the reconstructed image. The quality of the reconstructed images should be better in the case of the basic and vector-quantized cores than with variational cores, which typically produce blurry outputs [37]. When training on different classes, the best results are expected from the vector-quantized core, which should create separable clusters for each class in the encoded space.
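The quantization step that distinguishes the vector-quantized core can be illustrated in a few lines. The sketch below shows only the nearest-codebook lookup in its general form [12]; it omits training (codebook and commitment losses, gradient pass-through), and the function name, array shapes, and toy codebook are assumptions made for illustration.

```python
import numpy as np


def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    latents:  array of shape (N, D), encoder outputs (one row per position)
    codebook: array of shape (K, D), the discrete set of embedding vectors
    Returns the chosen indices (N,) and the quantized vectors (N, D).
    """
    latents = np.asarray(latents, dtype=np.float64)
    codebook = np.asarray(codebook, dtype=np.float64)
    # Squared Euclidean distance between every latent and every code.
    dist2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dist2.argmin(axis=1)
    return indices, codebook[indices]


# Toy example with a two-entry codebook in a 2-D latent space.
toy_codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
toy_latents = np.array([[0.1, -0.1], [0.9, 1.2]])
toy_indices, toy_quantized = quantize(toy_latents, toy_codebook)
```

Because the decoder only ever sees codebook entries, latent content that does not resemble anything seen during training is snapped to the nearest learned code, which is one intuition for why anomalous structures may be suppressed in the reconstruction.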

Besides the AE cores described above, we also consider six pairs of convolutional encoding and decoding layer architectures, whose structure is described in the complementary tables: Table 5 for encoders and Table 6 for decoders. Each convolutional layer or block described in these tables is complemented with a batch normalization layer. The activation function was set to Leaky ReLU for the ConvM1 architecture and to ReLU for the other architectures.

The tested convolutional layers range from the more complex ConvM2 architecture, suggested for anomaly detection in [19], and the ConvM1 architecture, from which we expect the ability to reconstruct fine features and details, to the simpler architectures ConvM5, ConvM4 and ConvM3. By using the simpler architectures, we expect that fine features and smaller image structures will be suppressed and the architecture might thus perform better on shape or structure anomalies. The last architecture, ConvM6, is asymmetric, as suggested in [38], and uses the more complex encoder of the ConvM5 architecture and the simpler decoder of the ConvM4 architecture. Using this architecture, we expect that anomalies that are propagated to the encoded space will be further suppressed by the decoder reconstruction.

In the optimal case, anomalous areas of the original image are removed during the image reconstruction as shown in Fig. 11. A difference image between the original sample and the reconstructed sample is then computed and used in the feature extraction.
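As a minimal sketch of this step, the difference image can be computed pixel-wise. The text does not state whether a signed or an absolute difference is used; the absolute difference below is an assumption, as are the function name and the toy arrays.

```python
import numpy as np


def difference_image(original, reconstruction):
    """Pixel-wise absolute difference between an image and its reconstruction.

    Assumption: both inputs have the same shape and value range.
    """
    original = np.asarray(original, dtype=np.float32)
    reconstruction = np.asarray(reconstruction, dtype=np.float32)
    if original.shape != reconstruction.shape:
        raise ValueError("original and reconstruction must have the same shape")
    return np.abs(original - reconstruction)


# Toy case: a 2x2 bright patch that the autoencoder fails to reconstruct.
toy_original = np.zeros((8, 8), dtype=np.float32)
toy_original[2:4, 2:4] = 1.0
toy_reconstruction = np.zeros((8, 8), dtype=np.float32)
toy_diff = difference_image(toy_original, toy_reconstruction)
```

In the idealized toy case, the difference image is nonzero only where the unreconstructed patch was located, which is the signal the subsequent feature extraction relies on.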

Fig. 4 Example feature space of the Aphanizomenon plankton species

3.1.2 Feature extraction

The second step of the framework applies feature extractors to analyze the reconstructions. The features are based on comparison between the original and reconstructed data (Error metrics, HardNet3 and HardNet4) or analysis of the difference image (SIFT feature extraction, HardNet1 and HardNet2).

The first feature extraction approach (Error metrics) creates a low-dimensional feature vector for each image by computing selected error metrics between the original and reconstructed images. The L2 and SSIM metrics applied in [36] are complemented with the Average hash and mean-squared error metrics.
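The following sketch illustrates how such a low-dimensional feature vector could be assembled. It covers the L2 distance, the mean-squared error, and a simple block-mean average hash compared by Hamming distance; SSIM is omitted to keep the example dependency-free. The hash size of 8 and all function names are assumptions and are not taken from [36].

```python
import numpy as np


def average_hash(img, hash_size=8):
    """Block-mean average hash: downscale to hash_size x hash_size cells and
    threshold each cell at the mean of all cells. Returns a flat boolean array."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    h2, w2 = h - h % hash_size, w - w % hash_size  # crop to a multiple of hash_size
    cells = img[:h2, :w2].reshape(hash_size, h2 // hash_size,
                                  hash_size, w2 // hash_size).mean(axis=(1, 3))
    return (cells > cells.mean()).ravel()


def error_metric_features(original, reconstruction):
    """Feature vector [L2 distance, MSE, average-hash Hamming distance]."""
    o = np.asarray(original, dtype=np.float64)
    r = np.asarray(reconstruction, dtype=np.float64)
    diff = o - r
    l2 = np.sqrt((diff ** 2).sum())
    mse = (diff ** 2).mean()
    hash_distance = np.count_nonzero(average_hash(o) != average_hash(r))
    return np.array([l2, mse, float(hash_distance)])


# Toy case: a 4x4 patch present in one image and absent in the other.
img_a = np.zeros((32, 32))
img_b = img_a.copy()
img_b[0:4, 0:4] = 1.0
features_same = error_metric_features(img_a, img_a)
features_diff = error_metric_features(img_a, img_b)
```

Identical images yield an all-zero vector, while the toy patch changes all three entries; a one-class classifier then operates on vectors of this kind.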

The second feature extraction method (SIFT feature extraction) uses scale and metrics properties of the image keypoints found by the SIFT method. It is a direct re-implementation of the approach presented in [39]. The method uses difference images between the original and reconstructed data.

The last four feature extraction methods (HardNet1, HardNet2, HardNet3 and HardNet4) are all based on the batch similarity metric presented in [13]. HardNet1 is the simplest method, where each sample is described by the HardNet (HN) feature vector of the original image resized to \(32\times 32\) as required by the original HN implementation. Since such resizing might not be optimal for small anomalies, HardNet2 splits the image of the original size into blocks of \(32\times 32\) and computes the HN feature vector for each such block. The resulting feature vector consists of the norms over those vectors. HardNet3 splits the original and reconstructed images into \(32\times 32\) blocks as in the HardNet2 method, but the resulting feature vector is computed as a cosine similarity between the HN feature vectors of the corresponding blocks of the original and decoded images. HardNet4 uses the same technique, but the cosine similarity is supplemented by the logarithm, which is supposed to emphasize smaller differences of the HardNet3 feature vector.
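The block-wise variants can be sketched as follows. Since the real HardNet descriptor requires a pretrained network, the example substitutes a fixed random projection to 128 dimensions as a stand-in; this placeholder, the function names, and the small epsilon in the cosine similarity are all assumptions. The logarithmic variant (HardNet4) is not shown because its exact form is not specified in the text.

```python
import numpy as np

BLOCK = 32

# Stand-in for the pretrained HardNet network: a fixed random projection
# from a flattened 32x32 block to a 128-D vector. NOT the real descriptor.
_PROJECTION = np.random.default_rng(0).standard_normal((BLOCK * BLOCK, 128))


def split_into_blocks(img, block=BLOCK):
    """Split a 2-D image into non-overlapping block x block tiles (row-major)."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    if h % block or w % block:
        raise ValueError("image sides must be multiples of the block size")
    tiles = img.reshape(h // block, block, w // block, block).swapaxes(1, 2)
    return tiles.reshape(-1, block, block)


def toy_descriptor(blocks):
    """Placeholder descriptor: one 128-D vector per block."""
    return blocks.reshape(len(blocks), -1) @ _PROJECTION


def hardnet2_like(img):
    """One feature per block: the norm of the block descriptor."""
    return np.linalg.norm(toy_descriptor(split_into_blocks(img)), axis=1)


def hardnet3_like(original, reconstruction, eps=1e-12):
    """One feature per block: cosine similarity between the descriptors of
    corresponding blocks of the original and reconstructed images."""
    a = toy_descriptor(split_into_blocks(original))
    b = toy_descriptor(split_into_blocks(reconstruction))
    numerator = (a * b).sum(axis=1)
    denominator = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return numerator / denominator


demo_img = np.random.default_rng(1).random((64, 128))
demo_norms = hardnet2_like(demo_img)
demo_similarity = hardnet3_like(demo_img, demo_img)
```

For a \(64\times 128\) input this yields eight blocks and therefore eight features per sample, which also illustrates why the image sides need to be multiples of 32.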

A 2D visualization of the resulting feature space obtained by the ConvM5-BAE2 autoencoder over the Aphanizomenon plankton species using the HardNet2 feature extractor is shown in Fig. 4. The OK samples form an elliptical cluster, and most of the NOK samples are separate from that cluster.

3.1.3 One-class classification

For the classification part, we used the following one-class classifiers:

  • Robust covariance (RC) [40]: The RC classifier assumes the same distribution for all OK samples and fits an elliptic envelope to the central data points. The anomaly score is computed using the distribution estimations and Mahalanobis distance.

  • One-class SVM (OC-SVM) [41]: The OC-SVM classifier utilizes the support vector machine (SVM) and a nonlinear kernel to create a separating hyperplane of the training data from the origin of the feature space. Samples on the other side of this hyperplane are considered as anomalies.

  • Isolation Forest (IF) [42]: The IF classifier uses random feature selection and splitting to isolate observed samples. The anomaly score is based on the total number of splits. Anomalies are supposed to have a smaller number of splits as it should be easier to separate them.

  • Local Outlier Factor (LOF) [14]: The LOF classifier is based on the local density deviation of the observed point with respect to its k-nearest neighbors. The density of the anomalies should be lower in comparison with the OK samples, which are considered to create denser clusters.

The fraction of anomaly samples for the OC-SVM, IF and LOF was set to 1% since this value is the minimum value of common implementations. Based on the normal distribution, we should also assume that even some OK samples might slightly differ from the majority. All classifiers are fit on the dataset containing only OK samples.

Input features for the one-class classification are normalized using robust scaling, which normalizes the median and the interquartile range, as suggested in [43]. This normalization should be more robust to outliers than simple normalization approaches such as min–max normalization or standardization.
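The classifier setup described above maps naturally onto scikit-learn estimators. The following sketch is one possible configuration, assuming `EllipticEnvelope` for RC, an RBF kernel for the OC-SVM, and `nu` as the OC-SVM counterpart of the 1% anomaly fraction; the synthetic two-dimensional features are placeholders for real extracted features.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import OneClassSVM

ANOMALY_FRACTION = 0.01  # 1% as stated in the text


def build_one_class_models(random_state=0):
    """Robust scaling (median / IQR) followed by each one-class classifier."""
    return {
        "RC": make_pipeline(
            RobustScaler(),
            EllipticEnvelope(contamination=ANOMALY_FRACTION,
                             random_state=random_state)),
        "OC-SVM": make_pipeline(
            RobustScaler(),
            OneClassSVM(kernel="rbf", nu=ANOMALY_FRACTION)),
        "IF": make_pipeline(
            RobustScaler(),
            IsolationForest(contamination=ANOMALY_FRACTION,
                            random_state=random_state)),
        "LOF": make_pipeline(
            RobustScaler(),
            # novelty=True is required to score samples unseen during fit
            LocalOutlierFactor(novelty=True, contamination=ANOMALY_FRACTION)),
    }


# Synthetic stand-ins: a compact OK cluster and two distant NOK points.
rng = np.random.default_rng(0)
ok_features = rng.normal(size=(500, 2))
nok_features = np.array([[8.0, 8.0], [-9.0, 7.0]])

models = build_one_class_models()
for name, model in models.items():
    model.fit(ok_features)               # fit on OK samples only
    print(name, model.predict(nok_features))  # -1 marks a predicted anomaly
```

All four pipelines expose `decision_function`, so a continuous anomaly score is available for threshold selection in addition to the default `predict` labels.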

To select the optimal decision threshold for anomaly detection, we use the equal error rate (EER) over the ROC curve of the classifier as shown in Fig. 5. All classifiers are fit only on the OK data, and the ROC curve is obtained from the test dataset.
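A minimal sketch of the EER criterion is given below, using `sklearn.metrics.roc_curve`. It assumes that anomalies are labeled 1 and that higher scores indicate a more anomalous sample; the EER operating point itself does not depend on which class is treated as positive, since it is the point where the two error rates coincide. The function name and toy data are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve


def eer_threshold(y_true, anomaly_scores):
    """Return (threshold, EER) at the ROC point where the false-positive rate
    is closest to the false-negative rate."""
    fpr, tpr, thresholds = roc_curve(y_true, anomaly_scores, pos_label=1)
    fnr = 1.0 - tpr
    idx = int(np.argmin(np.abs(fpr - fnr)))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return float(thresholds[idx]), float(eer)


# Toy, perfectly separable scores: 0 = OK, 1 = NOK.
toy_labels = np.array([0, 0, 0, 1, 1, 1])
toy_scores = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
toy_threshold, toy_eer = eer_threshold(toy_labels, toy_scores)
```

Samples whose score is greater than or equal to the returned threshold are then classified as anomalous. On a finite test set the two error rates rarely match exactly, so the nearest ROC point is used.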

Fig. 5 Illustration of equal-error-rate (EER) threshold selection criterion on the ROC curve

Fig. 6 Faster R-CNN architecture

Fig. 7 The Faster R-CNN approach for anomaly detection

Fig. 8 Object detection tasks using the Faster R-CNN approach: a Plankton versus Anomalies; b Plankton versus Anomalous Plankton; c Anomalies only. The NOK samples are shown in the top row and the OK samples in the bottom row

3.2 Object detection-based approach

The Faster R-CNN [9] algorithm is composed of three main components: a base feature extractor network, a region proposal network (RPN) for extracting the regions of interest, and a detector that uses the region proposals and respective feature maps to classify the detected objects as shown in Fig. 6. The first component is the feature extractor responsible for generating feature maps from the input image. This module is usually a CNN such as VGG-16 [44] or ResNet-50 [45].

The RPN is a kind of fully convolutional network that takes the feature maps from the previous step and returns a set of region proposals that guide the detector on where to find the objects in the image. The proposals and corresponding feature maps from the CNN are then utilized to yield candidate objects with bounding boxes and fixed-length feature vectors using the ROI pooling layer. Finally, these outputs are passed to the R-CNN network. The R-CNN network uses the proposed feature maps to classify each bounding box as an object or background and predict final class scores with the bounding boxes.

For our object detection experiments, we used the Faster R-CNN implementation available from [46] based on the ResNet-50 backbone presented in [47]. To employ an anomaly detection task in the Faster R-CNN baseline, the architecture is supplemented by a one-class classification module based on the predicted object labels, as shown in Fig. 7.
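One simple way to realize such a module is to flag a sample as NOK whenever the detector outputs at least one sufficiently confident anomaly-type label. The sketch below is a hypothetical illustration of this idea; the score threshold of 0.5, the label matching rule, and the function name are assumptions and do not describe the module shown in Fig. 7 in detail.

```python
def classify_sample(detections, score_threshold=0.5):
    """Turn object detector outputs into an OK / NOK decision.

    detections: iterable of (label, score) pairs predicted for one image.
    A sample is NOK if any anomaly-type label reaches the score threshold.
    """
    for label, score in detections:
        is_anomaly_label = label == "Anomaly" or label.endswith("_Anomaly")
        if is_anomaly_label and score >= score_threshold:
            return "NOK"
    return "OK"


# Hypothetical detector outputs for two images.
clean_image = [("PlanktonSpecies_Clean", 0.93)]
infected_image = [("PlanktonSpecies_Clean", 0.88), ("Anomaly", 0.81)]
print(classify_sample(clean_image), classify_sample(infected_image))
```

The same rule covers all three detector variants considered in this work, since each of them emits at least one anomaly-type label for infected samples.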

Table 1 Species-specific statistics of the plankton anomaly dataset
Fig. 9 Anomalous (left column, or upper row) and non-anomalous (right column, or lower row) samples from all classes of the dataset

Since anomalies such as parasites are relatively small compared to the image size, it is important to consider the anchor generator which is a part of the region-proposal network. Anchors define regions of an image, usually of different aspect ratios and sizes, that are used as references to detect objects. The anchor generator creates a set of anchors for each location in a feature map; then, for each region of interest, the model predicts which anchor box best encloses the object. The choice of an anchor generator mostly depends on the type of detection task. For example, if we want to detect small objects, then a smaller anchor size should be used. On the other hand, if the task is to detect objects of various sizes, a range of anchor sizes should be defined [9]. Additionally, the aspect ratios of the anchors should match the aspect ratios of the objects in the image.
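To illustrate how anchor sizes and aspect ratios combine, the sketch below generates the anchor shapes placed at a single location. It assumes the common convention that the aspect ratio is height divided by width and that the anchor area equals the squared size; the example values are those reported for the Anomalies only experiment later in the paper, and the function name is hypothetical.

```python
import math


def make_anchor_shapes(sizes, aspect_ratios):
    """Return (width, height) pairs for every size / aspect-ratio combination.

    Convention (assumed): ratio = height / width, area = size ** 2.
    """
    shapes = []
    for size in sizes:
        for ratio in aspect_ratios:
            height = size * math.sqrt(ratio)
            width = size / math.sqrt(ratio)
            shapes.append((width, height))
    return shapes


# Small anchors suited to small objects such as parasites.
small_object_anchors = make_anchor_shapes(
    sizes=(4, 8, 16, 32, 64, 128),
    aspect_ratios=(0.5, 1.0, 1.5, 2.0, 3.0),
)
print(len(small_object_anchors))  # 6 sizes * 5 ratios
```

Shifting the size range downward in this way keeps the number of anchors per location constant while matching the reference boxes to the expected object scale.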

As suggested in [11], three separate object detectors are considered, each trained on different ground truth: (1) plankton and anomalies, (2) plankton (clean) and anomalous plankton, and (3) anomalies only (see Fig. 8). In the first column, we can see that the model detects a plankton sample in both cases and an anomaly in the top row. The second column shows detection of a plankton sample with anomaly in the top row and detection of a clean sample in the bottom row, and finally, the third column shows detection of an anomaly in the top row only.

4 Experiments

In this section, we describe the datasets used, the evaluation metrics, and the results of the autoencoder-based experiments and the object detection-based experiments.

4.1 Phytoplankton anomaly dataset

Natural Baltic Sea phytoplankton communities are continuously imaged with an Imaging FlowCytobot (IFCB) [48] deployed at Utö Atmospheric and Marine Research Station, Finland (59\(^{\circ }\)46.84’ N, 21\(^{\circ }\)22.13’ E). The IFCB is connected to the station flow-through system, which receives water pumped from a \(\sim \)5 m deep inlet located 250 m offshore, representative of the sub-surface layer. At Utö, the IFCB takes a 5-ml sample approximately every 20 min, and the system is set to trigger based on the detection of chlorophyll, i.e., targeting phytoplankton cells rather than non-living particles. The research station and IFCB deployment at Utö are described in detail in [49] and [50].

The phytoplankton data from Utö IFCB can be currently classified near real-time into 50 different classes, as described by [51]. Putative parasite infection images were manually annotated by experts based on other Utö data, collected between February and August 2021, using phytoplankton data from nine classes. These classes were selected based on their importance during the spring or summer blooms in the Baltic Sea.

In our experiments, we used a phytoplankton anomaly dataset derived from the annotated images used to train the classifier described above, with OK samples from the dataset published in [51] and NOK samples from unpublished 2021 Utö data. The anomaly dataset contains over 6200 manually annotated and expert-validated samples for 9 plankton classes with known anomalies, as shown in Table 1. Non-anomalous and anomalous samples of each class are shown in Fig. 9. As an annotation tool, we used the free version of Label Studio, available at [52]. The annotated dataset is available online at [53] in both COCO and YOLO formats.

4.1.1 Dataset annotations

When annotating the dataset, we used three different labels to define a separate species set:

  • The label Anomaly marks the parasite or other anomalies on the plankton sample.

  • The label PlanktonSpecies_Anomaly marks plankton species with the attached parasite.

  • The label PlanktonSpecies_Clean marks plankton species with or without the parasite.

The last two labels could overlap, but whenever it was possible, the PlanktonSpecies_Clean label does not cover the sample part with parasite. To distinguish between the OK and NOK samples, the PlanktonSpecies_Clean label should be removed if it overlaps with the PlanktonSpecies_Anomaly one.
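The label resolution rule can be sketched as a small post-processing step. The annotation structure (a dictionary with `label` and `box` keys, boxes given as `(x_min, y_min, x_max, y_max)`) and the function names are assumptions made for illustration and do not reflect the COCO or YOLO export format of the dataset.

```python
def boxes_overlap(a, b):
    """True if two (x_min, y_min, x_max, y_max) boxes share any area."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2


def resolve_labels(annotations):
    """Drop PlanktonSpecies_Clean boxes that overlap a PlanktonSpecies_Anomaly box."""
    anomalous = [a for a in annotations
                 if a["label"] == "PlanktonSpecies_Anomaly"]
    kept = []
    for ann in annotations:
        if ann["label"] == "PlanktonSpecies_Clean" and any(
                boxes_overlap(ann["box"], other["box"]) for other in anomalous):
            continue  # the overlapping clean label is removed
        kept.append(ann)
    return kept


def sample_class(annotations):
    """A sample is NOK if any anomaly-type label remains, otherwise OK."""
    labels = {a["label"] for a in resolve_labels(annotations)}
    return "NOK" if labels & {"Anomaly", "PlanktonSpecies_Anomaly"} else "OK"


# Hypothetical annotations for one infected and one clean sample.
infected = [
    {"label": "PlanktonSpecies_Clean", "box": (0, 0, 100, 40)},
    {"label": "PlanktonSpecies_Anomaly", "box": (60, 0, 120, 40)},
    {"label": "Anomaly", "box": (90, 10, 110, 30)},
]
clean = [{"label": "PlanktonSpecies_Clean", "box": (0, 0, 100, 40)}]
```

After resolution, the infected sample keeps only its two anomaly-type labels, so the OK and NOK groups can be separated by label alone.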

An example of the annotation over a Dolichospermum plankton species sample is shown in Fig. 10. The red color marks a plankton anomaly and, in this case, the darker green marks the clean sample and the lighter green marks the sample with an anomaly.

Fig. 10 Example of the annotation bounding boxes

4.1.2 Derived dataset for autoencoder-based experiment

For the purposes of the autoencoder-based experiment, we used the above-described dataset to derive a one-class dataset with no NOK samples and 70% of the OK samples in the training set. Test and validation datasets always contain a balanced number of OK and NOK samples. The experiment with all plankton species contains all available training samples, 10 validation samples, and 10 test samples from each species.

In order to help the AE model to learn more robust features, we added salt-and-pepper noise to the image samples used during the training with a clean sample used as a label as suggested in [54]. Besides this noise augmentation, we also use random flipping, contrast, saturation, brightness, inversion, and hue augmentation.
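The noise augmentation follows the denoising autoencoder idea: the network receives a corrupted image as input and the clean image as the target. The sketch below shows one way to generate salt-and-pepper noise; the noise amount of 5%, the assumption that pixel values lie in \([0, 1]\), and the function name are illustrative choices that are not specified in the text.

```python
import numpy as np


def salt_and_pepper(image, amount=0.05, rng=None):
    """Return a copy of `image` with a fraction `amount` of pixels set to
    0 (pepper) or 1 (salt), in roughly equal proportions.

    Assumption: pixel values are floats in [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = np.array(image, dtype=np.float32, copy=True)
    u = rng.random(noisy.shape)
    noisy[u < amount / 2] = 0.0        # pepper
    noisy[u > 1 - amount / 2] = 1.0    # salt
    return noisy


# Training pair for a denoising autoencoder: (noisy input, clean target).
clean_sample = np.full((100, 100), 0.5, dtype=np.float32)
noisy_sample = salt_and_pepper(clean_sample, amount=0.05,
                               rng=np.random.default_rng(0))
training_pair = (noisy_sample, clean_sample)
```

The clean image is left untouched, so the reconstruction loss is always computed against the uncorrupted sample.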

Because the HardNet-based feature extractors work correctly only with image sizes of multiples of 32, all samples were resized with respect to the major aspect ratio of each class (1:4 for five classes, 1:1 for three classes and 1:2 for one class) as can be seen in Fig. 9. For the experiment over all classes, we chose the aspect ratio of 1:2 as a compromise.

4.1.3 Derived dataset for object detection-based experiment

For the object detection experiment, the model was trained in a supervised manner. The split ratios were set as 70%, 10%, and 20% for the training, validation, and test subsets, respectively. Training and validation sets do not include clean samples, whereas the test set contains a balanced number of anomalies and clean images.

Additionally, we applied the following augmentation techniques: horizontal and vertical flip with a probability of 30%, and random brightness, contrast and saturation adjustment with a probability of 10%.

4.2 Performance metrics

To compare the results of the autoencoder and object detection experiments, we need to evaluate the predictions of the models with respect to the ground-truth labels. To do so, we define true-positive (TP) and true-negative (TN) predictions, where the model correctly classifies OK and NOK samples, respectively, together with false-positive (FP) predictions, where the model misclassifies NOK samples as OK, and false-negative (FN) predictions, where the model misclassifies OK samples as NOK.

Precision, Recall and F1 score metrics are used for comparison of the different variations of autoencoders and object detection methods. The metrics are defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$
$$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$

Similarly to precision and recall, we can also define specificity as:

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{4}$$
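As a worked example of Eqs. (1)–(4), the following sketch computes all four metrics from the confusion counts. Returning 0.0 when a denominator is zero is a convention chosen here for convenience and is not taken from the paper; the counts in the example are made up.

```python
def detection_metrics(tp, fp, tn, fn):
    """Precision, recall, F1 and specificity from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "specificity": specificity}


# Made-up balanced test set: 10 samples per class, 2 errors of each kind.
example_metrics = detection_metrics(tp=8, fp=2, tn=8, fn=2)
print(example_metrics)
```

With these counts, every metric evaluates to 0.8, which makes the example easy to verify by hand against the formulas.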

In the autoencoder experiment, we complemented the metrics with the area under the curve (AUC) score. This parameter is defined as the area under the receiver operating characteristic (ROC) curve, an example of which is shown in Fig. 5. This curve is obtained by changing the decision threshold of a binary classifier by a defined step and plotting 1 − specificity (the false-positive rate) on the x-axis and recall on the y-axis for each threshold step. Each point of the ROC curve then corresponds to one threshold setting.
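The construction of the ROC curve and the AUC can be sketched directly with NumPy. In this sketch the thresholds are taken at every distinct score rather than at a fixed step, the class labeled 1 is treated as positive, and the area is computed with the trapezoidal rule; these choices and the function names are assumptions made for illustration.

```python
import numpy as np


def roc_points(y_true, scores):
    """Sweep the decision threshold over all distinct scores and return the
    false-positive rate (x-axis) and recall (y-axis) at each threshold."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    n_pos, n_neg = y_true.sum(), (~y_true).sum()
    fpr, tpr = [], []
    for t in thresholds:
        predicted_positive = scores >= t
        tpr.append((predicted_positive & y_true).sum() / n_pos)
        fpr.append((predicted_positive & ~y_true).sum() / n_neg)
    return np.array(fpr), np.array(tpr)


def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Small example with one misordered pair of scores.
example_fpr, example_tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
example_auc = area_under_curve(example_fpr, example_tpr)
```

The curve always starts at (0, 0) and ends at (1, 1); a perfectly separating score yields an AUC of 1, whereas the single misordered pair in the example lowers it to 0.75.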

Table 2 Species-specific results with the optimal combination anomaly detection model over all plankton species based on the highest F1-score (ConvM2-VQVAE1, HardNet1, Local Outlier Factor)

4.3 Autoencoder-based approach

Due to the high number of combinations in our framework (five autoencoder types, six pairs of convolutional layers, six feature extractors and four classifiers), we decided to split the experimental results into two parts. The first part describes the optimal combination of model, feature extractor and classifier trained on all datasets together with its selection criteria, and the second part describes the optimal results achieved per plankton species. The optimal combination of autoencoder model, convolutional layers, feature extractor and one-class classifier was determined based on the maximum-achieved F1 score because the detection results appeared to be more consistent in comparison with the AUC metric. This metric is also advantageous because of the easier comparison with the results of the object detector-based approach. The whole implementation was built using the TensorFlow 2 platform [55] and Scikit-learn library [56].

4.3.1 Optimal model for anomaly detection

To select an optimal anomaly detection model, we analyzed the results of all model combinations for the experiment with all plankton species. The best results were achieved with the autoencoder ConvM2-VQVAE1, HardNet1 as the feature extractor, and Local Outlier Factor as the one-class classifier. These results are reported in Table 2, and illustrative examples of the results are shown in Fig. 11.

To further demonstrate that the selected anomaly detection model performs best, we evaluated the F1 score over: (1) model combinations with a fixed feature extractor and classifier (Table 7), (2) feature extractor combinations with a fixed model and classifier (Table 10), and (3) classifier combinations with a fixed model and feature extractor (Table 11). The highest F1 score was 0.75 consistently over all described combinations.

4.3.2 Optimal anomaly detection model per plankton class

Results of the optimal anomaly detection models per plankton class are shown in Table 12. The detection results are approximately 10% better than when using the optimal anomaly detection model trained on all datasets described above, which could be particularly important for the Centrales and Chaetoceros, which showed the lowest values on the performance metrics. This performance gain nevertheless comes at the cost of the need for a separately trained anomaly detection model for each plankton class.

4.4 Object detection-based approach

Results of the Plankton versus Anomalies, Plankton versus Anomalous Plankton and Anomalies only experiments over all samples are shown in Table 3. The Plankton versus Anomalies experiment contains both large bounding boxes of plankton annotations and small bounding boxes of anomalies. Therefore, the anchor generator was set up so that the anchor sizes for the feature maps were 16, 32, 64, 128, 256 and 512. In the Plankton versus Anomalous Plankton experiment, only the large bounding boxes were used, and the sizes were 64, 128, 256, 512 and 1024. For the Anomalies only experiment, the sizes were 4, 8, 16, 32, 64 and 128. The scales and aspect ratios were the same in every experiment: 0.5, 1.0, 1.5, 2.0 and 3.0.
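To illustrate what such an anchor configuration implies, the sketch below derives the anchor shapes for the Plankton versus Anomalies sizes. It assumes a common convention in which the aspect ratio is height divided by width and every anchor keeps the area size × size, and it treats the listed values as aspect ratios only. The helper `anchor_shapes` is hypothetical and is not part of any detection library.

```python
import math


def anchor_shapes(sizes, aspect_ratios):
    """Return one (width, height) pair per size and aspect-ratio combination.

    Assumed convention: aspect ratio = height / width, and each anchor
    preserves the area size * size.
    """
    shapes = []
    for size in sizes:
        for ratio in aspect_ratios:
            width = size / math.sqrt(ratio)
            height = size * math.sqrt(ratio)
            shapes.append((width, height))
    return shapes


sizes = [16, 32, 64, 128, 256, 512]   # Plankton versus Anomalies experiment
ratios = [0.5, 1.0, 1.5, 2.0, 3.0]    # values listed in the text
shapes = anchor_shapes(sizes, ratios)

print(len(shapes))  # 6 sizes * 5 ratios = 30 anchor shapes
for width, height in shapes[:5]:
    print(f"{width:.1f} x {height:.1f}")
```

Under this convention, the smallest size (4) used in the Anomalies only experiment gives anchors of only a few pixels per side, which matches the small bounding boxes of the anomalies.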

The highest F1 score was achieved with the Plankton versus Anomalies experiment and the lowest one with the Anomalies only experiment. We were able to reach a high F1 score also in the Plankton versus Anomalous Plankton experiment, but the resulting plankton labels were often misleading in this case.

For the object detection approach, one major issue is that the model is incapable of distinguishing between plankton parts and anomalies, as shown in Fig. 12. This major drawback of the Faster R-CNN object detection method originates from the architecture itself and could not be solved by anchor modifications or other parameter tuning.

Fig. 11 Illustration of the results with the ConvM2-VQVAE1 autoencoder model: a original, b encoded space, c reconstruction and d difference image

To allow a better comparison with Table 12, we also performed an Anomalies only experiment trained on species-specific data; the results are shown in Table 4. In this experiment, the approach performed worse than the species-specific autoencoder experiment and, on average, also worse than the universal model.

Table 3 Faster R-CNN detection results

Fig. 12 Plankton species without any anomalies recognized as an anomaly by Faster R-CNN. The model is confused by a chain of cells and cannot distinguish between plankton parts and anomalies

Table 4 Species-specific dataset Faster R-CNN detection results

We also provide supplementary material as Table 14, Table 13 and Table 15, which show the results of the Plankton versus Anomalies, Plankton versus Anomalous Plankton and Anomalies only experiments with respect to the individual plankton class.

5 Discussion

Learning to detect parasites from phytoplankton images is a challenging problem due to the large variation in the appearance of the plankton cells and parasites, the small size of parasites combined with the limited spatial resolution of the images, and the scarcity of training data caused by the relative rarity of plankton cells with parasites. Even for an expert, it is often impossible to confirm with certainty the parasitic nature of all the attached (non-host) structures from the images alone. For example, spherical structures that are attached to the host cell and differ in appearance from the phytoplankton cell are typically parasites, but they can also be loosely attached free-living (i.e., non-parasitic) cells or phytoplankton-cell-derived organelles expelled from the cell due to stress. The complexity of those structures makes it infeasible to collect annotated training and test data on phytoplankton parasites alone. Therefore, we formulated the problem as anomaly detection, where the goal is to detect the phytoplankton cells that deviate from “healthy” cells. The method can be seen as an anomaly detector of putative parasite infections that allows screening of large volumes of plankton image data to obtain a subset of interesting images for further analysis.

A large dataset of phytoplankton cell images with and without anomalies was collected for the study. Unusually for anomaly detection studies, the collected dataset contains a relatively large number of images with putative parasites (NOK samples). This made it possible to also train supervised object detectors for the task and to compare unsupervised anomaly detection methods with object detector-based methods. We evaluated two approaches to detect whether an image contains anomalies or not: (1) an autoencoder-based anomaly detection approach and (2) a Faster R-CNN-based object detector for anomalies.

For the autoencoder-based approach, a full pipeline consisting of a CNN-based autoencoder architecture, feature extraction from the reconstruction and difference images, and a one-class classifier was proposed. Various methods were considered for each part of the pipeline and extensively evaluated. The best overall accuracy (F1 score) was obtained using the combination of the vector-quantized variational autoencoder (VQVAE) [12] architecture with the CNN backbone by [19], the HardNet feature extractor [13] and the Local Outlier Factor classifier [14]. The F1 score for the best combination varied between classes from 0.6 (Centrales) to 0.83 (Peridiniella Single), with an overall F1 score of 0.75 over all classes. The ablation study (Tables 8, 9, 10 and 11) demonstrated the superiority of the proposed combination over the other alternatives. The accuracy could be further improved by optimizing the method for each phytoplankton class separately, as can be seen from Table 12, with F1 scores varying from 0.73 to 0.94. While fine-tuning class-specific models reduces the generalizability of the method, these are promising results for studying parasites on individual plankton species.

The limited amount of training data is a notable challenge for supervised object detectors. To properly learn the large variation in the appearance of anomalies, the training stage would require a sufficient number of example images for each class. The difficulty of the detection task is further emphasized by the fact that the anomalies are typically attached to the host cell and are often very small compared to the plankton cell. These challenges are apparent when observing the Faster R-CNN results (Table 3), where the accuracies are not as high as commonly seen in object detection problems. Three configurations for the R-CNN-based method were evaluated: (1) detection of anomalies only, (2) detection of both anomalies and plankton cells, and (3) detection of healthy plankton cells and plankton cells with anomalies as separate classes. Based on the results, it is evident that learning what normal plankton cells look like is beneficial for the R-CNN. The model trained only on anomalies often detects parts of the phytoplankton itself as anomalies. It was further noticed that when trained on healthy plankton cells and plankton cells with anomalies as separate classes, the detector often fails to correctly detect the bounding boxes in the presence of anomalies but still produces correct classification results (OK vs. NOK). This raises questions about the generalizability of the method.

The Faster R-CNN-based method (model trained on both anomalies and plankton cells) achieved higher accuracy (F1 score: 0.86) than the autoencoder-based method (F1 score: 0.75). This is understandable, as the autoencoder-based method did not have access to the images with anomalies during the training stage. The Faster R-CNN-based method is more suitable when enough annotated training data is available. This, however, is not typically the case in plankton anomaly and parasite detection for the reasons discussed above. The autoencoder-based approach has some notable advantages over supervised object detectors: (1) no training data with anomalies are needed, (2) no annotated bounding boxes are needed, and (3) the method also works with previously unseen anomalies. These advantages, together with the comparable accuracy to the R-CNN-based method, make the proposed autoencoder method a more promising approach for plankton anomaly detection on new datasets.

Being able to screen large volumes of plankton image data for anomalies has the potential to noticeably reduce manual work and allows more extensive research on parasitic infections. Ecologically speaking, separating cells with anomalies is interesting and can lead to new research avenues in the future. Anomaly detections can give a valuable first hint of putative parasite infections (or physiologically stressed phytoplankton). However, further method development is needed to make it possible to distinguish between the different types of anomalies and relate them with more certainty to parasites. An interesting future direction would be to apply clustering methods to the detected anomalies to identify different anomaly types. The classification of anomaly types could be validated by a wider community effort with expertise on different parasite groups, epibionts and symbionts. In combination with classifying different types of anomalies, different datasets could also be collected from known parasites/epibionts/host-derived anomalies from culture systems. Detected parasites and their presence could be further qualitatively confirmed in parallel by additional methods such as microscopy or eDNA-based approaches.

6 Conclusion

This paper presents an unsupervised approach for detecting anomalies in phytoplankton images, which was tested on nine phytoplankton classes. Although studies exist on plankton anomaly detection in the context of open-set recognition, i.e., detecting previously unseen plankton classes and non-plankton particles, our paper is, to the best of our knowledge, the first one focusing on the detection of small anomalies such as potential phytoplankton parasites or infections in a known set of plankton classes.

We propose an anomaly detection pipeline consisting of a vector-quantized variational autoencoder (VQVAE) in combination with the eight-layer deep convolutional architecture (ConvM2), the HardNet feature extractor and the Local Outlier Factor classifier. With this pipeline, we achieved an average F1 score of 0.75 for all nine analyzed phytoplankton classes. We also suggest that the achieved anomaly detection results could be further improved by optimizing the components of the pipeline for each class separately.

The results achieved with this approach were compared to a supervised Faster R-CNN object detector for experiments in three configurations: (1) detection of anomalies only, (2) detection of anomalies and plankton cells, and (3) detection of plankton cells with and without anomalies. The best results were achieved with the second configuration, with an average F1 score of 0.86. Although this score is higher than the one achieved by the autoencoder proposed in this paper, our approach is more universal because it can also detect previously unseen anomalies and needs no anomalous samples for its training. These benefits make the proposed autoencoder approach more promising for plankton research, where annotated anomaly datasets are not available or their collection is infeasible. The code and dataset used in this study have been made publicly available as a part of this paper.

To identify parasitic infections in phytoplankton, manual scrutiny of the data is still needed as our knowledge of them is still very poor. Our proposed method, however, provides a solution to reduce the workload of manual checking enormously by pointing out the proportion of images an expert potentially needs to go through. This opens a possibility to begin understanding the important role that parasites play in the phytoplankton community dynamics.