1 Introduction

Plankton, including phytoplankton, mixoplankton and zooplankton, are a fundamental component of aquatic ecosystems (Flynn et al. 2019; Glibert and Mitra 2022; Mitra et al. 2023). They form the basis of the food web and are essential for global biogeochemical cycles (Arrigo 2005; Hays et al. 2005). Plankton comprise a diverse array of life forms associated with a variety of functions and strong interspecific associations (De Vargas et al. 2015). Aquatic ecosystems have been subjected to changes forced by climate and anthropogenic drivers, which have already led to species loss affecting the provision of critical ecosystem services, such as water quality and productivity in some regions (Worm et al. 2006). To improve management practices for aquatic ecosystems, it is essential to understand the functioning of planktonic communities, the distribution of different life forms, and how these are affected by anthropogenic and climate change (Rogers et al. 2022).

Phytoplankton blooms are observed when favorable conditions trigger algal growth and accumulation in the environment. Although blooms are part of natural productive cycles in aquatic ecosystems (e.g. the increased production during spring in temperate systems), harmful blooms that are a nuisance to recreational use, and can even be hazardous, also occur (Anderson et al. 2019; Zohdi and Abbaspour 2019). Due to their importance and potential adverse effects, understanding blooms is essential, and efforts have been made towards developing effective observation networks and predictive models (Zhou et al. 2023). Blooms have traditionally been monitored by analyzing fixed samples under the microscope, but despite the high taxonomic detail this method provides, its high costs and time requirements limit the number of analyzed samples (Zingone et al. 2015). Remote sensing has been employed to increase the coverage of bloom observations, showing that algal blooms have been expanding and intensifying in many coastal areas due to environmental changes (Dai et al. 2023), even though such methods yield limited taxonomic information. Thus, methods such as imaging that can identify different species at high speed are needed to enhance our knowledge of bloom-forming species dynamics, improve understanding of plankton communities, and provide more accurate, higher-frequency data for model validation (Kraft et al. 2021).

Studying and monitoring plankton is hindered by their microscopic size, fast turnover rates and close interaction with multiscale hydrodynamics (Benfield et al. 2007). Recent advances in plankton imaging systems have led to their popularization and integration into monitoring programs, collectively accumulating information on plankton systems and simultaneously gathering massive amounts of image data (Benfield et al. 2007; Cowen and Guigand 2008; Lombard et al. 2019; Olson and Sosik 2007; Picheral et al. 2010). The major constraint on the use of these datasets lies in the expert annotation of plankton images, which is expensive, time-consuming, and error-prone. To fully benefit from the technological development and to properly exploit the gathered information, there is a clear need for automated analysis methods. In recent years, significant research effort has been put into exploring and developing automated methods for plankton recognition based on computer vision and machine learning (e.g. Lumini and Nanni 2019a; Orenstein and Beijbom 2017).

The research on automatic plankton image recognition has matured from early works based on hand-engineered image features combined with traditional classifiers such as the support vector machine (SVM) (Cortes and Vapnik 1995) and the random decision forest (RDF) (Ho 1995) (see e.g. Tang et al. 1998; Sosik and Olson 2007) to feature learning-based approaches utilizing deep learning and especially convolutional neural networks (CNNs) (Lee et al. 2016; Orenstein and Beijbom 2017; Lumini and Nanni 2019a; Kloster et al. 2020). Various custom methods and modifications of general-purpose techniques have been proposed to address the special characteristics of plankton image data. However, despite the high recognition accuracies reported in the literature, these methods have not been widely adopted for operational use. Many instrument users do not possess the computational skills and/or resources required to implement custom methods for image recognition and often rely on the default methods that come with the instruments; these typically follow rather simple approaches and do not fully exploit the latest advances in computer vision and machine learning. Deploying deep learning-based methods in new environments often requires notable amounts of labeled training data and expert knowledge, whereas publicly available feature-engineering-based plankton recognition libraries are accessible to non-experts.

Several survey papers already exist on more general microorganism recognition, as well as on applying machine learning to marine ecology. Zhang et al. (2022) presented a review of machine learning approaches for microorganism image analysis, including history, trends, and applications; the paper covers the segmentation, clustering, and classification of various types of microorganism data. Rani et al. (2021) described and compared existing microorganism recognition methods. While the challenges are briefly discussed, the discussion remains on a general level and does not go deeply into the solutions. Li et al. (2019a) provided a review of microorganism recognition across various application domains with a focus on traditional feature engineering approaches. The survey by Goodwin et al. (2022) covers an even larger scope by addressing the utilization of deep learning methods in marine research. A similar survey was provided by Mittal et al. (2022), who presented existing methods for underwater image classification including fish, plankton, coral reefs, seagrass, and submarines. Bachimanchi et al. (2023) presented a brief survey on deep learning methods for data analysis in plankton ecology, including recognition, tracking, and biomass estimation. Irisson et al. (2022) provided a plankton recognition review from the application (aquatic research) point of view; they present a rather compact survey of the machine learning methods but provide several insights on utilizing machine learning to solve various application-related research questions. Luo et al. (2021b) considered plankton analysis using imaging flow cytometry; in addition to the different imaging technologies, automatic image analysis methods are also reviewed. These earlier surveys either have a considerably wider scope, covering various machine learning tasks and organisms and therefore not focusing on challenges specific to plankton recognition, or a narrower scope concentrating on certain plankton imaging technologies, and thus lack a comprehensive review of plankton recognition in general.

In contrast to earlier surveys, we focus on the challenges that researchers commonly face when developing plankton recognition methods and on the existing solutions to them. The main goals of this survey are (1) to provide an extensive guide to the available methods for addressing the challenging characteristics of plankton image data, and (2) to enumerate the challenges that remain unsolved and identify the most beneficial directions for future research on the topic. We identify and list the most notable challenges in automatic plankton recognition and provide detailed descriptions of the solutions found in the plankton recognition literature for each challenge. To the best of our knowledge, this is the first comprehensive survey focusing exclusively on plankton recognition and the specific challenges related to it.

The rest of the paper is organized as follows. In Sect. 2, plankton imaging, i.e., imaging instruments and existing image datasets, is reviewed. In Sect. 3, automatic plankton recognition, including feature engineering and CNNs, is discussed. In Sect. 4, the most notable challenges in plankton recognition are identified. In Sect. 5, the existing solutions for each challenge are described. Finally, the paper concludes with directions for future research in Sect. 6.

2 Plankton imaging

2.1 Imaging instruments

A fundamental understanding of how plankton species composition is regulated requires frequent and sustained observations. As plankton communities are diverse and dynamic, monitoring plankton is challenging. Different types of plankton imaging and analysis systems have been developed to identify and enumerate living (plankton) and non-living particles in natural waters (Benfield et al. 2007). Instruments designed for monitoring plankton communities are briefly discussed next (see review by Lombard et al. (2019) for more detailed information). The specifications of the main imaging instruments are summarized in Table 1.

Microscopy has been widely employed for the analysis of plankton, with most standard monitoring of plankton organisms based on brightfield microscopy (Zingone et al. 2015). With the possibility of easy magnification change, microscopy can cover the whole size range of plankton. Combined with other technologies, such as fluorescence, it provides a flexible array of options for visualizing planktonic organisms. When combined with a digital camera, it can generate high-quality images at relatively low operational cost, although the number of images is limited in comparison to other devices. Imaging flow cytometry (IFC) combines fluidics, optical characterization and the imaging of cells/colonies. The Imaging FlowCytobot (IFCB) (Olson and Sosik 2007) and the CytoSense/Cytobuoy (Dubelaar et al. 1999), as well as simpler flow systems such as the FlowCam (Sieracki et al. 1998) and the ZooCAM (Colas et al. 2018), are among the imaging devices most frequently used within aquatic research. The IFCB is a fully automated, submersible instrument that routinely operates during deployments, imaging each particle that triggers the camera. The CytoSense, available as either a benchtop or a submersible version, records forward scatter (FSC), side scatter (SSC) and multiple fluorescence signals of each particle; additionally, it can image a subset of the analysed particles. Unlike the IFCB and CytoSense, the FlowCam does not have sheath fluid and is not an automated in situ instrument. Particle detection in the IFCB and CytoSense is triggered by one of the optical sensors (scatter or fluorescence), while the FlowCam captures images of a field of view at regular intervals in which particles can be identified (autotrigger mode). If the FlowCam is equipped with a laser, particle imaging can be triggered by fluorescence properties, such as the presence of chlorophyll-a. The imaging resolution of the IFCB and CytoSense targets a size range from approximately larger nanoplankton to smaller mesoplankton. The targeted size range for the FlowCam varies according to the combination of flow cell and objective used; instrument versions for imaging smaller and larger objects and organisms, the FlowCam-Nano and FlowCam-Macro respectively, are currently available, with image capture based on the autotrigger. The ZooCAM uses an imaging principle similar to that of the FlowCam autotrigger.

For obtaining quantitative information on plankton larger than 100 μm, larger volumes of water need to be examined than is possible with IFC (Lombard et al. 2019). For imaging larger particles, different types of instruments utilizing slightly distinct techniques have been developed. There are many commercially available instruments, such as the In-situ Ichthyoplankton Imaging System (ISIIS) (Cowen and Guigand 2008), the Continuous Plankton Imaging and Classification Sensor (CPICS) (Grossmann et al. 2015), the ZooScan (Gorsky et al. 2010), the Video Plankton Recorder (VPR) (Davis et al. 2005), the Underwater Vision Profiler (UVP) (Picheral et al. 2010), and the Lightframe On-sight Keyspecies Investigation (LOKI) (Schulz et al. 2010), which are mostly in situ imaging systems; their operational principles and capabilities are reviewed by Lombard et al. (2019). Some instruments have been developed for research purposes but are not commercially available, such as the ZooCAM and the Prince William Sound Plankton Camera (PWSPC) (Campbell et al. 2020).

Some of the more recent imaging instruments include the Scripps Plankton Camera (SPC) system (Orenstein et al. 2020b), a submersible Digital Holographic Camera (DHC) instrument for temporal and spatial plankton measurements (Dyomin et al. 2020, 2019), and its modification, the miniDHC (Dyomin et al. 2021, 2019). The HOLOCAM (Nayak et al. 2018), HoloSea (Walcutt et al. 2020; MacNeil et al. 2021), and LISST-Holo are also utilized for underwater microscopy using digital holographic imaging (DHI). The SPC utilizes an underwater dark-field imaging microscope combined with an onboard computer that allows real-time processing of the images, while the four latter instruments produce 3-D holograms of the imaged volume. The core principle of DHI is the optical interference phenomenon: a coherent light source, typically a laser, produces an interference pattern between the undeviated portion of the beam and the light diffracted by the object, which is recorded on the sensor; the holograms are then reconstructed with computer-based pre-/post-processing algorithms (Watson 2018). The main reasons for the emergence of DHI microscopy are its wide depth-of-field and field-of-view, i.e., a larger sampling volume, and a mechanically simpler optical configuration compared to lens-based devices (Walcutt et al. 2020; Watson 2018).

As the focus of this review is on image recognition, we stress that the instrument list is not exhaustive and covers only the methods most used in the publications surveyed. Instruments not detailed here include underwater microscopes, scanning electron microscopy, and systems with the capacity to image different fluorescence channels, such as the Amnis ImageStreamX Mk II Imaging Flow Cytometer (Cytek) and environmental high-content fluorescence microscopy (Colin et al. 2017).

Table 1 Plankton imaging instruments

2.2 Publicly available image datasets

Publicly available image datasets are crucial for the development of automatic plankton recognition methods, since the most labor-intensive part of the process is creating large labeled training and testing datasets. Available datasets are also important for the traceability and comparability of the developed methods. Several publicly available datasets can be utilized in research on machine learning methods for plankton recognition. However, it is not always clear from the reported results whether there are differences in classification performance among the classes and, if so, which classes perform better than others. This is relevant for understanding whether there are potential class-specific biases in classifiers, which could be associated with specific size classes and the robustness of the organisms. The details of the publicly available and commonly used datasets are summarized in Table 2, and example images from the datasets are shown in Fig. 1. The most frequently used datasets are ZooScanNet (Elineau et al. 2018), Kaggle-Plankton (PlanktonSet-1.0) (Cowen et al. 2015), WHOI-Plankton (Orenstein et al. 2015; Sosik et al. 2021) and their manifold task-specific subsets. They all comprise grayscale images collected with a single plankton imaging instrument. The UVP5/MC dataset (Kiko and Simon-Martin 2020) consists of data collected in the EcoTaxa application (Picheral et al. 2017); part of it has been annotated by an expert and part with an automated tool. More recently collected datasets include PMID2019 (Li et al. 2019b), miniPPlankton (Sun et al. 2020), DYB-PlanktonNet (Li et al. 2021b), Lake-Zooplankton (Kyathanahally et al. 2021a), and the one collected by Plonus et al. (2021b). These are acquired with modern imaging instruments and characterized by the presence of color and a higher resolution. The SYKE-plankton_IFCB_2022 (Kraft et al. 2022c) and SYKE-plankton_IFCB_Utö_2021 (Kraft et al. 2022a) datasets consist of IFCB images of phytoplankton collected from the Baltic Sea. There are also references to some older, commonly used plankton datasets that are no longer available, for example the Automatic Diatom Identification And Classification (ADIAC) database (Du Buf et al. 1999).

Table 2 Existing plankton image data sets
Fig. 1 Example images from the publicly available data sets: a Kaggle-Plankton (Cowen et al. 2015); b WHOI-Plankton (Orenstein et al. 2015); c PMID2019 (Li et al. 2019b); d ZooScan (Elineau et al. 2018); e DYB-PlanktonNet (Li et al. 2021b); f SYKE 2022 (Kraft et al. 2022c)

3 Automatic plankton recognition

3.1 Feature engineering

A traditional solution for image classification, including plankton recognition, is to divide the problem into two steps: image feature extraction and classification (Blaschko et al. 2005; Bueno et al. 2017; Ellen et al. 2015; Grosjean et al. 2004; Sosik and Olson 2007; Zetsche et al. 2014; Barsanti et al. 2021). Ideally, image features form a lower-dimensional representation of the image content that contains the information relevant for classification. The main challenge is to design and select good features that are both general and provide good discrimination between the classes. After feature extraction, the obtained feature vectors are used to train a classifier that can then classify unseen images. The most commonly used classifiers for plankton recognition are the support vector machine (SVM) (Bernhard et al. 1992; Cortes and Vapnik 1995) and the random decision forest (RDF) (Ho 1995). An SVM in its simplest form is a binary linear classifier that maps the data points in the feature space in such a way that the margin between the two classes is maximized. It can be extended to the multi-class case, for example by utilizing multiple binary classifiers, and to non-linear classification by using the kernel trick. The RDF is a widely used classification method based on the observation that combining several classifiers into an ensemble typically provides better classification performance than any of the individual classifiers. In a typical RDF, a large number of decision tree classifiers are constructed and the final classification is obtained by computing the mode of the individual classifications. This avoids the overfitting problem typical of single decision trees.
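To make this two-step pipeline concrete, the following minimal sketch (assuming scikit-learn; the feature matrix and labels are random placeholders) trains both an SVM and an RDF on pre-computed feature vectors:

```python
# Minimal sketch of the traditional two-step pipeline: hand-engineered
# feature vectors -> classifier. Feature extraction is assumed to have
# already produced a matrix X (n_samples x n_features) and labels y.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))        # placeholder feature vectors
y = rng.integers(0, 5, size=500)      # placeholder class labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# SVM with an RBF kernel (the "kernel trick"); multi-class is handled
# internally via multiple binary classifiers.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
svm.fit(X_tr, y_tr)

# Random decision forest: an ensemble of decision trees whose final
# label is the mode of the individual tree predictions.
rdf = RandomForestClassifier(n_estimators=200, random_state=0)
rdf.fit(X_tr, y_tr)

print("SVM accuracy:", svm.score(X_te, y_te))
print("RDF accuracy:", rdf.score(X_te, y_te))
```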

The first work on automatic plankton image classification was presented by Tang et al. (1998). The image data were produced using a video plankton recorder (VPR) (Davis et al. 1992) and the proposed method combined texture and shape information of plankton images in a descriptor that is the combination of traditional invariant moment features and Fourier boundary descriptors with gray-scale morphological granulometries. It should be noted that some papers on automatic plankton recognition based on non-image data have been published even earlier. For example, Boddy et al. (1994) utilized light scatter and fluorescence data obtained by flow cytometry to train an artificial neural network (ANN) to classify plankton species.

Finding good image features is essential for any plankton classification system (Cheng et al. 2018; Corgnati et al. 2016). Various feature extraction techniques have been proposed and put into practice for different underwater imaging environments (Sosik and Olson 2007; Zetsche et al. 2014). Frequently used plankton features include texture features (e.g. Mosleh et al. 2012), geometric and shape features (e.g. Tan et al. 2014), color features (e.g. Ellen et al. 2015), local features (e.g. Zheng et al. 2017), and model-based features (e.g. Rivas-Villar et al. 2021). Table 4 in Appendix A categorizes and summarizes the various features used for plankton recognition.

The most commonly used type of image feature in plankton recognition is shape features (see e.g. Sosik and Olson 2007; Zetsche et al. 2014), which characterize either the contour or the binary mask of the object (plankton). In their simplest form, geometric features are numerical descriptors of generic geometric aspects, such as the major and minor axis length, perimeter, equivalent spherical diameter and area of an object, computed from a binarized image. Another common approach is to utilize image moments to describe the shape. Both Hu moments (Hu 1962; Thiel et al. 1995; Liu et al. 2021a; Zhao et al. 2005, 2010) and Zernike moments (Khotanzad and Hong 1990; Blaschko et al. 2005) have been proposed for plankton recognition. Various advanced features quantifying the shape of the contour have also been proposed for plankton data. These include boundary smoothness (e.g. Tang et al. 2006; Liu and Watson 2020), affine curvature descriptors (Liu and Watson 2020), Freeman contour code features (Rodenacker et al. 2006), and elliptical Fourier descriptors (Sánchez et al. 2019a; Beszteri et al. 2018). Further geometric features applied to plankton recognition include symmetry measures [e.g. the Hausdorff distance (Guo et al. 2021c; Sosik and Olson 2007)] and granulometries (Kingman 1975) utilizing morphological operations (Luo et al. 2005; Kramer 2005; Tang et al. 2006; Wu and Sheu 1998).
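As a rough illustration, the sketch below (assuming scikit-image; the binary mask is a placeholder rectangle) computes a few of the geometric descriptors and Hu moments mentioned above:

```python
# Illustrative sketch of simple geometric and moment-based shape
# features computed from a binary plankton mask with scikit-image.
import numpy as np
from skimage.measure import (label, regionprops, moments_central,
                             moments_normalized, moments_hu)

mask = np.zeros((64, 64), dtype=bool)
mask[20:45, 15:50] = True             # placeholder for a segmented object

region = regionprops(label(mask))[0]
features = {
    "area": region.area,
    "perimeter": region.perimeter,
    "major_axis": region.major_axis_length,
    "minor_axis": region.minor_axis_length,
    "eccentricity": region.eccentricity,
    "equiv_diameter": region.equivalent_diameter,
}

# Hu's seven translation/scale/rotation-invariant moments of the mask.
mu = moments_central(mask.astype(float))
hu = moments_hu(moments_normalized(mu))
```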

Another frequently used type of feature in plankton recognition systems is texture features, which quantify the spatial distribution of intensity or color values in local image regions. While shape features consider only the boundary of the plankton, texture features describe the region inside the boundary. The simplest texture features commonly applied in plankton recognition are first-order statistical descriptors that compute simple statistics directly from the intensity values (see e.g. Lisin 2006; Zetsche et al. 2014; Guo et al. 2021c). These are sometimes called color features and include, for example, the mean and variance of intensity, as well as the skewness and kurtosis, which quantify the shape of the color or intensity histogram. First-order statistics only provide information on how the intensity or color values are distributed in the image. To obtain further spatial information on texture, various second-order statistical descriptors have been proposed. The most common second-order statistical descriptor used in plankton recognition is the co-occurrence matrix (Hu and Davis 2005; Liu et al. 2021a; Shan et al. 2020; Wei et al. 2022), which describes the statistics of pixel color pairs occurring at a certain distance from each other in the image. More advanced texture features proposed for plankton recognition include Local Binary Patterns (LBP) (Ojala et al. 2002; Schulze et al. 2013; Chang et al. 2016; Lisin 2006; Yu and Sun 2023) and Gabor descriptors (Idrissa and Acheroy 2002; Sánchez et al. 2019b; Bueno et al. 2017).
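The following sketch (assuming scikit-image; the image patch is a random placeholder) illustrates the two most common descriptors mentioned above, the co-occurrence matrix and LBP:

```python
# Sketch of second-order texture descriptors: a gray-level co-occurrence
# matrix (GLCM) with derived statistics, and a local binary pattern
# histogram.
import numpy as np
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern

img = (np.random.rand(64, 64) * 255).astype(np.uint8)   # placeholder patch

# Co-occurrence of gray-level pairs at distance 1, four orientations.
glcm = graycomatrix(img, distances=[1],
                    angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                    levels=256, symmetric=True, normed=True)
contrast = graycoprops(glcm, "contrast").mean()
homogeneity = graycoprops(glcm, "homogeneity").mean()

# LBP codes each pixel by thresholding its 8-neighborhood; the histogram
# of codes serves as the texture feature vector.
lbp = local_binary_pattern(img, P=8, R=1, method="uniform")
hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
```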

The third widely utilized group of image features is local features, which typically combine feature detectors and descriptors. Feature detectors search the image for characteristic interest points or regions that contain useful information for the task, i.e., plankton recognition; local feature descriptors then quantify these regions. General-purpose feature descriptors that have been applied to plankton images include the Histogram of Oriented Gradients (HOG) (Dalal and Triggs 2005; Bi et al. 2015; Guo et al. 2021c), Scale Invariant Feature Transform (SIFT) (Lowe 2004; Tsechpenakis et al. 2007), Speeded Up Robust Features (SURF) (Bay et al. 2006; Chang et al. 2016), Inner-Distance Shape Context (IDSC) (Ling and Jacobs 2007; Zheng et al. 2017), and phase congruency descriptors (PCD) (Kovesi 2000; Sánchez et al. 2019b; Verikas et al. 2012).
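As a brief example, a HOG descriptor of the kind listed above can be computed with scikit-image (the input image is a placeholder):

```python
# Sketch of a HOG descriptor, one of the general-purpose gradient-based
# features listed above.
import numpy as np
from skimage.feature import hog

img = np.random.rand(64, 64)          # placeholder grayscale image
descriptor = hog(img, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), block_norm="L2-Hys")
# `descriptor` is a 1-D gradient-orientation histogram vector that can be
# concatenated with shape and texture features before classification.
```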

Feature engineering-based methods for plankton recognition usually combine features from different groups to obtain more representative feature vectors. For example, Zheng et al. (2017) used geometric features (e.g. size and shape measurements, such as area, circularity, elongation, convex rate), color features (e.g. sum, mean, standard deviation of color values), texture features [e.g. Gabor descriptors and Local Binary Pattern (LBP)] and local features (e.g. HOG and SIFT). Sosik and Olson (2007) applied simple geometry features, shape and symmetry features, as well as texture features including co-occurrence matrices for phytoplankton recognition. Wacquet et al. (2018) extracted 26 features including basic shape features, advanced morphological features, and color features.

Typical plankton recognition systems further apply additional feature selection (see e.g. Zheng et al. 2017) or dimensionality reduction steps to construct compact feature representations. In feature selection, the large set of initial features is ranked based on how representative or informative the features are, and the least informative ones are discarded. For example, Tang et al. (2006) proposed the normalized multilevel dominant eigenvector estimation (NMDEE) technique to select the best feature set for plankton recognition. In dimensionality reduction, principal component analysis (PCA) or a similar technique is applied to reduce the length of the extracted feature vector while preserving the maximum amount of information. For example, Li et al. (2014) and Chang et al. (2016) utilized PCA as part of their plankton recognition systems.
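A minimal sketch of PCA-based dimensionality reduction (assuming scikit-learn; the combined feature matrix is a placeholder):

```python
# Sketch of PCA-based dimensionality reduction of a concatenated feature
# vector, keeping enough components to preserve 95% of the variance.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 120)          # placeholder concatenated features
pca = PCA(n_components=0.95)          # keep 95% of explained variance
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```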

Although feature-engineering-based techniques have been applied with promising results, they require discrete parts, i.e., feature extraction, feature selection, and training a classifier. Due to the difficulty of finding general features that provide high classification accuracy across different datasets, feature-engineering-based plankton recognition methods are often ad hoc solutions tuned for a single imaging instrument and provide limited accuracy. Moreover, based on previous works (Al-Barazanchi et al. 2015b; Khalid et al. 2014), integrating a new class into an existing system typically requires extensive manual work to find new features that can represent the new class. Depending on the quality of the feature design, providing a suitable framework for the accurate, rapid and simple classification of plankton species is not always possible.

3.2 Convolutional neural networks

Recently, CNNs have replaced traditional feature engineering techniques in various computer vision applications. The notable difference is that the image features are learned from the data instead of being manually designed. A CNN (LeCun et al. 2015) is a type of neural network model for image processing inspired by the animal visual cortex. The key components of CNNs are the convolutional layers, which consist of neurons that each process data only within their receptive field. Due to the shared-weight architecture, these neurons essentially perform a convolution of the input with a filter defined by the weights of the neurons, making it possible to learn the feature extraction filters (weights) through backpropagation. A typical CNN involves repetitions of several convolution layers and a pooling layer, followed by a set of fully connected layers. The convolution and pooling layers perform feature extraction, and the fully connected layers perform the higher-level reasoning and map the extracted features to the final output. Increasing the number of convolutional layers (the depth of the network) allows more complex relations between features to be represented, often leading to better recognition accuracy at the cost of more parameters. An example of a CNN structure is shown in Fig. 2. In recent years, CNN-based approaches have become dominant in various image analysis tasks, providing state-of-the-art performance, for example, in image classification, object localization, and image segmentation (Teuwen and Moriakov 2020). Fig. 3 illustrates how the popularity of CNNs and feature-engineering-based approaches for plankton recognition has changed over the years. It can be seen that the introduction of CNNs clearly boosted research in the field.
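A minimal PyTorch sketch of such a structure, with illustrative layer sizes rather than any published plankton architecture, is given below:

```python
# Minimal sketch of the typical CNN structure described above: stacked
# convolution + pooling blocks followed by fully connected layers.
import torch
import torch.nn as nn

class SmallPlanktonCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(      # convolutional feature extractor
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(    # fully connected reasoning part
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# 64x64 grayscale input batch; output is one logit per class.
logits = SmallPlanktonCNN()(torch.randn(4, 1, 64, 64))
```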

Fig. 2 Architecture of a typical convolutional neural network

Fig. 3 Popularity of feature engineering and feature learning [CNNs and Vision Transformers (ViTs)] based methods on plankton recognition

The first works on CNN-based classification of plankton images were carried out in 2015. Zheng and Wang (2015) performed preliminary experiments applying a CNN to automated plankton recognition: a small CNN model (3–5 layers) was tested on zooplankton data. Similarly, Kuang (2015) used a CNN together with data augmentation to solve the recognition task. Al-Barazanchi et al. (2015a) proposed a hybrid solution where a CNN was used for plankton image feature extraction and an RDF or SVM for classification.

One reason why CNNs have become more popular is that they have repeatedly been shown to outperform the traditional feature engineering approach, and their architectural components have been studied with care (Gu et al. 2018). For example, Zheng and Wang (2015) compared a CNN-based plankton image classifier to traditional classifiers such as a multi-layer perceptron (MLP) model utilizing hand-engineered features; the CNN outperformed the earlier methods. In various experiments (Orenstein et al. 2015; Orenstein and Beijbom 2017; Guo et al. 2021c), CNNs have demonstrated higher plankton recognition accuracy than an RDF combined with hand-selected features. The preliminary experiments by Mitra et al. (2019) on planktonic foraminifera species suggest that CNNs can even surpass humans in plankton recognition accuracy in certain cases where the taxonomy is nuanced. However, in some special cases, when the computation time is heavily restricted (e.g. embedded systems), feature-engineering-based approaches may still be preferable (see e.g. Zimmerman et al. 2020).

3.2.1 CNN architectures

Numerous CNN architectures have been suggested for plankton recognition. These include common CNNs developed for generic image recognition, such as AlexNet (Krizhevsky et al. 2012), VGGNet (Simonyan and Zisserman 2014), GoogLeNet (Szegedy et al. 2015), and ResNet (He et al. 2016). AlexNet was the first deep CNN applied to general image recognition and contains 8 layers. VGGNet uses smaller convolution filters (3 × 3) than AlexNet to obtain deeper networks (up to 19 layers) and more nonlinearity while reducing the number of parameters. GoogLeNet and its modifications (e.g. InceptionV1 and InceptionV3) utilize inception modules that apply convolutional filters of different sizes simultaneously to capture information at various scales. ResNet introduced the residual block, which uses a shortcut connection; this helps avoid the vanishing gradient problem when training very deep networks (up to 152 layers).

Lumini and Nanni (2019a) compared AlexNet, DenseNet (Huang et al. 2017), ResNet, VGGNet, GoogleNet, and SqueezeNet (Iandola et al. 2016); DenseNet produced the best classification results on the ZooScan, Kaggle-Plankton and WHOI datasets. Liu et al. (2018a) evaluated AlexNet, VGG16 (Simonyan and Zisserman 2014), GoogLeNet, PyramidNet (Han et al. 2017) and ResNet; the results suggest that PyramidNet provided improved accuracy on the WHOI-Plankton dataset. Sánchez et al. (2019b) compared ResNet, AlexNet, VGGNet, SqueezeNet, DenseNet, and InceptionV3 (Szegedy et al. 2016) on a dataset consisting of 1085 diatom images of 14 different classes; DenseNet, ResNet and VGG provided the highest accuracy. Kloster et al. (2020) extensively tested various CNN architectures; notably, the relatively shallow VGG-16 model outperformed more modern architectures. Table 5 in Appendix A gives a summary of the different architectures that have been utilized in plankton recognition.

There are also CNN architectures developed specifically for plankton recognition. Al-Barazanchi et al. (2018) proposed a shallow VGGNet-based architecture for the task. Dai et al. (2016a) proposed a CNN architecture called ZooplanktoNet, characterized by the ability to capture more general and representative features than previous predefined feature extraction algorithms; it was strongly inspired by AlexNet and VGGNet. A comparative experiment with different CNN architectures including AlexNet, VGGNet and GoogleNet was carried out, and ZooplanktoNet was found to outperform the other architectures on zooplankton classification. Yan et al. (2017) proposed another light CNN architecture for plankton recognition utilizing a smaller filter size and fewer fully connected layers. Li et al. (2019c) proposed the tiny attention network (TANet), consisting of three main parts: a reduction module, a self-attention operation, and group convolution. The reduction module reduces the information loss caused by the pooling operation, self-attention improves the feature learning ability, and group convolution compresses the model size. One of the benefits of the TANet model is its small size, which allows real-time classification on mobile devices. Luo et al. (2021a) presented a custom architecture, MCellNet, derived from MobileNetV2 (Sandler et al. 2018); the model was shown to outperform MobileNetV2 on plankton data in both accuracy and computation time. Xu et al. (2022) developed a CNN for classifying algae based on the ResNet and SENet architectures. Benammar et al. (2021) applied a custom architecture utilizing 3D convolutions to image data collected using environmental high-content fluorescence microscopy.

Custom architectures have also been developed for holographic microscopy images, as existing image recognition models cannot be directly applied to raw digital holographic microscopy data. A straightforward approach is to first reconstruct the images and then utilize any common image recognition architecture (see e.g. Qiao et al. 2021; MacNeil et al. 2021). This, however, leads to long processing times, as the reconstruction stage is computationally heavy. It has been shown that, with a custom architecture, CNNs can be successfully applied to raw digital holographic data so that the reconstruction step is avoided (Guo et al. 2021a; Zhang et al. 2021). Simulated holograms have also been proposed for training and testing simultaneous detection and classification of plankton (Scherrer et al. 2021).

Various works (Orenstein and Beijbom 2017; Rivas-Villar et al. 2022) have suggested using CNNs only for feature extraction and utilizing other classifiers, such as SVM or RDF, for the final classification step. Jindal and Mundra (2015) used the output of the first fully connected layer of two CNNs (ClassyFireNet and GoogLeNet) as image features and fed them to an RDF for plankton recognition. A similar approach was evaluated by Orenstein and Beijbom (2017), who utilized AlexNet to extract features for an RDF-based classifier. Sánchez et al. (2019b) compared both approaches, a fine-tuned CNN for classification and a CNN for feature extraction; in their experiments with various CNN architectures, the fine-tuned CNN outperformed the approach where the CNN was used as a feature extractor.
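A rough sketch of this approach (assuming PyTorch/torchvision and scikit-learn; ResNet-18 stands in for the backbones used in these works):

```python
# Sketch of the hybrid approach: a pretrained CNN as a fixed feature
# extractor, with a random forest performing the final classification.
import torch
import torchvision.models as models
from sklearn.ensemble import RandomForestClassifier

backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()     # drop the classification layer
backbone.eval()

@torch.no_grad()
def extract_features(batch):          # batch: (N, 3, 224, 224) tensor
    return backbone(batch).numpy()    # 512-D deep features per image

# With placeholder tensors X_train (images) and y_train (labels):
# rdf = RandomForestClassifier(n_estimators=300)
# rdf.fit(extract_features(X_train), y_train)
```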

Another commonly used approach is to combine multiple CNNs into an ensemble to improve accuracy. This so-called ensemble learning is based on the assumption that the limited performance of an individual recognition model can be compensated by additional models that are more capable of classifying different sets of classes. Kuang (2015) proposed various approaches for model ensembling, including averaging softmax probabilities and applying principal component analysis to concatenated CNN features before a softmax classifier. Lumini and Nanni (2019a) and Lumini et al. (2020) proposed an ensemble of classifiers based on score fusion; various classifier combinations containing different CNN models were evaluated for both plankton and coral classification. Henrichs et al. (2021) proposed an ensemble of 6 CNNs and showed it to outperform an RDF-based classifier. Kyathanahally et al. (2021b) compared various CNN architectures in an ensemble with a multilayer perceptron (MLP) on zooplankton recognition using a mix of feature descriptors and CNN features. Yang et al. (2023) applied an ensemble of CNNs to harmful algae recognition; to avoid false negatives, images were selected for further expert verification if any of the five CNN models classified the image as harmful algae. While ensemble learning has shown slightly improved recognition accuracy, it also increases the computation time and complicates the training process.
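A minimal sketch of score fusion by averaging softmax probabilities, one of the ensembling strategies mentioned above:

```python
# Sketch of score-fusion ensembling: average the softmax outputs of
# several independently trained CNNs and take the argmax.
import torch

def ensemble_predict(models, batch):
    probs = torch.stack([torch.softmax(m(batch), dim=1) for m in models])
    return probs.mean(dim=0).argmax(dim=1)   # average scores, then decide
```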

3.2.2 Hybrid methods

Multiple methods that combine the feature engineering approach with CNNs have been proposed. One approach is to utilize a separate classifier (e.g. RDF) as above; this way, CNN features can simply be supplemented with selected hand-engineered features before classification (see e.g. Orenstein and Beijbom 2017; Keçeli et al. 2017). Similarly, ensembles of classifiers can be utilized to combine handcrafted-feature-based classification and CNNs. For example, in the method proposed by Lumini and Nanni (2019a) and Lumini et al. (2020), the individual classifiers in the ensembles included various CNNs applied to both original and preprocessed (filtered) images; the preprocessing techniques included various filters commonly used to compute local features, such as gradients, LBP and wavelets. Rivas-Villar et al. (2021) combined color and texture features with deep CNN features; both RDF and SVM were tested for classification. Dai et al. (2016b) proposed a multi-stream CNN for plankton classification, where multiple inputs are processed in parallel as different streams before the features are merged or concatenated for classification. In addition to the original image, a global feature image representing the shape and a local feature image representing the edge information were used as input. A similar approach was proposed by Cui et al. (2018), where the original image, a shape image, and a texture image were processed in separate streams before feature concatenation; the concatenated feature maps were processed with one more convolutional and pooling layer, a set of fully connected layers and a softmax layer. A related approach was proposed by Ellen et al. (2019), who utilized non-image information (metadata) in CNN-based plankton classification. Various architectures for fusing metadata with CNN-based image features were proposed, consisting of a set of convolutional and pooling layers for the image and fully connected layers for the metadata, followed by feature concatenation and common fully connected layers for the classification. Similar hybrid models were also utilized in Benammar et al. (2021).
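A simplified sketch of such a metadata-fusion architecture (layer sizes and dimensions are illustrative, not those of Ellen et al. 2019):

```python
# Sketch of metadata fusion: CNN features from the image are concatenated
# with non-image metadata before shared fully connected layers.
import torch
import torch.nn as nn

class MetadataFusionNet(nn.Module):
    def __init__(self, cnn_backbone, cnn_dim=512, meta_dim=8, num_classes=10):
        super().__init__()
        self.cnn = cnn_backbone               # image branch, outputs (N, cnn_dim)
        self.meta = nn.Sequential(            # metadata branch
            nn.Linear(meta_dim, 32), nn.ReLU(),
        )
        self.head = nn.Sequential(            # shared classifier after fusion
            nn.Linear(cnn_dim + 32, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, image, metadata):
        fused = torch.cat([self.cnn(image), self.meta(metadata)], dim=1)
        return self.head(fused)
```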

Various other modifications of baseline CNN classifiers also exist. Kosov et al. (2018) proposed a conditional random field model utilizing spatial relations among pixel-based CNN classification results and global features for microorganism detection and recognition. Liu et al. (2018b) proposed including a squeeze-and-excitation block (Hu et al. 2018) in a deep pyramidal residual network to increase plankton recognition accuracy. Luo et al. (2018) took into account the fact that typical plankton images contain a large number of background pixels without useful information and applied spatially sparse convolutional neural networks originally developed for handwriting recognition (Graham 2014). Cheng et al. (2020) proposed combining two CNNs, one applied to the normal Cartesian-coordinate image and one to the same image transformed into a polar representation; this provides rotational invariance in addition to the translation invariance of the baseline CNN.

3.2.3 Transformers

In addition to CNNs, other feature learning approaches have also been proposed for plankton recognition. One of the most promising is the Vision Transformer (ViT) (Dosovitskiy et al. 2021), which works by dividing the image into patches, resulting in a sequence of vectors (tokens) that are fed to the model. The architecture allows the model to measure relationships between pairs of image patches, making it possible to learn to identify the most informative regions in an image via self-attention. Kyathanahally et al. (2022) applied ensembles of Data-efficient image Transformers (DeiTs) to various ecological image datasets, including four publicly available plankton datasets, and obtained state-of-the-art performance. Maracani et al. (2023) evaluated three different transformer variants: the ViT, the Hierarchical Vision Transformer (Swin) (Liu et al. 2021b), and BEiT, an image transformer with BERT-style pre-training (Bao et al. 2021).
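As an illustration, a pre-trained ViT can be adapted to a plankton class set in a few lines, for example with the timm library (the model variant and class count below are arbitrary):

```python
# Sketch of loading a pre-trained Vision Transformer for fine-tuning on
# plankton classes; images are split into 16x16 patches internally.
import timm
import torch

vit = timm.create_model("vit_base_patch16_224", pretrained=True,
                        num_classes=20)        # placeholder class count
logits = vit(torch.randn(2, 3, 224, 224))      # (batch, num_classes)
```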

3.2.4 Plankton detection

Depending on the imaging instrument, there is sometimes a need to first detect the plankton particles in the images (Moniruzzaman et al. 2017; Cai et al. 2022; Chen et al. 2023). Plankton detection can be applied to two main types of images: single-specimen (including colonial forms) focused images and multi-specimen images. Specimen-focused images are automatically centered and cropped to show only one specimen (see Fig. 1). While plankton recognition on such data can be treated as an image classification task, in some cases a detection step may still be needed due to other plankton particles or detritus in the background. Multi-specimen images capture multiple plankton particles in one frame, such as those obtained using general-purpose microscopes; these contain multiple plankton particles that need to be detected and recognized separately. While detection itself is out of the scope of this survey, we briefly review the existing methods that both detect and recognize plankton, focusing mainly on multi-specimen images.

Modern CNN-based object detection methods such as R-CNN (Girshick et al. 2014), YOLO (Redmon et al. 2016), and their variants perform the detection and recognition simultaneously, providing end-to-end methods for plankton recognition. For example, Pedraza et al. (2018) applied R-CNN to detect and classify diatoms in microscopy images, and Soh et al. (2018) used YOLO to detect and recognize plankton. Wang et al. (2022b) compared multiple CNN-based object detection methods including Faster R-CNN (Ren et al. 2017), SSD (Liu et al. 2016), YOLOv3 (Redmon and Farhadi 2018) and YOLOX (Ge et al. 2021) on imaging flow cytometer data. YOLOX achieved the best accuracy. Chen et al. (2023) explored a family of YOLOv5 (Jocher 2020) architectures in the automated video-oriented plankton detection and tracking workflow.
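As a rough illustration, a pre-trained YOLOv5 model can be loaded via torch.hub (the image path below is hypothetical; in practice the detector would be fine-tuned on labeled plankton frames):

```python
# Sketch of off-the-shelf detection with a YOLOv5 model via torch.hub.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
results = model("multi_specimen_frame.jpg")    # hypothetical image path
detections = results.xyxy[0]                   # boxes, confidences, class ids
```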

While detection methods are typically applied to multi-specimen images, they have also been proposed for the recognition of single-specimen focused images. Li et al. (2021c, 2021d) proposed an improved YOLOv3-based model for plankton detection in IFCB images; the proposed model contains two YOLOv3 networks fused with the DenseNet architecture. Kosov et al. (2018) applied CNN-based features and conditional random fields for plankton localization and segmentation.

Similar to modern detection methods, semantic and instance segmentation methods can also be applied to simultaneously detect and recognize plankton. Ruiz-Santaquiteria et al. (2020) compared a semantic segmentation model called SegNet (Badrinarayanan et al. 2017) and an instance segmentation model called Mask R-CNN (He et al. 2017) on algae detection and recognition.

3.2.5 Comparisons

Many papers utilize in-house datasets, and most publicly available datasets do not provide a standardized evaluation protocol, meaning that different papers use different train-test splits and performance metrics. This makes comparing the performance of different solutions challenging until the principles of making science findable, accessible, interoperable, and reusable (FAIR) are fully adopted (Schoening et al. 2022). Table 3 summarizes some published results obtained on publicly available datasets; however, the reported accuracies are not directly comparable for the reasons mentioned above. One notable comparison of plankton recognition methods is the National Data Science Bowl (Aurelia et al. 2014) from 2015, where the winning team used an ensemble of over 40 convolutional neural networks.

Table 3 Example accuracies on publicly available datasets from various sources

4 Challenges in plankton recognition

Based on the literature on automatic plankton recognition, various challenges can be identified. The most notable are as follows:

1. The amount of labeled data for training is limited. This challenge can be divided into two subchallenges: (1) expert knowledge is required for data labeling, and (2) certain plankton species are notably less common, producing only a small number of example images. Plankton species are inherently difficult to identify, requiring prior expertise. Labeling image data for training and evaluation purposes must be done by experts (e.g. plankton taxonomists), ruling out crowdsourcing tools such as Amazon Mechanical Turk that are commonly used for labeling large datasets. This makes labeling expensive, limiting the amount of labeled data, and it can take years to accumulate enough data to cover rare species. Collecting a labeled training set is essential for deep learning models. Considering that morphological plasticity can be found in all planktonic organisms, a larger amount of labeled training data increases the model's capacity to generalize to new data, while training a large model with a small number of examples increases the risk of overfitting, i.e., learning the noise in the training data, causing the model to perform poorly on unseen images.

2. There is a large imbalance between classes. Image classification with datasets that suffer from a greatly imbalanced class distribution is a challenging task in computer vision. Plankton data naturally exhibit an imbalanced class distribution, with some plankton species occurring much more commonly than others. This results in highly biased datasets and makes it difficult to learn to recognize rare species, seriously affecting classifier performance. Furthermore, with highly imbalanced datasets, the overall classification accuracy (e.g. the percentage of correctly classified images) provides little information about the classes with few samples, which may bias the evaluation of classification methods.

3. Visual differences between certain classes are small. Certain plankton species, especially those that are taxonomically close to each other and/or of reduced size, resemble each other visually, which renders the recognition task a fine-grained classification problem. Limitations in the amount of labeled training data make it challenging to ensure that the recognition model learns the subtle differences between the classes, reducing the recognition accuracy.

4. Imaging instruments vary between datasets. If two datasets have been obtained with different imaging instruments producing visually different images (domain shift), a classification model trained on one dataset does not provide sufficient classification accuracy on the other when applied directly. This makes it challenging to develop general-purpose classifiers applicable to new datasets, limiting the usefulness of the existing publicly available large image datasets. There is a need for approaches that allow trained models to be adapted to new imaging instruments.

5. Labeled training sets do not contain all the classes that can be captured. When a recognition model is deployed in operational use, it should be able to handle images from classes that were not present in the training phase. Different datasets often have different sets of plankton species due to, for example, the geographical distance between the imaging locations or the particle size range of the imaging instruments. Moreover, imaging instruments capture images of unknown particles. Typical CNN-based classification models trained on one dataset tend to classify images from a previously unseen class into one of the known classes, often with high confidence; this not only makes the models incapable of generalizing to new datasets and analyzing noisy data, but also makes it difficult to recognize when the model fails. This calls for methods that can identify when an image comes from a previously unseen class (species).

6. There are uncertainties in expert labels. Due to limited imaging resolution and low image quality, recognizing plankton species is often difficult even for an expert. Manually labeling large amounts of images is tedious work, increasing the risk of human error. Moreover, due to the high cost of labeling, it is typically not possible to obtain opinions from multiple experts for each image. These factors cause inaccuracies (uncertainty) in the labels of the training data, decreasing the classification performance of the trained models. Furthermore, this uncertainty is often highly imbalanced, since some classes are easier to identify than others.

7. Variation in image size and aspect ratio is very large. Most CNN architectures require input images of fixed dimensions, and a typical approach in image classification is to first scale the images to a common size. This is not ideal in plankton recognition due to the very large variation in both the size and aspect ratio of plankton. Scaling images to a common size may cause either small details to be lost in large images (downscaling) or very large and computationally heavy models (upscaling). Furthermore, size is an important cue for recognizing plankton species, and this information is lost in scaling.

8. Image quality can be low or vary extensively. Plankton imaging requires high magnification, and (natural) water may contain other particles, cause unwanted optical distortions, and limit visibility. More importantly, due to the limited depth of field, automated imaging instruments often fail to capture particles in focus, and the focus may drift away from its optimal setting. These factors reduce image quality. Low image quality makes both manual labeling (Challenge 6) and automatic classification considerably more difficult. Therefore, there is a need for plankton recognition solutions that are robust to image distortions such as blur and noise.

9. The amount of image data is massive. Modern plankton imaging instruments produce massive amounts of image data; for example, the FlowCam Macro and the ISIIS can capture 10,000 images per minute and 64,000 images per hour, respectively. Computationally efficient solutions are needed to perform the analysis in real time (MacLeod et al. 2010; Orenstein et al. 2015).

All nine challenges are visualized in Fig. 4.

Fig. 4 The nine main challenges that complicate the introduction of automatic plankton recognition methods to operational use

5 Existing solutions

5.1 Challenge 1: Limited amount of labeled training data

The two main reasons limiting the amount of labeled training data, the requirement of expert knowledge for the laborious labeling task and the rarity of certain plankton species, call for different solutions.

Active learning has been utilized to minimize the effort of expensive human experts in labeling plankton image data (Luo et al. 2005; Li et al. 2021a). The basic idea behind active learning is to select only the most informative samples for labeling. A classifier is first trained on a small initial training set, and the method then iteratively seeks the most informative samples from an unlabeled dataset. These samples are labeled by a human expert and the model is re-trained. A simple active learning technique for plankton images called "breaking ties" was proposed by Luo et al. (2005). The method utilizes a probability approximation for an SVM-based classifier and ranks the unlabeled images based on the difference between the largest and second-largest class probabilities (the smaller the difference, the less confident the classifier). The images with the smallest confidence are labeled by an expert. Drews et al. (2013) studied semi-automatic classification and active learning approaches for microalgae identification. A Gaussian mixture model (GMM) is estimated from the image feature data, and three different sampling strategies are used for the active learning. The experimental results show the benefit of using active learning to improve the performance with few labeled samples. Bochinski et al. (2018) proposed Cost-Effective Active Learning (CEAL) (Wang et al. 2016) for plankton recognition. In contrast to traditional active learning, where only the manually annotated samples are used in the model training, CEAL also utilizes the unlabeled high-confidence samples for training, with class predictions as pseudo-labels. Haug et al. (2021b); Haug (2021); Haug et al. (2021a) proposed the Combined Informative and Representative Active Learning (CIRAL) technique to minimize human involvement in the plankton image labeling process. The main idea behind the method is to find the images with minimal perturbations that are often misclassified and to ignore the images that are far from the decision boundary. The DeepFool algorithm is used to compute small perturbations of the images, and finding the representative images is formulated as a min-max facility location problem and solved using a greedy algorithm.
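The breaking-ties rule itself is simple to implement; a minimal sketch over class-probability estimates:

```python
# Sketch of the "breaking ties" sampling rule of Luo et al. (2005):
# rank unlabeled images by the gap between the two largest class
# probabilities and send the most ambiguous ones to the expert.
import numpy as np

def breaking_ties(probabilities, n_queries=10):
    """probabilities: (n_samples, n_classes) class-probability estimates."""
    top2 = np.sort(probabilities, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]       # small margin = low confidence
    return np.argsort(margin)[:n_queries]  # indices to label next
```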

While active learning helps reduce manual work, it is often still a time-consuming process, and there is typically a need to obtain more training data in a completely automated manner. A traditional approach to increase the amount of training data is data augmentation: by augmenting the existing labeled image data with various image manipulations, the diversity of the training data, and therefore the generalizability and accuracy of the trained model, can be improved. The most commonly used data augmentation techniques for plankton image recognition include various geometric transformations (Orenstein and Beijbom 2017; Vallez et al. 2022), including rotation (e.g. Cheng et al. 2019; Correa et al. 2017), shearing (e.g. Dai et al. 2016a; Geraldes et al. 2019), flipping (e.g. Ellen et al. 2019; Geraldes et al. 2019), and rescaling (e.g. Li and Cui 2016; Luo et al. 2018). Additive noise (e.g. Correa et al. 2017; Geraldes et al. 2019), blurring (Geraldes et al. 2019), contrast normalisation (Geraldes et al. 2019), and adjustments of brightness, saturation, contrast, and hue (Dunker et al. 2018) have also been utilized. Some works augment images using translation (e.g. Dai et al. 2016a; Li and Cui 2016); however, unless the translation cuts away part of the image, CNNs are invariant to translation by design, so this is typically unnecessary when CNNs are used for recognition. Augmentation has been shown to increase plankton recognition accuracy even with relatively large training sets (see e.g. Song et al. 2020). Examples of augmented images are shown in Fig. 5.
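A sketch of a typical augmentation pipeline covering these transformations (assuming torchvision; parameter values are illustrative):

```python
# Sketch of a training-time augmentation pipeline with torchvision,
# covering the transformations most often reported for plankton images.
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=180),          # orientation is arbitrary
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomAffine(degrees=0, shear=10, scale=(0.8, 1.2)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.GaussianBlur(kernel_size=3),
    transforms.ToTensor(),
])
```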

Fig. 5 Examples of data augmentation methods

Another commonly used approach to address a small amount of labeled training data is transfer learning. Transfer learning is a machine learning method that utilizes knowledge gained from a source domain, where labeled training data are abundant, in a target domain, where labeled training data are scarce (Pan and Yang 2009; Shao et al. 2014; Weiss et al. 2016) (see Fig. 6). In the context of plankton recognition, this typically means that the model is first trained using either a general image dataset [e.g. ImageNet (Deng et al. 2009)] or a large publicly available plankton dataset and then fine-tuned on the target plankton dataset, which typically has a limited number of labeled images. Using general image databases as source data is justified by the fact that the learned low-level image features are often useful regardless of the classification problem. In the simplest case, transfer learning can be done by simply replacing and training the classification layer while keeping the feature extraction layers unchanged (see e.g. Mitra et al. 2019). However, it is often beneficial to use the pre-trained network only for initialization and to retrain (or fine-tune) the whole network with the target dataset (Lumini et al. 2020). Existing studies on the WHOI-Plankton dataset suggest that using pre-trained models and fine-tuning them on plankton data (see e.g. Lumini and Nanni 2019a) can achieve significantly higher accuracy than training the models from scratch on plankton data (see e.g. Liu et al. 2018a).
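A minimal sketch of this recipe with torchvision (the backbone and class count are illustrative):

```python
# Sketch of the common transfer-learning recipe: initialize from
# ImageNet weights, replace the classification layer for the plankton
# classes, and fine-tune either the head alone or the whole network.
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 20)   # e.g. 20 plankton classes

# Option 1: freeze the pretrained feature extractor, train only the head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

# Option 2: fine-tune everything (often better with enough labeled data)
# by leaving requires_grad=True and using a small learning rate.
```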

Fig. 6 The difference between traditional machine learning and transfer learning
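Both transfer learning variants described above (retraining only the classification layer, or fine-tuning the whole network) can be sketched as follows, assuming a torchvision ResNet-18 pre-trained on ImageNet; the number of target classes is a placeholder.

import torch.nn as nn
import torchvision.models as models

num_classes = 50  # hypothetical number of plankton classes
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new classifier head

# Simplest variant: freeze the feature extractor and train only the head;
# alternatively, leave all parameters trainable to fine-tune the whole network.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True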

One way to apply transfer learning to plankton images is to use trained CNNs only for feature extraction and utilize general classification methods such as SVM or RDF for the recognition (see e.g. Rodrigues et al. 2018; Rawat et al. 2019). However, the results by Orenstein and Beijbom (2017) suggest that better accuracy is obtained by utilizing an end-to-end CNN with classification layers. Lumini and Nanni (2019a); Lumini et al. (2020) evaluated various strategies for transfer learning on plankton images. The first strategy was to initialize the model with ImageNet weights and fine-tune the whole model with plankton data. In the second strategy (two-round tuning), a second pre-training step utilizing out-of-domain plankton image data was added before the fine-tuning. In the third strategy, ensembles of multiple different models were used. Based on the experiments, the two-round tuning did not provide a notable improvement in accuracy. Similarly, Guo et al. (2021b) explored and compared multiple transfer learning schemes on several biology image datasets from various domains. Various underwater and ecological image datasets were utilized for multistage transfer learning, where ImageNet pretraining is first improved by fine-tuning on an intermediate dataset before, finally, training on the target dataset consisting of plankton images. The experimental results show the potential of cross-domain transfer learning even on out-of-domain data when the number of samples in the target domain is insufficient.

Large models with many parameters typically require a large amount of data to be trained without overfitting. To avoid this and allow training with a smaller amount of data, shallower CNN architectures have been proposed for plankton recognition. For example, the 18-layer version of the ResNet architecture has been shown to achieve a high plankton recognition accuracy on IFCB data (Kraft et al. 2022b). Most custom CNN architectures developed especially for plankton recognition, including ClassyFireNet (Jindal and Mundra 2015), TANet (Li et al. 2019c), and ZooplanktoNet (Dai et al. 2016a), are relatively shallow with 8, 8 and 11 layers, respectively. It has been shown that a good classification accuracy can be obtained with a shallow architecture and suitable data augmentation methods even with as few as 10 images per class (Kraft et al. 2022b).

In addition to data manipulation and custom recognition models, model training approaches have also been considered to address the limited data amounts. Learning techniques developed for training a classifier with a minimal amount of samples are called few-shot learning methods. Typically, the idea is to utilize some prior knowledge to allow generalization to new tasks (in this case, classification of new plankton species) containing only a few labeled training examples. Common ways to address few-shot learning are to utilize generation (Hariharan and Girshick 2017), or embedding and metric learning. The basic idea is to learn such embeddings that images from the same class are close to each other in the metric space and images from different classes are far apart. This allows performing the plankton recognition using distances to images of known plankton species. Embedding and metric learning have been successfully applied to plankton recognition (Teigen et al. 2020; Badreldeen Bdawy Mohamed et al. 2022).

Schröder et al. (2018) employed a low-shot learning technique called weight imprinting (Qi et al. 2018) for plankton recognition with a limited amount of labeled training data. The main idea of weight imprinting is to divide the set of all classes into base classes with enough training data and smaller low-shot classes. During the representation learning phase, a CNN is trained to distinguish the base classes with a large amount of labeled training data. In the second phase (low-shot learning), the classifier is updated with calculated weights to distinguish the smaller low-shot classes. This is done by using appropriately scaled class features of the low-shot classes as their weights, directly allowing the inclusion of classes with only one training image. Guo and Guan (2021) addressed few-shot learning by supplementing the softmax loss with a center loss term (Wen et al. 2016) that forces samples from the same class close to each other in the deep feature space. The loss function is a weighted sum of the two loss terms and a regularization parameter is used to control the weights.
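A sketch of the imprinting step itself is given below: the imprinted weight of a new low-shot class is the re-normalized mean of its normalized example embeddings (following the idea of Qi et al. 2018; the cosine-classifier bookkeeping around it is omitted).

import torch
import torch.nn.functional as F

def imprinted_weight(feats):
    # feats: (k, d) embeddings of the k available images of a low-shot class
    w = F.normalize(feats, dim=1).mean(dim=0)  # average of normalized features
    return F.normalize(w, dim=0)               # rescale back to unit length

# The vector is appended as a new row of a cosine-similarity classifier:
# W = torch.cat([W, imprinted_weight(new_feats).unsqueeze(0)], dim=0)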

In the extreme case, labeled training data are completely absent and unsupervised learning methods are required. Image clustering is the most commonly used unsupervised technique for plankton image analysis. Ibrahim (2020) carried out preliminary experiments on common clustering algorithms such as k-means with phytoplankton data. Image features for clustering were extracted using pretrained CNN models. Coltelli et al. (2014) used various handcrafted image features and self-organizing maps (SOM) for plankton image clustering. Schmarje et al. (2021) proposed a framework for handling semi-supervised classification of fuzzy labels arising from experts having different opinions. The approach is based on overclustering to identify substructures in the fuzzy labels and a loss function to improve the overclustering. The performance surpassed that of a state-of-the-art semi-supervised method on plankton data. Salvesen (2021); Salvesen et al. (2022) studied deep learning for plankton classification without ground truth labels. The improved feature learning was implemented using DeepCluster, a Generative Adversarial Network (GAN), and a rotation-invariant autoencoder. Despite the potential of unsupervised methods, the gap to supervised learning is still significant.
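The feature-extraction-plus-clustering recipe is straightforward; the sketch below uses an ImageNet-pretrained ResNet-18 as the feature extractor and k-means on the pooled features, with the image tensor and the number of clusters as placeholder assumptions.

import torch
import torchvision.models as models
from sklearn.cluster import KMeans

backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()   # expose the 512-d pooled features
backbone.eval()

with torch.no_grad():
    feats = backbone(images).numpy()   # images: (N, 3, 224, 224) float tensor

cluster_ids = KMeans(n_clusters=20, n_init=10).fit_predict(feats)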

Hierarchical clustering methods are preferred on plankton data as they have the potential to mimic the taxonomic hierarchy of plankton. In Dimitrovski et al. (2012), classification of diatom images is considered a hierarchical multi-label classification problem and solved by constructing predictive clustering trees that can simultaneously predict all levels in the taxonomic hierarchy. These trees are then used as an ensemble forming a random forest (RF) to improve the predictive performance. MorphoCluster (Schröder et al. 2020) utilizes a semi-automated iterative approach and hierarchical density-based HDBSCAN* (Campello et al. 2015) for plankton image data analysis. To compute image features for the clustering, a CNN trained in a supervised manner on the UVP5/EcoTaxa dataset was used. The method works iteratively so that the obtained clusters are validated by an expert. An improved version of MorphoCluster was presented by Schröder and Kiko (2022). Multiple CNN-based feature extractors were trained using different labeled datasets to allow the selection of the most suitable feature extractor for the target data. In addition, an unsupervised approach to learn the plankton image features based on the momentum contrast method (He et al. 2020) was proposed. The idea is to use data augmentation to generate two different instances of the same image and use a loss function that forces the model to learn similar feature representations for both instances. Moreover, two custom clustering methods were proposed: (1) shrunken k-means, and (2) partially labeled k-means. Due to the iterative clustering process of MorphoCluster, only part of the images needs to be clustered in each iteration. Shrunken k-means utilizes distances to cluster centers provided by k-means to discard images that are far from the centers. Partially labeled k-means utilizes the label information from the earlier iterations to guide the clustering.

Autoencoders have also been proposed for learning plankton image features for clustering without label information. The basic idea is to utilize an encoder-decoder network architecture where the encoder generates an embedding vector from an image and the decoder tries to reconstruct the original image based on the embedding vector. Such a network can be trained without any labels. Ideally, the encoder learns to compress the essential information of the image into an embedding vector that can then be used for clustering. For example, Salvesen et al. (2020) applied an autoencoder-based approach called Deep Convolutional Embedded Clustering (DCEC) to plankton image data. The method employs the CNN-based autoencoder architecture by Guo et al. (2017) and uses k-means to cluster the obtained embeddings. Alfano et al. (2022) proposed a plankton image clustering technique based on variational autoencoders (VAEs). The method utilizes a pre-trained DenseNet without fine-tuning to extract features. The obtained deep image features are then fed to a VAE to generate latent space representations. Finally, the low-dimensional embeddings are clustered using fuzzy k-means.
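A minimal convolutional autoencoder of this kind is sketched below; the architecture and the 64x64 grayscale input are illustrative choices, not the DCEC network of Guo et al. (2017).

import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, embed_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, embed_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                               output_padding=1), nn.ReLU(),        # 16 -> 32
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                               output_padding=1), nn.Sigmoid(),     # 32 -> 64
        )

    def forward(self, x):                # x: (N, 1, 64, 64) images
        z = self.encoder(x)              # embedding used for clustering
        return self.decoder(z), z

The network is trained with a reconstruction loss (e.g. mean squared error between the input and the decoder output), after which the embeddings z are clustered, e.g. with k-means.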

Clustering methods are only able to produce unlabeled clusters of images with a similar appearance. Therefore, further analysis is needed to confirm and label the clusters. Schröder et al. (2020) addressed this by introducing an interactive tool where the users revise the obtained clusters, manually correct the hierarchy, and annotate the final set of clusters. This semi-automatic approach reduces the manual work needed for data labeling as the expert does not need to annotate every image separately. Goulart et al. (2021) utilized t-distributed stochastic neighbor embedding (t-SNE) to visualize the clusters in two-dimensional space, allowing the human expert to quickly see the clusters in the data. Pastore et al. (2020) proposed a full pipeline for environmental monitoring based on plankton image clustering and minimal expert supervision (the expert labels only one image per cluster). A CNN was used for image feature extraction, and various unsupervised clustering algorithms including k-means, fuzzy k-means, and Gaussian mixture models were compared.

Unsupervised learning has also been applied for pre-training on unlabeled plankton image data. This enables a semi-supervised approach to plankton recognition where an initial set of image features is learned in an unsupervised manner using large volumes of unlabeled data, and the final model is obtained by fine-tuning on a small amount of expert-labeled data. Schanz et al. (2023) proposed to use the SimCLR method (Chen et al. 2020) for unsupervised pre-training. Pastore et al. (2023) applied a customized variational autoencoder for unsupervised feature learning and compression.

As a summary, the most common approaches to tackle the problem of a limited amount of labeled plankton image data are data augmentation and transfer learning. Data augmentation is an essential part of practically all modern plankton recognition pipelines based on deep learning, while transfer learning makes it possible to utilize knowledge from another domain to compensate for the lack of labeled training data. In the case of extreme scarcity of labeled training data, further modifications to the model training are needed. Typically this means the adoption of regularization techniques that prevent the model from overfitting to the training data. Weight imprinting, metric learning, and center loss have been found to be useful tools in few-shot plankton recognition. If labeled training data are completely missing, clustering or active learning can be utilized. Clustering makes it possible to analyze plankton image datasets in an unsupervised manner, while active learning minimizes the amount of expert labeling effort needed for building a plankton recognition model for future data.

5.2 Challenge 2: High class imbalance

High class imbalance is naturally inherent in many real-world applications and plankton recognition is no exception. Certain plankton species are considerably more common than others, causing the data in typical plankton datasets to be highly imbalanced. This is problematic when it comes to training plankton classification methods. One of the most notable problems connected to the high class imbalance is catastrophic forgetting, where a neural network, while learning new information, completely forgets previously learned information. This typically affects the minority classes that are only rarely seen during the training stage, causing the network to learn the necessary image features only for the majority classes.

Undersampling is a technique to decrease the level of imbalance by discarding images from the majority classes. In the simplest case, undersampling can be done by randomly selecting a subset of images from the majority classes in such a way that the resulting training dataset has an equal number of images in all classes. For example, Lee et al. (2016) reduced the class bias on small-sized plankton classes by randomly sampling images from the classes with more samples than a predefined threshold. Kloster et al. (2020) utilized a similar undersampling technique. Also, more intelligent solutions for undersampling have been suggested in the plankton literature. Le et al. (2022) utilized undersampling by filtering combined with cost-sensitive learning to obtain a more balanced dataset for training. Ding et al. (2018) proposed an EasyEnsemble.D algorithm for plankton recognition on highly imbalanced datasets. The basic idea is to sample multiple subsets from the majority classes to fully utilize the large data volumes. Each subset is used to train a separate weak classifier with different weights, and the final classification is performed using the ensemble of the weak classifiers. The problem with undersampling is that it reduces the amount of training data, which in the case of plankton recognition is typically already limited. In the presence of rare species especially, undersampling alone leads to an extremely small training set.
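A minimal sketch of threshold-based random undersampling in the spirit of Lee et al. (2016) is given below; the cap value is a hypothetical threshold.

import numpy as np

def undersample(labels, cap, seed=0):
    # Randomly keep at most `cap` samples per class.
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if len(idx) > cap:
            idx = rng.choice(idx, size=cap, replace=False)
        keep.extend(idx.tolist())
    return np.array(keep)

# e.g. train_idx = undersample(y_train, cap=500)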

Oversampling is another technique to reduce the level of imbalance by duplicating samples from the minority classes. The oversampling is typically done using data augmentation, i.e. instead of using identical duplicates, manipulated versions are created to obtain more training data for the minority classes. For example, Bochinski et al. (2018) increased the number of training samples of the smaller classes by mirroring the images horizontally and vertically to counter the imbalance during training.

Xiaoyan (2020) proposed a combination of undersampling and oversampling to address the class imbalance in plankton recognition. This is done by utilizing the KA-Ensemble algorithm (Ding et al. 2020) that combines oversampling of the minority class via kernel-based adaptive synthetic sampling (Kernel-ADASYN) and random undersampling of the majority class. The experiments showed increased classification accuracy for the minority class. Liu et al. (2021a) proposed to combine borderline-SMOTE oversampling with fuzzy C-means clustering-based undersampling for plankton image data. The Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al. 2002) synthesizes new samples between minority class samples and their nearest neighbors in the feature space. Borderline-SMOTE (Han et al. 2005) improves the method by concentrating on the samples near the class boundaries in order to oversample more significant samples for the minority classes. Fuzzy C-means clustering is utilized to preserve the clusters found in the original data during undersampling.
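The combination of SMOTE-style oversampling and undersampling can be sketched with the imbalanced-learn library as below; note that SMOTE operates on feature vectors (e.g. extracted CNN features) rather than raw images, and borderline-SMOTE with random undersampling stands in here for the Kernel-ADASYN and clustering-based variants discussed above. The feature matrix and label vector are hypothetical.

from imblearn.over_sampling import BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler

# X_feat: (n_samples, n_features) feature vectors, y: class labels
X_over, y_over = BorderlineSMOTE(k_neighbors=5, random_state=0).fit_resample(X_feat, y)
X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X_over, y_over)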

An alternative to resampling methods is cost-sensitive learning (Elkan 2001). The method defines a so-called cost matrix which specifies a reward or a penalty for each classification outcome. The core idea is similar to resampling, but the class prevalence of the training set is not changed directly. However, a performance evaluation on an imbalanced plankton set reported by Corrêa et al. (2016) demonstrates only minor improvements for the cost-matrix approach in comparison to SMOTE and resampling.
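In its simplest diagonal form, cost-sensitive learning reduces to weighting each class's loss contribution, e.g. inversely to its frequency, so that misclassifying a rare species is penalized more; the sketch below shows this with a weighted cross-entropy loss (train_labels is a placeholder tensor of training labels).

import torch
import torch.nn as nn

counts = torch.bincount(train_labels)                    # samples per class
weights = counts.sum() / (len(counts) * counts.float())  # inverse frequency
criterion = nn.CrossEntropyLoss(weight=weights)          # cost-sensitive loss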

Another solution to artificially create more image data for training and to reduce the level of imbalance is to utilize generative models capable of generating realistic images following a certain distribution. GANs (Goodfellow et al. 2014) are deep learning models that can be used to generate photo-realistic artificial images with the same statistics as the data they were trained with. This is done by using two models, a generative model and a discriminative model. The generative model generates candidate images, usually from random noise. The discriminative model is an image classifier that is given labeled samples from the real set of images and fake images produced by the generative model. The task of the discriminative model is to distinguish real images from fake ones, and the task of the generative model is to fool the discriminative model. These two models are trained simultaneously in such a way that the generative model becomes increasingly better at producing realistic fake images and the discriminative model increasingly better at recognizing them. GANs have been shown to be able to generate images that appear authentic to human observers.
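One adversarial training step, heavily condensed, could look as follows; the generator G, discriminator D (outputting a single logit per image), their optimizers, and the noise dimension are assumptions defined elsewhere.

import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def gan_step(G, D, opt_G, opt_D, real, z_dim=100):
    n = real.size(0)
    z = torch.randn(n, z_dim)

    # Discriminator: push real images towards 1, generated images towards 0.
    opt_D.zero_grad()
    d_loss = bce(D(real), torch.ones(n, 1)) + \
             bce(D(G(z).detach()), torch.zeros(n, 1))
    d_loss.backward()
    opt_D.step()

    # Generator: fool the discriminator into scoring fakes as real.
    opt_G.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(n, 1))
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()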

GANs have also been utilized for reducing the bias caused by class imbalance in plankton recognition. Wang et al. (2017) used a GAN to generate new example images of minority classes. Furthermore, a method was proposed where the CNN-based plankton recognition model shares its weights with the discriminative model. However, only minor improvement was observed over the baseline recognition models trained on the original data without GAN-based data augmentation. Liu et al. (2018b) proposed a GAN-based curriculum learning strategy. The proposed method contains two stages: first, the model is trained using the original data, and then with more complex data consisting of GAN-generated images. Li et al. (2021c) utilized CycleGAN (Zhu et al. 2017) for the augmentation of rare taxa, and Khan et al. (2022); Ali et al. (2022) applied DC-GAN (Radford et al. 2015) to augment an algae image dataset. Vallez et al. (2022) compared two augmentation strategies: combining two diatom images from the same class using morphing and image registration methods performing diffeomorphic transformations, and generating synthetic images with a GAN. In this study, mixing images using morphing achieved better results. The fundamental problem of using GANs for image augmentation is that the generated images have the same statistics as the images the GAN was trained with. Therefore, if the GAN is trained using the same data as the recognition model, and the recognition model is able to learn the data distribution from the original data, the generated samples do not necessarily provide additional value for the training. However, some promising results have been obtained on GAN-based augmentation of highly imbalanced datasets (Tanaka and Aranha 2019).

Similarly to the challenge of a limited amount of labeled training data, transfer learning has also been proposed to overcome the class imbalance problem. In a method proposed by Lee et al. (2016), a balanced dataset is first generated using randomized undersampling, the model is pre-trained on the balanced dataset, and finally fine-tuned using the whole unbalanced plankton image dataset. Wang et al. (2018) introduced a transfer parallel model approach for plankton recognition. The main idea is to avoid catastrophic forgetting by training two submodels: (1) a model trained on the whole dataset, and (2) a pre-trained model trained only on the small classes. Deep features from both models are concatenated before the softmax layer. The latter submodel adds good image features for minority class classification that the network could otherwise fail to learn.

Also, modified model architectures have been proposed to address the class imbalance. These include models with increased generalization ability to minority classes. Liu et al. (2018a) applied the Deep Pyramidal Residual Network (PyramidNet) (Han et al. 2017) to plankton recognition and showed improved accuracy on a highly imbalanced dataset. The idea behind PyramidNet is to gradually increase the size of the feature map. This, combined with ResNet-style skip connections, reduces the chance of overfitting, and therefore improves the generalization ability. Kerr et al. (2020) proposed model fusion to address the class imbalance. The results suggest that combining multiple individually trained CNNs with a common softmax layer improves the accuracy on rare species, consequently providing better overall accuracy on imbalanced data.

As a summary, undersampling and oversampling are the simplest and most widely used approaches to address high class imbalance in plankton image data. Oversampling is typically performed using traditional data augmentation, but also generative approaches such as GANs have been proposed to generate completely new plankton images for the minority classes. Moreover, transfer learning, model fusion, and regularization techniques preventing overfitting have been shown to improve plankton recognition accuracy in the case of highly imbalanced training data.

5.3 Challenge 3: Fine-grained nature of the recognition task

In order to obtain high recognition accuracy on classes with high inter-class similarity, such as taxonomically close plankton species, techniques that focus attention on subtle visual differences are needed. The task of recognizing hard-to-distinguish classes from each other is called fine-grained classification. Plankton recognition can in most cases be considered a fine-grained classification task, as the fundamental way to improve the overall accuracy of a recognition model is to make it better at recognizing the challenging cases. Despite this, most of the work on plankton recognition does not tackle the challenge directly but instead focuses on comparing different general model architectures on the task. Related to this viewpoint, it has also been studied whether the recognition should be considered a flat or a hierarchical classification task. Boddy et al. (2000) considered misclassifications of phytoplankton to result from the overlap of feature distributions and proposed grouping similar species within genera or based on groupings indicated in dendrograms. Similarly, Fernandes et al. (2009) proposed an approach for balancing the trade-off between the classification performance and the number of classes. The model automatically suggests merging of classes based on the statistics evaluated after the classification. The results from taxa recognition of macroinvertebrates by Ärje et al. (2020) showed that humans performed better when a hierarchical classification approach commonly used by human taxonomic experts was used, but with a flat classification approach, the CNN was close to human accuracy. To improve the automatic approaches, a few methods focusing especially on the attention mechanism have been proposed to address the fine-grained nature of the recognition task.

Sun et al. (2020) considered fine-grained classification of plankton by proposing an attention mechanism based on Gradient-weighted Class Activation Maps (Grad-CAM) (Selvaraju et al. 2017) to force the CNN to focus on the most informative regions in the image. Grad-CAM was originally developed for visualizing CNN-based models. It highlights the important image regions that correspond to the decision of interest (in this case, plankton recognition). Sun et al. (2020) utilized Grad-CAM to detect the regions to focus on, and a feature fusion approach utilizing high-order integration (Cai et al. 2017) is applied to obtain stronger features for those regions. This approach shares similarities with the self-attention module used in the TANet architecture (Li et al. 2019c) for plankton recognition. However, the self-attention module puts larger weights on the important regions, i.e. those regions in the feature map with high activation values. Ito et al. (2023) proposed to use an Attention Branch Network model for hierarchical classification of plankton images. This was motivated by the hierarchical structure of the plankton taxonomy: successful classification at the higher levels of the taxonomy simplifies the fine-grained recognition task at the lower levels.
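A bare-bones Grad-CAM, sufficient to convey the mechanism, is sketched below: the feature maps of a chosen convolutional layer are weighted by the spatially pooled gradients of the target class score. The model and layer arguments are assumed to be a trained PyTorch CNN and one of its convolutional blocks.

import torch.nn.functional as F

def grad_cam(model, layer, image, target_class):
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))

    score = model(image.unsqueeze(0))[0, target_class]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    w = grads["g"].mean(dim=(2, 3), keepdim=True)   # pooled gradients
    cam = F.relu((w * acts["a"]).sum(dim=1))        # weighted feature maps
    return cam / (cam.max() + 1e-8)                 # normalized heat map

# e.g. cam = grad_cam(cnn, cnn.layer4, img_tensor, predicted_class)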

Also other approaches for fine-grained plankton recognition have been proposed. Du et al. (2020) applied a Matrix Power Normalized Covariance (MPN-COV) pooling layer for second-order feature extraction. The aim is to model the complex class boundaries more accurately than with traditional first-order pooling. There is some evidence (Li et al. 2017) suggesting that higher-order information can improve recognition accuracy in fine-grained tasks. Venkataramanan et al. (2021) proposed an improved pipeline tackling inter-class similarity and intra-class variance. The authors suggested alleviating inter-class similarity with a metric learning-based approach utilizing triplet loss and mitigating intra-class variance with an X-means clustering technique applied to the extracted features. The idea is to split the classes with high intra-class variance into multiple clusters and consider these as separate classes. The authors propose a method to find the optimal number of clusters that minimizes both the intra-class variance and the inter-class similarity, and this way improves the accuracy of fine-grained plankton recognition. Si et al. (2023) proposed to use a token-selective vision transformer for fine-grained recognition of marine organisms including plankton. The most important tokens, which focus on distinctive features, are selected layer by layer.

In general, only a few papers directly tackling the fine-grained nature of the plankton recognition task exist. These are based on attention mechanisms that find the most important regions in the images, allowing the recognition model to focus on the subtle differences between the classes, and on contrastive or metric learning that allows explicitly learning the image features that separate pairs of classes.

5.4 Challenge 4: Domain shift between datasets

Different imaging instruments cause domain shift between plankton datasets. Domain shift in a wider sense refers to a situation where the distribution of the dataset used for training differs from the data to which the recognition model is applied. CNN-based models tend to learn image features that are very specific to the distribution of the training data, making them notoriously weak at generalizing beyond the domain they were trained on (Gulrajani and Lopez-Paz 2020). This is why most automatic plankton recognition solutions focus on just one imaging instrument, which, however, limits the wider utilization of the methods. Tuning a classification model trained on one dataset to work on another dataset (correcting the domain shift between the datasets) is called domain adaptation (Ben-David et al. 2010), and learning a general model that can be applied to any dataset (domain) is called domain generalization (Zhou et al. 2022).

While domain adaptation and generalization have not been widely studied for plankton recognition, there have been works where multiple different plankton image datasets have been utilized to solve the recognition task. Transfer learning and fine-tuning have been utilized to mitigate the differences between datasets. Rodrigues et al. (2018) applied transfer learning using CNNs to obtain a feature extractor that can be used for new datasets. The Kaggle-Plankton dataset was used to train a CNN (source dataset) and an in-house dataset was used as the target dataset to test the suitability of the features. Orenstein and Beijbom (2017) applied a variety of learning schemes to three very different plankton image datasets. The bigger labeled image datasets, IFCB and ISIIS, were used to train CNNs both by fine-tuning and from scratch. Then, the classifiers were used to classify within-domain images directly and as feature extractors for out-of-domain data. Maracani et al. (2023) performed a similar experiment where out-of-domain datasets were used for pretraining and small plankton datasets for fine-tuning. Surprisingly, ImageNet pretraining provided higher accuracy on the target datasets than pretraining done on large-scale plankton datasets (e.g. WHOI).

Lumini and Nanni (2019a); Lumini et al. (2020) studied ensembles of different CNN models, fine-tuned on several datasets, with the aim of exploiting their diversity in designing an ensemble classifier. The experimental results show that the combination of several CNNs in an ensemble yields a performance improvement compared with a single CNN model.

In Bochinski et al. (2018), two datasets from different biological environments were captured and analyzed. The first dataset was used to analyze the achievable accuracy of the CNN and how Cost-Effective Active Learning (CEAL) can be used to minimize the number of required annotations. The second dataset was used to examine the generalization ability of the CNN and whether the CEAL method can be used to fine-tune the system to adapt to the characteristics of the new data.

Plonus et al. (2021a) suggested using capsule neural networks combined with probability filters to address the dataset shift caused by different plankton imaging instruments. The idea of capsule neural networks is to form groups of neurons (capsules) that learn the specific properties of the object (e.g. plankton) in the image. The authors argue that capsule neural networks are less sensitive to changes in the field conditions and therefore able to adapt to different data distributions. Guo et al. (2022b) proposed a cross-domain few-shot learning model for instrument-agnostic plankton recognition. Similarly to transfer learning, the model is first trained on the source domain with a large amount of labeled training data and then adapted to the target domain using fine-tuning. In addition, graph neural network-based meta-learning is applied to learn a feature distance metric capable of recognizing plankton species in the target dataset with a very limited amount of labeled data.

Domain shifts between plankton image datasets or imaging instruments have not been widely studied. Most works focus on fine-tuning recognition models trained on one dataset to new datasets using transfer learning. While transfer learning reduces the amount of manual labeling needed for new datasets, it does not fully solve the problem of multiple domains: labeled training data are still needed for all datasets, and the recognition models need to be fine-tuned for each, requiring expertise in machine learning and computing resources. A more general model can be obtained by using ensemble learning with submodels learned on different datasets, if labeled training data for each dataset (imaging instrument) are available. More sophisticated approaches to plankton image domain adaptation include capsule neural networks and meta-learning.

5.5 Challenge 5: Previously unseen classes and unknown particles

Automated plankton imaging instruments capture images of unknown particles, and the class (plankton species) composition varies between geographical regions and ecosystems. CNN-based models are known to struggle in open-set settings where the class composition of the training data differs from the data to which the trained model is applied. Typical CNN-based classification models tend to assign images from a new class to one of the known classes, often with high confidence, and including new classes requires retraining the model. These are major problems for plankton recognition as the plankton species vary between regions and seasons, and retraining a separate model for each dataset is not feasible. Therefore, there is a need for a recognition model that (1) is able to predict when an image contains a previously unknown plankton species (open-set recognition) and (2) can be generalized to new classes without retraining the whole model.

In the case of plankton recognition, the open-set problem is often formulated as an anomaly detection problem where the model is trained both to correctly classify the known classes and to filter out abnormal classes, for example, by training the model to produce low-entropy output distributions for the normal classes and high-entropy distributions for the abnormal classes. Pastore et al. (2020) proposed a semi-automatic method to handle previously unseen plankton classes by utilizing anomaly detection combined with expert verification. Both a one-class SVM and a new neural network-based method called the Delta-Enhanced Class (DEC) detector were considered. The DEC detector utilizes absolute differences between the feature vectors of an input image and random images from a known class as additional input to predict whether the input image is from the known class or an anomaly. Varma et al. (2020) proposed \(L_1\)-norm tensor-conformity curation to remove outliers (non-plankton or misclassified images) from the training data. The idea is to measure the conformity of the images using \(L_1\)-norm subspaces (Tountas et al. 2019). Conradt et al. (2022) brought up the high intra-class and low inter-class variation of plankton morphology, and the spatio-temporal changes in the plankton community, as the main causes for the need to frequently validate the results of automatic recognition. The proposed remedy is a dynamic optimization cycle in which the model is updated based on manual-validation results.
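As a concrete example of the one-class formulation, a detector can be fitted per known class on that class's feature vectors, and a query that no detector accepts is treated as an unknown particle; the sketch below uses scikit-learn's one-class SVM with hypothetical feature arrays.

from sklearn.svm import OneClassSVM

detector = OneClassSVM(kernel="rbf", nu=0.05).fit(class_feats)
is_anomaly = detector.predict(query_feats) == -1   # -1 marks outliers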

Pu et al. (2021) proposed a loss function with three terms to detect the anomalies and to maintain the classification accuracy for images belonging to the normal classes, incorporating the expected cross-entropy loss, the expected Kullback-Leibler (KL) divergence, and the Anchor loss. The model was tested on plankton image classes also containing bubbles and randomly suspended particles. Walker and Orenstein (2021) utilized a large background set of images that do not belong to the target classes (the classes to be recognized) and hard negative mining to find images that are more likely to cause false negatives. The training set was then complemented with these challenging images to improve the classifier's ability to recognize when images are from novel classes. While promising results were obtained on open-set plankton recognition, the method requires that a labeled background set is available, which limits its usability. Pastore et al. (2022) addressed the unseen classes using an anomaly detection method called TailDeTect. The method applies bootstrapping to estimate the mean and standard deviation of each image feature and utilizes this information to analyze whether an input sample is out of bounds for that particular feature. The sample is considered an anomaly if it is out of bounds for more than a predefined number of features. This process is applied separately for each known class, similarly to one-class classifiers.

Another approach to tackle the open-set problem is to utilize similarity metric learning. The aim of metric learning is to obtain image embedding vectors that model the similarity between images. It is commonly utilized in person (Ye et al. 2021) and animal re-identification (Nepovinnykh et al. 2020), as well as content-based image retrieval (Dubey 2021), but has also been successfully applied to plankton classification (Teigen et al. 2020; Badreldeen Bdawy Mohamed et al. 2022). A simple approach to implement a recognition method is to construct a gallery set of known species and use the learned similarity metric to compare query images to the gallery images. The similarity in this context corresponds to the likelihood that the images belong to the same class. This further allows defining a threshold value for the similarity, enabling open-set classification: if no similar images are found in the gallery set, the query image is predicted to belong to an unknown class. Furthermore, new classes can be added by simply including them in the gallery set, as the model does not necessarily need to learn class-specific image features.
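A minimal sketch of such gallery matching with a rejection threshold is given below; the embeddings are assumed to come from a metric-learned encoder, and the threshold value is a hypothetical tuning parameter.

import torch.nn.functional as F

def open_set_classify(query, gallery, labels, thresh=0.5):
    # query: (d,) embedding; gallery: (n, d) embeddings of known species
    sims = F.cosine_similarity(query.unsqueeze(0), gallery)  # (n,) similarities
    best = int(sims.argmax())
    if sims[best] < thresh:
        return None          # no similar gallery image -> unknown class
    return labels[best]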

The most common approaches for deep metric learning include triplet-based learning and classification-based metric learning. The first approach learns the metric by sampling image triplets with anchor, positive, and negative examples (Hoffer and Ailon 2015). The loss function is defined in such a way that the distances (similarities) from the embeddings of the anchors to the positive samples are minimized, and the distances from the anchors to the negative samples are maximized. The second approach approximates the classes using learned proxies (Movshovitz-Attias et al. 2017) or class centers (Deng et al. 2019) that provide the global information needed to learn the metric. This makes it possible to formulate the loss function based on the softmax loss and to avoid the challenging triplet mining step.

Teigen et al. (2020) studied the viability of few-shot learners in correctly classifying plankton images. A Siamese network was trained using the triplet loss and used to determine the class of a query image. Two scenarios were tested: multi-class classification and novel class detection. A model trained to distinguish between five classes of plankton using five reference images from each class was able to achieve reasonable accuracy. In the novel class detection, however, the model was able to filter out only 57 of the 500 unknown images.

Badreldeen Bdawy Mohamed et al. (2022) utilized the angular margin loss (ArcFace) (Deng et al. 2019) instead of the triplet loss to avoid the high cost of the triplet mining step. Furthermore, Generalised Mean pooling (GeM) (Radenović et al. 2018) was applied to aggregate the deep activations into rotation- and translation-invariant representations. ArcFace uses a similarity learning mechanism that makes it possible to solve distance metric learning as a classification task by introducing the angular margin loss. This allows straightforward training of the model and adds only negligible computational complexity. The metric learning-based method was shown to outperform a model utilizing an OpenMax (Bendale and Boult 2016) layer in open-set classification of plankton. One of the main benefits of the method is that it generalizes well to new classes added to the gallery set without retraining. This makes it straightforward to apply the model to new datasets with only partly overlapping plankton species compositions. A similar approach was proposed by Yang et al. (2022), who used a supervised contrastive (SupCon) loss instead of the ArcFace loss.

Plankton species compositions vary between locations and seasons; thus, it is common that a recognition model needs to be adapted to or retrained for a new situation at some point. Retraining a separate model for each situation is infeasible, and continual or online training of the model would be challenging for online monitoring applications. Therefore, an effective remedy is to treat the task as an open-set recognition problem, solve it with modern methods such as anomaly detection or metric learning, and take care of the model's capability to generalize to new data without the need to retrain the whole model.

5.6 Challenge 6: Label uncertainty

Plankton image label uncertainty is caused by the difficulty of manually recognizing species from low-quality images with limited resolution, by human error, and by the high costs preventing the repetition of the manual annotation by multiple experts. Culverhouse et al. (2003) identified four main reasons for the incorrect labeling of plankton images: (1) the limited short-term memory of humans, (2) fatigue, (3) recency effects, i.e., labeling is biased towards the most recently seen labels, and (4) positivity bias, i.e., labeling is biased by the expert's expectations of the content of the sample. Labels provided by sixteen human experts (marine ecologists and harmful algal bloom monitoring specialists) on microscopy images of dinoflagellates (6 classes) were analyzed. The results showed that only 67–83% self-consistency and 43% consensus between experts were obtained. Experts who were routinely labeling the selected classes were able to achieve 84–95% labeling accuracy. Culverhouse (2007) brought up several important points related to labeling algae. The presented performance figures do not represent the state of the art of automatic approaches, but improvements would be beneficial for both alternatives. Human expert judgements would benefit from peer review and inter-expert calibration to remove human bias. To improve the automatic solutions, the errors of both man and machine would require further attention. Global reference databases with validated samples and representative coverage of the morphological and physiological characteristics found in nature would be beneficial for training and evaluation purposes. In addition, Solow et al. (2001) noted that the taxonomic counts of classified individuals are biased when there are errors in classification, and proposed a straightforward method for correcting the bias based on the classification probabilities of the classifier.

Image filtering has been proposed to address label uncertainty in plankton image data. The idea is to discard images for which the recognition model is uncertain, and therefore more likely to produce erroneous labels. For example, Faillettaz et al. (2016) utilized a probabilistic RF for classification, and the obtained class probabilities were used to detect and ignore images for which the classifier is uncertain. Luo et al. (2018), Plonus et al. (2021a), and Kraft et al. (2022b) utilized a similar approach for CNN-based recognition models. Luo et al. (2018) used a separate fully annotated validation set to set class-specific probability thresholds for the filtering. Plonus et al. (2021a) proposed a pipeline that tailors the filtering thresholds to the research question of interest by allowing a choice between high precision and high recall. Kraft et al. (2022b) evaluated a CNN-based model with class-specific probability thresholds in operational use.
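The filtering step itself is simple, as the sketch below shows; the class-specific thresholds would be tuned on a separate validation set as in Luo et al. (2018).

import numpy as np

def filter_predictions(probs, thresholds):
    # probs: (n, n_classes) class probabilities; thresholds: (n_classes,)
    pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    keep = conf >= thresholds[pred]   # class-specific threshold per sample
    return pred, keep                 # samples with keep == False are discarded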

Schanz et al. (2023) proposed a novel loss function that measures the Kullback-Leibler divergence between the model’s output distribution over classes and the distribution of expert labels. This allows for training on multiple expert labels that can be conflicting, leading to a model that can estimate the label uncertainty.

Related to the label uncertainty, quantification methods have been proposed for plankton image data analysis. The basic idea is to estimate the class distribution directly. While mislabeled samples cause noise in the training data for classification methods, the class distributions are often close to correct. Sosik and Olson (2007) used a quantification method to estimate the abundance of different taxonomic groups of phytoplankton. A combination of image feature types was utilized, including size, shape, symmetry, and texture characteristics, as well as orientation-invariant moments, diffraction pattern sampling, and co-occurrence matrix statistics. Statistical analysis was used to estimate category-specific misclassification probabilities for accurate abundance estimates and for quantification of the uncertainties in these estimates. Beijbom et al. (2015) analyzed several quantification methods on a time-series dataset of plankton samples, including both unsupervised and supervised quantification. In unsupervised quantification, the dataset shift is assumed to be a pure class-distribution shift. Alternatively, the dataset shift is assumed to be 'small' and the unlabeled set of target samples is used to align the internal feature representation of a machine learning algorithm. In supervised quantification, no explicit assumptions are made on the dataset shift, but it is assumed that a small number of labeled samples are available in the target domain. González et al. (2017) proposed a methodology to assess the efficacy of learned models that takes into account the fact that the data distribution (the plankton composition of the sample) might vary between the training phase and the testing phase. Their approach used validation-by-sample: the sample, instead of the individuals, is used as the basic unit to predict the abundance of the different plankton groups. Thus, model assessment processes require groups of samples with sufficient variability to provide precise error estimates. González et al. (2019) proposed a transfer learning approach in which deep image features are used as input for a quantification algorithm to estimate the distribution of each class in an unknown water sample. Orenstein et al. (2020a) proposed a semi-automatic pipeline where a small subset of images is manually labeled to estimate the dataset shift, and this information is used to correct the quantification estimate.

Supervised machine learning, and particularly the performance evaluation of a recognition model, relies on the correctness of the class labels. However, visual recognition of a number of plankton species from low-quality images is difficult, and using expert panels becomes practically infeasible if the aim is to produce large datasets. The proposed remedies include excluding images with high label uncertainty and, when individual-level recognition is not required, focusing on the actual quantity of interest such as the class distribution. Alternative ways to address this challenge would be to focus on few-shot learning with ground truth validated by an expert panel and to pay special attention to model generalisability, or to use generative models.

5.7 Challenge 7: Large image size variation

Most plankton datasets have extreme variation in image size. Fig. 7 shows example images obtained using an Imaging FlowCytobot (IFCB). Typical CNN-based image classifiers require the input image to have a predefined size. Therefore, image resizing has been used as a pre-processing step for datasets with varying image heights and widths (e.g. Dai et al. 2016b; Kuang 2015).

Fig. 7 Plankton images with different sizes and aspect ratios (Bureš et al. 2021)

On a general level, the resizing can be done in two ways: by forgoing the aspect ratio (e.g. Al-Barazanchi et al. 2015a; Sánchez et al. 2019b) or by maintaining it (e.g. Dai et al. 2016a; Correa et al. 2017; González et al. 2019). In the first approach, stretching is needed for images whose aspect ratio does not match the target aspect ratio. This changes the shape of the objects in the image, which may affect the feature extraction or learning. In the second approach, images are typically resized based on the length of their longest side and padded with a single color to reach the target size. Eerola et al. (2020) evaluated various ways to implement the padding; padding with the mode of the image (the most common color, typically corresponding to the background color) produced the best results on IFCB data. Both approaches (forgoing and maintaining the aspect ratio) have been utilized in plankton recognition, but few comparisons between them exist. Dai et al. (2016a) tested various resizing methods on zooplankton images, and the best accuracy was obtained by maintaining the aspect ratio while scaling. On the other hand, Jindal and Mundra (2015) found little to no difference in performance between the approaches, despite the images appearing distorted when the aspect ratio was forgone.
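A sketch of aspect-ratio-preserving resizing with mode padding is given below; it assumes grayscale PIL images and a hypothetical target size.

import numpy as np
from PIL import Image

def resize_and_pad(img, size=224):
    # Scale so the longest side equals `size`, preserving the aspect ratio.
    scale = size / max(img.size)
    img = img.resize((max(1, round(img.width * scale)),
                      max(1, round(img.height * scale))))
    arr = np.asarray(img)
    vals, counts = np.unique(arr, return_counts=True)
    fill = int(vals[counts.argmax()])   # mode ~ typical background gray level
    canvas = Image.new("L", (size, size), fill)
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    return canvas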

Various other ways to obtain fixed-size images have been proposed. In the method proposed by Ho et al. (2018), a fixed input image size was chosen and the images were either cropped or padded with zeros to adjust them to the correct size. Schröder et al. (2018) proposed to crop the images to their tight bounding box and pad them to a square with a minimum edge length of 128 pixels; images larger than 128 pixels were shrunk to this size. Ellen et al. (2019) resized images larger than the target size, thus losing some detail, while images smaller than the target size were adjusted by padding, so the object size remained the same. Lumini and Nanni (2019a, 2019b); Lumini et al. (2020) compared two strategies: (1) resizing all images to a common size, and (2) resizing only images that were larger than the input size and using padding for the smaller ones. The results showed that the first method produced a better classification result for most of the datasets and models.

All methods that produce fixed-size images from original plankton images with a large size variation result in some degree of information loss or image distortion: information on the size of the plankton is lost during resizing, small details disappear if images are heavily downscaled, and only part of the object is seen if cropping is used. Ellen et al. (2019) partially solved this problem by providing the size information as metadata (additional features) for the classifier while still using resized versions of the images as the main input for the CNN. The metadata are used as a network input alongside the image data, and the two are processed independently by separate parts of the network. The outputs of both subnetworks are concatenated and processed by fully connected layers. The results showed that the metadata improved the classification accuracy.

To truly solve the problem of varying image size and aspect ratio, the CNN architecture needs to be modified so that it can process images of multiple sizes. This can be achieved, e.g., by combining scale-invariant and scale-variant features to devise a multi-scale CNN architecture (Van Noord and Postma 2017). Py et al. (2016) proposed an inception module that makes it possible to use multiple scaled versions of the original image as the input for the CNN. By selecting a different stride for each scale, the computed feature maps have the same size for all scales and can be concatenated into a single set of multi-scale features. The proposed method was shown to outperform the method with a single fixed-size input.

Bureš et al. (2021) compared various modifications of a baseline CNN for plankton recognition with high variation in image size. These include Spatial Pyramid Pooling (SPP) (He et al. 2015), using the image size as metadata, patch cropping, and multi-stream CNNs. SPP allows the training of a single CNN with multiple image sizes in order to obtain higher scale invariance by pooling the features produced by the convolutional layers into the fixed-length vector required by the fully connected layers. The metadata was used as described by Ellen et al. (2019). The patch cropping technique divides images into fixed-size patches that are classified separately; the final recognition is done by averaging the resulting score vectors. The multi-stream CNN utilizes a similar approach but uses multiple networks trained for different image sizes and aspect ratios. The best plankton recognition accuracy was obtained using a multi-stream network combining two models with different input aspect ratios and patch cropping.
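The SPP idea can be sketched in a few lines with adaptive pooling: the convolutional feature map is max-pooled on grids of several fixed sizes and the results are concatenated into a vector whose length no longer depends on the input resolution (the grid levels below are illustrative).

import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feat, levels=(1, 2, 4)):
    # feat: (N, C, H, W) feature map of arbitrary spatial size
    pooled = [F.adaptive_max_pool2d(feat, l).flatten(1) for l in levels]
    return torch.cat(pooled, dim=1)   # fixed length: C * (1 + 4 + 16)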

Most plankton datasets have significant variation in image sizes and aspect ratios, while common CNN-based image classifiers require input images of a constant size. When image resizing is used, it is necessary to consider how to treat the aspect ratio and whether metadata about the original image size provides an advantage when complementing the fixed-size images. A more general remedy, however, is to use a multi-scale CNN with an appropriate architecture as the recognition model.

5.8 Challenge 8: Low or varying image quality

To improve the classification accuracy on low-quality images, various preprocessing steps have been proposed. These include discarding bad-quality images (Raitoharju et al. 2016), image segmentation (Keçeli et al. 2017), and denoising (Cheng et al. 2019).

Low-quality images can be discarded in different ways. Raitoharju et al. (2016) manually removed low-quality images from the dataset before training the recognition model; moreover, the remaining images were cropped to remove artifacts mainly appearing close to the image borders. Coltelli et al. (2014) filtered out out-of-focus images before the feature extraction. The out-of-focus detection was done by fitting a GMM to the color histograms: if the distribution contained two components (background and plankton), the image was considered to be in focus.

Some studies suggest segmenting the images as a preprocessing step to discard non-plankton pixels. For example, Keçeli et al. (2017) used Otsu's thresholding method (Otsu 1975) for segmentation, and pixels outside the obtained segmentation map were set to zero.
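A sketch of this kind of background suppression is given below, assuming a 2-D grayscale array with dark plankton on a brighter background; for bright objects on a dark background the comparison would be reversed.

import numpy as np
from skimage.filters import threshold_otsu

def suppress_background(img):
    t = threshold_otsu(img)            # global Otsu threshold
    return np.where(img < t, img, 0)   # keep object pixels, zero the rest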

Cheng et al. (2019) applied texture enhancement together with background suppression before the classification step. The enhanced images were shown to produce a slightly higher recognition accuracy than the images without enhancement. Ma et al. (2021) proposed to use modern CNN-based super-resolution techniques to improve the plankton image quality. The EDRN super-resolution architecture (Lim et al. 2017) was combined with the contextual loss (Mechrez et al. 2018) and was shown to produce high-quality images. Guo et al. (2022a) proposed a deep learning-based colorization method to address the loss of critical color information due to imaging. However, the effect of the improved image quality on plankton recognition accuracy was not assessed in either of these studies. Contrast limited adaptive histogram equalization (CLAHE) has also been proposed to improve the contrast of plankton image data (Geronimo et al. 2023). Lang et al. (2022) addressed the image quality issues of holographic imaging via image fusion.

Many real-world computer vision applications have to deal with low-quality images, and plankton recognition is no exception. A wealth of image preprocessing approaches exists, and in the case of plankton images, at least the exclusion of bad images, denoising, and image segmentation have been proposed. A more profound way would be to adopt image reconstruction methods, but from the practical perspective of plankton recognition, the simpler methods can be considered sufficient, and data augmentation is commonly used to introduce additional variation to the data.

5.9 Challenge 9: Massive amount of data

While most of the challenges are connected to training and developing plankton recognition models, modern imaging devices with high output rates also introduce a challenge for the model deployment phase. The massive data volumes obtained by modern imaging instruments motivate the development of computationally efficient solutions that are able to analyse data in real time. However, the computation time is rarely considered in the plankton recognition literature. Most works related to this challenge consider lightweight CNN architectures. For example, the shallow TANet (Li et al. 2019c) was shown to outperform competing methods in computing time without sacrificing accuracy on the Kaggle dataset.

Zimmerman et al. (2020) proposed an embedded system for the in situ deployment of a plankton microscope with a real-time recognition system. Due to the limited computation resources and computation time limitations, CNN-based recognition methods were considered unsuitable, and a faster feature-engineering-based approach with reduced recognition accuracy was proposed. Yuan et al. (2023) applied edge computing with an AI chip to establish real-time on-site analysis of IFCB data.

The computation time is an especially big issue in holographic imaging, which traditionally relies on computationally heavy reconstruction operations to process the raw data. To address this, end-to-end CNN methods for plankton recognition that take the raw holographic data as input have been proposed (Guo et al. 2021a; Zhang et al. 2021; Barua et al. 2023), allowing the reconstruction step to be completely avoided. Guo et al. (2021a) and Zhang et al. (2021) showed that CNNs are able to learn the image features for plankton recognition from the raw data, speeding up the processing significantly.

Online monitoring of plankton with modern imaging equipment produces huge amounts of images. The related image analysis requires either high-performance computing (HPC) resources in the cloud or local (edge) computing with shallow CNN architectures. In most cases, the recognition model training has to be performed in an HPC environment, after which at least the lightweight models can be deployed for local execution.

6 Summary and future directions

In this paper, a comprehensive survey of challenges and existing solutions for automatic plankton recognition was provided. We identified nine challenges that complicate the introduction of automatic plankton recognition methods to operational use: (1) the limited amount of labeled training data for less common species, (2) large class imbalance, (3) fine-grained nature of the recognition task, (4) domain shift between imaging instruments, (5) presence of previously unseen classes and unknown particles, (6) uncertainty in expert labels, (7) large variation in image size, (8) low or varying image quality, and (9) massive data volumes. While most of the considered challenges are common in a wide variety of machine learning applications, plankton recognition has its specific characteristics including highly imbalanced image datasets, extreme variation in image size, limitations in image quality, and a shortage of qualified experts to visually annotate the images.

Figure 8 shows a flowchart summarizing the challenges and the approaches to solve them. Given a new plankton image dataset, the flowchart provides a simple pipeline for identifying the problems related to the dataset as a series of yes-no questions. Furthermore, references to the sections of this paper with detailed descriptions point to existing techniques for tackling the problems and automating the analysis of the dataset.

Fig. 8: Summary flowchart of challenges and solutions

Some of the challenges, especially the limited amount of labeled training data, have been studied rather extensively. While this problem cannot be considered solved, relatively high classification accuracies have been obtained with limited amounts of training images for certain classes. On the other hand, some of the other challenges have not been widely considered in the plankton recognition literature. These include the domain shift between different image sets, the presence of previously unseen classes and unknown particles, the uncertainty in expert labels, and the massive data volumes. The reasons for this vary. Most of the research has focused on improving classification accuracy, and computation time has not been seen as an issue. Furthermore, the majority of the method development has been done for a fixed set of species and a single imaging instrument; thus, there has been no need to address the domain shift or open-set problems.

The large variation in the size and appearance of plankton has a notable effect on how challenging the recognition task is, depending on the type of plankton considered. While the type of plankton should be taken into account when designing handcrafted image features for the recognition task, modern feature learning approaches (CNN and ViT) are more general and can often be applied without the need for customized solutions for different plankton types. A notable exception is species that are taxonomically close to each other and/or very small, for which fine-grained recognition techniques are needed. The larger size groups are somewhat overrepresented in plankton recognition studies, but the existing literature covers a wide variety of different size groups and plankton types. Table 6 in Appendix A summarizes the prevalence of different plankton types and size groups considered in the plankton recognition literature.

One notable problem in plankton recognition is the lack of publicly available general-purpose plankton image datasets with an evaluation protocol that makes it possible to compare different plankton recognition methods in a fair and reliable manner. The vast majority of the research has either focused on private in-house datasets or relied on custom evaluation protocols and dataset splits for publicly available datasets. This makes it impossible to compare accuracies between studies and challenging to select the best practices for future research, which slows down progress in plankton recognition method development. Therefore, there is a need for a publicly available plankton dataset with a predetermined evaluation protocol and, preferably, multiple subsets captured with different imaging instruments to allow quantitative evaluation of advances in general (device-agnostic) plankton recognition.

Another important problem limiting the wider utilization of automatic plankton recognition is the difficulty of collecting training images that exhaust all possible classes. It is not realistic to construct a labeled training set covering all the plankton species and non-plankton particles that an imaging instrument is capable of capturing in a given location. Moreover, the varying plankton species composition between geographical regions and ecosystems limits the possibility of applying traditional recognition models to new locations and datasets. Even a classification model developed and trained for one imaging instrument and one geographic location struggles if new species appear, for example, due to seasonal changes. The remedy for this is open-set recognition together with new class discovery methods. Open-set models are able to identify when images belong to previously unseen classes and either reject them or process them further by, for example, clustering. Such techniques have the potential to enable robust open-world plankton recognition systems. Open-set recognition is an active research topic in machine learning [see, for example, Geng et al. (2020)].
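A simple open-set baseline, far simpler than the methods surveyed by Geng et al. (2020) but useful to illustrate the idea, is to threshold the classifier's maximum softmax probability and set rejected images aside for clustering; the threshold value below is an illustrative assumption that would be tuned on validation data:

    # Sketch: open-set rejection by thresholding maximum softmax probability,
    # with rejected samples set aside for clustering / new class discovery.
    import torch
    import torch.nn.functional as F

    def predict_open_set(model, images, threshold=0.9):
        with torch.no_grad():
            probs = F.softmax(model(images), dim=1)
        conf, pred = probs.max(dim=1)
        pred[conf < threshold] = -1   # -1 marks "unknown": candidate new class or non-plankton
        return pred

    # Images labeled -1 can later be grouped with, e.g., k-means on learned features
    # and shown to an expert for naming.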

The massive volumes of unlabeled data produced by modern imaging instruments motivate the use of semi-supervised learning techniques to tackle the challenges related to limited labeled training data. One way to achieve this is to utilize unsupervised and self-supervised learning to pre-train image features on unlabeled data. In self-supervised learning, the data itself is used to generate the supervisory signal that guides the training. Typically, this is done by generating augmented versions of the images to obtain image pairs that share the same label. Image features learned this way can then be fine-tuned for the target dataset with a small amount of labeled training data using transfer learning.
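A minimal sketch of this idea in a SimCLR-style contrastive setup is given below; the temperature value and the surrounding training code are illustrative assumptions:

    # Sketch of SimCLR-style self-supervised pre-training: two augmented views of
    # the same unlabeled plankton image are pulled together in feature space.
    import torch
    import torch.nn.functional as F

    def nt_xent_loss(z1, z2, temperature=0.5):
        """Contrastive loss over a batch of paired embeddings z1[i] <-> z2[i]."""
        z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d)
        sim = z @ z.t() / temperature                        # cosine similarities
        n = z1.size(0)
        mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
        sim.masked_fill_(mask, float("-inf"))                # exclude self-similarity
        # the positive pair of sample i sits at index (i + n) mod 2N
        targets = (torch.arange(2 * n, device=z.device) + n) % (2 * n)
        return F.cross_entropy(sim, targets)

    # z1 = encoder(augment(batch)); z2 = encoder(augment(batch))
    # loss = nt_xent_loss(z1, z2); the encoder is then fine-tuned on labeled data.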

The large variation between plankton image datasets with different species compositions and imaging instruments can be considered not only a challenge but also an opportunity. While it is very difficult to develop one general-purpose algorithm for imaging-instrument-agnostic plankton recognition, modern domain adaptation methods have the potential to enable the joint utilization of different datasets. This would allow adapting a classification model to new datasets with a reasonable amount of manual work. Domain adaptation has already been successfully applied to various other machine learning applications, such as general object recognition (Wilson and Cook 2020). Domain adaptation, loosely inspired by the way human vision generalizes across viewing conditions, can be considered a special case of transfer learning in which a model trained on one or more source domains is applied to a different (but related) target domain. It can be utilized to reduce the effect of a large domain shift between datasets and to alleviate the lack of labeled training data.
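As one concrete example, named here for illustration rather than drawn from the cited survey, the gradient reversal layer used in domain-adversarial training (DANN) encourages features that a domain classifier cannot use to tell imaging instruments apart:

    # Sketch of the gradient reversal layer used in adversarial domain adaptation:
    # features are trained to fool a domain classifier, aligning, e.g., two
    # imaging instruments. Loss weighting is illustrative.
    import torch

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None   # flip gradients flowing into the encoder

    def grad_reverse(x, lambd=1.0):
        return GradReverse.apply(x, lambd)

    # features = encoder(images_from_both_instruments)
    # domain_logits = domain_classifier(grad_reverse(features))
    # total_loss = class_loss_on_source + domain_loss
    # -> the encoder learns instrument-invariant features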

The relatively large pool of different plankton image datasets motivates the further use of domain generalization and meta-learning to obtain an imaging-instrument-agnostic recognition model. In meta-learning, multiple datasets and tasks are used to "learn how to learn" the recognition model. The idea is to automate the creation of the entire machine learning pipeline end-to-end, including the search for the model architecture and hyperparameters, and learning the model weights. Domain generalization refers to learning domain-independent (in this case, imaging-instrument-independent) feature representations that can then be applied to any dataset. Domain generalization has a wide variety of applications and has become an increasingly studied problem in machine learning [see the recent survey by Wang et al. (2022a)]. Recent progress in such methods has opened novel possibilities for working towards a universal plankton recognition system that can adapt to different environments, with dramatically different plankton populations and varying imaging instruments, promoting the wider utilization of automatic plankton recognition in aquatic research.
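A common way to quantify such instrument-agnostic generalization is a leave-one-instrument-out protocol; the sketch below uses hypothetical dataset variables and helper functions (train_recognition_model, evaluate) purely for illustration:

    # Sketch of a leave-one-instrument-out protocol for domain generalization:
    # train on all but one imaging instrument, test on the held-out one.
    # Dataset variables and the train/evaluate helpers are hypothetical placeholders.
    from torch.utils.data import ConcatDataset

    datasets = {"IFCB": ifcb_data, "ZooScan": zooscan_data, "FlowCam": flowcam_data}

    for held_out, test_data in datasets.items():
        train_data = ConcatDataset([d for name, d in datasets.items() if name != held_out])
        model = train_recognition_model(train_data)   # hypothetical training routine
        accuracy = evaluate(model, test_data)         # hypothetical evaluation routine
        print(f"held-out instrument: {held_out}, accuracy: {accuracy:.3f}")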