Deep neural networks approach to microbial colony detection -- a comparative analysis

Counting microbial colonies is a fundamental task in microbiology and has applications in numerous branches of industry. Despite this, current studies on automatic microbial counting using artificial intelligence are hardly comparable, because they lack a unified methodology and large datasets are scarce. The recently introduced AGAR dataset addresses the second need, but the research carried out on it is still not exhaustive. To tackle this problem, we compared the performance of three well-known deep learning approaches to object detection on the AGAR dataset, namely two-stage, one-stage, and transformer-based neural networks. The achieved results may serve as a benchmark for future experiments.


Introduction
The ability to automatically and accurately detect, localize, and classify bacterial and fungal colonies grown on solid agar is of wide interest in microbiology, biochemistry, the food industry, and medicine. An accurate and fast procedure for determining the number and type of microbial colonies grown on a Petri dish is crucial for economic reasons: industrial testing often relies on the proper determination of colony forming units (CFUs). Conventionally, samples are analyzed by trained professionals, even though this is a time-consuming and error-prone process. To avoid these issues, an automated methodology based on artificial intelligence can be applied.
A common way of counting and classifying objects using deep learning (DL) is to first detect them and then count the found instances, distinguishing between different classes. We compared the results of microbial colony counting using selected detectors belonging to different classes of neural network architectures, namely two-stage, one-stage, and transformer-based models. Our experiments were conducted on the higher-resolution subset of the Annotated Germs for Automated Recognition (AGAR) dataset [11], which consists of around 7k annotated Petri dish photos with five microbial classes. This paper focuses on setting benchmarks for the detection and counting tasks using state-of-the-art (SoTA) models.
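The detect-then-count idea can be illustrated with a minimal sketch. It is not the pipeline used in the experiments: the layout of the detection tuples, the helper name `count_colonies`, and the 0.5 score threshold are assumptions introduced here for illustration only.

```python
from collections import Counter


def count_colonies(detections, score_threshold=0.5):
    """Count detections per class, keeping those that reach the score threshold.

    `detections` is an iterable of (class_name, score, box) tuples.
    This layout is hypothetical; real detectors return their own structures.
    """
    return Counter(
        class_name
        for class_name, score, _box in detections
        if score >= score_threshold
    )


# Toy detector output for a single image (values invented for the example).
detections = [
    ("E. coli", 0.91, (10, 12, 40, 44)),
    ("E. coli", 0.30, (100, 80, 120, 101)),  # below the threshold, ignored
    ("S. aureus", 0.77, (200, 210, 230, 242)),
]
counts = count_colonies(detections)
print(dict(counts))
```

The per-class counts obtained this way are what a counting metric would compare against the number of annotated colonies.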

Object detection
Object detection has been approached using two major types of DL-based architectures, namely two-stage and one-stage models. Two-stage detectors find class-agnostic object proposals in the first stage, and in the second stage the proposed regions are assigned to the most likely class. They are characterised by high localization and classification accuracy. The largest group of two-stage models are Region Based Convolutional Neural Networks (R-CNN) [6], whose main idea is based on extracting region proposals from the image. Over the years, the networks from this family have undergone many modifications. In the Faster R-CNN [16] architecture, a Region Proposal Network (RPN) was used instead of the Selective Search algorithm, which significantly reduces the model's inference time. In order to reduce issues with over-fitting during training, Cascade R-CNN [2] was introduced as a multi-stage object detector consisting of multiple connected detectors that are trained with increasing intersection over union (IoU) thresholds. A year later, the authors of Libra R-CNN [12] focused on balancing the training process by means of IoU-balanced sampling, a balanced feature pyramid, and a balanced L1 loss. In the following years, researchers used Faster R-CNN while replacing its backbone (the CNN used as a feature extractor) with newer architectures. The most recent concept is Composite Backbone Network V2 [9] (CBNetV2), which groups multiple pre-trained backbones of the same kind for more efficient training.
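Since IoU thresholds play a central role in cascade-style detectors, a short sketch of the IoU computation may be helpful. The box format (x1, y1, x2, y2) and the threshold values 0.5, 0.6 and 0.7 are illustrative assumptions, not settings taken from our experiments.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
proposal = (2, 0, 12, 10)
overlap = iou(ground_truth, proposal)

# Successively stricter thresholds, as used conceptually by cascaded stages.
stage_thresholds = (0.5, 0.6, 0.7)
passed = [t for t in stage_thresholds if overlap >= t]
print(round(overlap, 3), passed)
```

A proposal that is accepted as a positive sample at a low threshold can be rejected by a later, stricter stage, which is the intuition behind training the stages with increasing thresholds.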
On the other hand, one-stage architectures are designed to directly predict classes and bounding box locations. For this reason, one-stage detectors are faster, but usually achieve somewhat lower accuracy. Single-stage detection was popularized in DL mainly by the You Only Look Once models (YOLO v1 [13], 9000 [14], v3 [15], v4 [1]), primarily developed by Joseph Redmon. Recently, the authors of YOLOv4 [1] enhanced the performance of the YOLOv3 architecture using methods such as data augmentation, self-adversarial training, and class label smoothing, all of which improve detection results without degrading the inference speed. Moreover, the authors of EfficientDet [17] introduced changes that increase both the accuracy and the speed of object detection. The main proposed changes include using a weighted Bidirectional Feature Pyramid Network (BiFPN), compound scaling, and replacing the backbone network with EfficientNet [18]. Using EfficientNet as a backbone fits the idea of compound scaling, which allows the model to be scaled to different sizes and yields a family of object detectors for specific uses.
Additionally, transformers have recently emerged as a next generation of neural networks for computer vision applications, including object detection [20,8,5]. The interest in replacing CNNs with transformers is mainly due to their efficient memory usage and excellent scalability to very large capacity networks and huge datasets. The parallelization of transformer processing is achieved by using an attention mechanism applied to images split into patches, which are treated as tokens. The use of the transformer architecture to predict objects and their positions in an image was first proposed in the DEtection TRansformer (DETR) network [3]. The architecture uses a CNN backbone to learn feature maps, which are then fed to transformer layers. In comparison to DETR, the Deformable DETR [20] network replaces self-attention in the encoder and cross-attention in the decoder with multi-scale deformable self-attention and deformable cross-attention, respectively. Deformable attention modules attend only to a small set of key sampling points around a reference point, which considerably speeds up the training process. The recently introduced Cross-Covariance Image Transformers (XCiT) [5] are a new family of transformer models for image processing. The idea is to use a transformer-based neural network as a backbone for two-stage object detection networks. XCiT splits images into fixed-size patches and reduces them to tokens with a greater number of features using a few convolutional layers with Gaussian Error Linear Units (GELU) [7] in between. The idea behind the model is to replace self-attention with transposed attention, which operates over feature channels instead of tokens.
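The difference between token self-attention and transposed (cross-covariance) attention can be illustrated with a simplified NumPy sketch. It is single-head, omits the learned projections and all other layers of XCiT, and uses a fixed temperature `tau`; these simplifications and all function names are assumptions made for illustration, not a reproduction of the XCiT implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def token_self_attention(q, k, v):
    """Standard attention: the attention map has shape (N, N) for N tokens."""
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)
    return attn @ v, attn


def cross_covariance_attention(q, k, v, tau=1.0):
    """Transposed attention: the attention map has shape (d, d) for d channels."""
    # L2-normalise every feature channel (column) across the token dimension.
    qn = q / np.linalg.norm(q, axis=0, keepdims=True)
    kn = k / np.linalg.norm(k, axis=0, keepdims=True)
    attn = softmax(qn.T @ kn / tau, axis=-1)  # channel-to-channel weights
    out = (attn @ v.T).T                      # mix channels, keep (N, d) layout
    return out, attn


rng = np.random.default_rng(0)
num_tokens, num_channels = 196, 32  # e.g. a 14 x 14 grid of patch tokens
q, k, v = (rng.normal(size=(num_tokens, num_channels)) for _ in range(3))

tok_out, tok_map = token_self_attention(q, k, v)
xc_out, xc_map = cross_covariance_attention(q, k, v)
print(tok_map.shape, xc_map.shape)
```

The size of the token attention map grows quadratically with the number of tokens, whereas the cross-covariance map depends only on the number of channels, which is why the transposed variant scales more gently with image resolution.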

AGAR dataset
The AGAR dataset [11] contains images of microbial colonies on Petri dishes taken in two different environments, which produced higher-resolution and lower-resolution images. The environments differ in lighting conditions and apparatus. Higher-resolution images, which were used in our studies, can be divided into bright, dark, and vague subgroups. With respect to the number of colonies, samples can be classified as empty, countable, or uncountable. The dataset includes five classes, namely E. coli, C. albicans, P. aeruginosa, S. aureus, and B. subtilis. Annotations are stored in JSON format and contain information about the number and type of microbes, the environment, and the coordinates of the bounding boxes.
In this paper, we present the results of experiments performed using a subset of the AGAR dataset, which consists of 6990 images in total. We chose only higher-resolution samples (mainly 4000 × 4000 px) from the dark and bright subgroups, excluding the vague one, with a countable number of colonies. First, the images were split into train and validation subsets (the same for each experiment), and then divided into 512 × 512 px patches as described in [11]. Finally, in the test stage, whole Petri dish images from the validation subset were used (for a detailed description of the procedure, see the Supplementary materials of [11]).
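Dividing an image into patches can be sketched as follows. This is only a simplified illustration: it assumes a non-overlapping grid with zero padding at the borders and does not handle bounding box annotations, whereas the actual procedure is the one described in [11] and may differ in these respects. The helper name `split_into_patches` is hypothetical.

```python
import numpy as np


def split_into_patches(image, patch_size=512):
    """Tile an H x W x C image into patch_size x patch_size patches.

    Returns a list of ((top, left), patch) pairs. Border patches are
    zero-padded on the bottom and right so that all patches share one shape.
    """
    height, width = image.shape[:2]
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = image[top:top + patch_size, left:left + patch_size]
            pad_h = patch_size - patch.shape[0]
            pad_w = patch_size - patch.shape[1]
            if pad_h or pad_w:
                patch = np.pad(patch, ((0, pad_h), (0, pad_w), (0, 0)))
            patches.append(((top, left), patch))
    return patches


# Small synthetic image standing in for a Petri dish photo.
image = np.zeros((1000, 1100, 3), dtype=np.uint8)
patches = split_into_patches(image)
print(len(patches))
```

The stored (top, left) offsets make it possible to map patch-level predictions back to the coordinates of the whole image.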

Benchmarking methodology
We compared the selected models using several criteria: architecture type and size, inference time, and detection and counting accuracy.
During the time measurements, inference was executed on a GeForce GTX 1080 Ti GPU using the same patch with 6 ground truth instances. Each model was first loaded into memory and then run 100 times sequentially. The first 20 runs were discarded as warm-up, and the remaining runs were used to calculate the mean inference time and its standard deviation for each model separately.
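The timing protocol can be sketched in a framework-agnostic way. The `infer` callable and the dummy workload below are placeholders, not the actual models; on a GPU the callable would additionally have to wait for the device to finish (for example via `torch.cuda.synchronize()` in PyTorch) so that the wall-clock time covers the whole computation.

```python
import statistics
import time


def benchmark(infer, sample, runs=100, warmup=20):
    """Run `infer(sample)` `runs` times and time every run after the warm-up.

    Returns (mean, standard deviation) of the measured times in milliseconds.
    """
    durations_ms = []
    for run_index in range(runs):
        start = time.perf_counter()
        infer(sample)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if run_index >= warmup:  # the first `warmup` runs are not recorded
            durations_ms.append(elapsed_ms)
    return statistics.mean(durations_ms), statistics.stdev(durations_ms)


call_log = []


def dummy_infer(patch):
    """Stand-in for a detector forward pass (hypothetical workload)."""
    call_log.append(1)
    return sum(patch)


mean_ms, std_ms = benchmark(dummy_infer, list(range(10_000)))
print(f"{mean_ms:.3f} ms +/- {std_ms:.3f} ms")
```

With the default arguments, 100 calls are made and 80 of them contribute to the reported statistics.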
The detector performance was evaluated in two ways: by measuring the effectiveness of detection and of counting. As an evaluation metric for colony detection, we rely on the mean Average Precision (mAP), more precisely mAP@.5:.95, averaged over all 5 classes. The accuracy of colony counting was measured using the Mean Absolute Error (MAE) and the Symmetric Mean Absolute Percentage Error (sMAPE).
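For reference, the two counting metrics can be computed as in the sketch below. Several sMAPE variants exist in the literature; the one used here, bounded between 0 and 100% and with an empty-versus-empty comparison counted as zero error, is an assumption and may differ in normalization from the definition used in [11].

```python
def mae(true_counts, pred_counts):
    """Mean absolute error between true and predicted colony counts."""
    pairs = list(zip(true_counts, pred_counts))
    return sum(abs(pred - true) for true, pred in pairs) / len(pairs)


def smape(true_counts, pred_counts):
    """Symmetric mean absolute percentage error in percent (range 0 to 100).

    Assumed variant: 100 / n * sum(|pred - true| / (|pred| + |true|)),
    where a term with pred == true == 0 contributes zero.
    """
    pairs = list(zip(true_counts, pred_counts))
    total = 0.0
    for true, pred in pairs:
        denominator = abs(pred) + abs(true)
        if denominator > 0:
            total += abs(pred - true) / denominator
    return 100.0 * total / len(pairs)


# Invented per-image counts, used only to exercise the functions.
true_counts = [10, 20, 0]
pred_counts = [12, 18, 0]
print(mae(true_counts, pred_counts), smape(true_counts, pred_counts))
```

MAE weights every miscounted colony equally, while sMAPE relates the error to the number of colonies on the dish, so the two metrics complement each other for sparsely and densely populated samples.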
With the growing popularity of DL, many open source software libraries implementing SoTA object detection algorithms have emerged. The results for Faster R-CNN and Cascade R-CNN were taken from [11] for comparison purposes. In our own experiments, we relied on the MMDetection [4] framework (Libra R-CNN, CBNetV2, Deformable DETR, XCiT), Alexey Bochkovskiy's Darknet-based implementation of YOLOv4 [1], and Ross Wightman's PyTorch [19] reimplementation of the official TensorFlow implementation of EfficientDet. To train the models, we used the default parameters provided for the COCO dataset in the above-mentioned implementations. In the case of YOLOv4, we changed the input size to 512 × 512 px in order to match the size of the generated patches. We used pretrained backbones in all experiments. Traditional two-stage and one-stage networks were trained with the Stochastic Gradient Descent (SGD) optimizer, as opposed to the transformer-based architectures, for which the AdamW [10] optimizer was used. The initial learning rate ranged between 10⁻⁵ and 10⁻³, depending on the model. All networks were trained until the loss values on the validation subset saturated. We also used the augmentation strategies commonly applied to the selected models, such as flips, crops, and resizes of images.

Results
The mean average precision values presented in Table 1 are averaged over all microbial classes. The calculated mAP@.5:.95 varies between 0.491 and 0.529. The best results in terms of both accuracy and inference speed were achieved by the YOLOv4 architecture, whereas the transformer-based architectures perform slightly worse. Some interesting cases are presented in Fig. 1. The selected image shows a single microbial species (P. aeruginosa) that forms colonies of two different sizes due to agar inhomogeneities, which makes detection even more challenging. The labeled small contamination is missed by some models (the transformer-based ones and EfficientDet-D2), and some models (YOLOv4, Deformable DETR) also have problems with the precise localization of blurred colonies. Two-stage detectors tend to produce excessive predictions.

Table 1. Benchmarks for the tested models on the higher-resolution subset of the AGAR dataset. The model size is given as the number of parameters (in millions). For the XCiT model, the number of backbone parameters is given in brackets.

The performance of the selected architectures for microbial counting is presented in both Table 2 and Fig. 2, while Table 3 shows the results for all five microbial species separately. In general, all detectors perform better for microbes that form clearly visible, separate colonies. The biggest problem with locating individual colonies was observed for P. aeruginosa, where the tendency for aggregation and overlapping is the greatest. Overall, the best results were obtained for the YOLOv4 model, for which the predicted count of microbial colonies is the closest to the ground truth in the range from 1 to 50 instances (see Fig. 2), which is the most relevant range for industrial applications. The worst performance was observed for the EfficientDet-D2 model, which omitted small instances of microbial colonies (they were not localized at all). This may be caused by resizing patches to fit the input layer size. Very low contrast between the agar substrate and the colony (the bright subset of the AGAR dataset) is an additional problem here.

Conclusions
In the conducted studies, we analyzed eight SoTA deep architectures in terms of model type, size, average inference time, and the accuracy of detecting and counting microbial colonies in images of Petri dishes. A detailed comparison was performed on the AGAR dataset [11].
The presented results do not differ much between the different types of architectures. It is worth noting that we chose relatively small, typical backbones for this comparison in order to create a baseline benchmark for different types of detectors. The most accurate (mAP = 0.529) and the fastest (17 ms) model turned out to be the one-stage YOLOv4 network, which makes it an excellent choice for industrial applications. The various two-stage architectures achieved moderate performance, while the transformer-based architectures gave the worst results. EfficientDet-D2 turned out to be the smallest model in terms of the number of parameters.
Our experiments once again confirm the great ability of DL-based approaches to detect microbial colonies grown in Petri dishes from RGB images. The biggest challenge here is the need to collect large amounts of balanced data. To train detectors in a fully supervised manner, the data must be properly labelled. However, identifying abnormal colonies grown in a Petri dish can be difficult even for a trained specialist. Additionally, variable lighting conditions can make detection even more difficult, as can be observed in our case in the EfficientDet-D2 predictions for the underrepresented bright samples.

Acknowledgements
Project "Development of a new method for detection and identifying bacterial colonies using artificial neural networks and machine learning algorithms" is cofinanced from European Union funds under the European Regional Development Funds as part of the Smart Growth Operational Program. Project implemented as part of the National Centre for Research and Development: Fast Track (grant no. POIR.01.01.01-00-0040/18).