1 Introduction

Fish are a crucial part of people’s diets owing to their tender flesh, high-quality protein, and other nutrients. However, fish aquaculture still suffers from a low level of automation and makes little use of artificial intelligence. Key aspects of fish aquaculture, such as feeding and health monitoring, constantly require substantial manual intervention. Manual methods cannot monitor fish growth in real time, and the information they provide lags behind events, making timely adjustment of aquaculture procedures difficult. Computer vision technology has been widely adopted in recent years. Applying intelligent vision methods to fish aquaculture not only saves human and material resources but also provides more timely, accurate, and stable data, considerably improving production efficiency and management. Capturing images in the aquaculture pond in real time through underwater cameras enables contactless collection of fish data, reducing the stress and physical damage to the fish population caused by traditional data collection methods. Over an aquaculture period that can span several years, continuous automated monitoring allows fish farms to save considerably on labor costs and achieve better economic returns.

Keypoints on fish bodies, such as points on the mouth, eye, anterior end of the dorsal fin, posterior end of the dorsal fin, and tail fin, are important biological indicators and can be used to calculate individual size, mass, and behavior. Developing accurate keypoint detection methods for fish bodies is therefore necessary. With the development of deep learning and deeper neural network models, the accuracy and efficiency of object detection have improved considerably, enabling more convenient and accurate services and a better understanding of morphological features through the recognition, localization, and classification of objects in images or videos. However, locating fish keypoints remains a challenging problem, and few related studies are available in the literature. Recent studies have mainly focused on detecting keypoints for stereo matching (Suo et al. 2020; Lin et al. 2021). The large variability in fish morphology and the influence of factors such as light and occlusion make accurately locating fish keypoints difficult. Solving this problem can bring considerable benefits to fish production and conservation.

Currently, studies on fish keypoint detection remain scarce. Keypoint detection is more commonly applied in facial recognition and human pose estimation, so ideas from human pose and keypoint estimation methods can usefully be borrowed for fish keypoint detection. Traditional keypoint detection relies on feature-operator filtering, template matching (Li et al. 2021), and corner detection (Zhang et al. 2007). Yang et al. (2020) used machine learning to achieve fast detection of three-dimensional surface keypoints, substantially improving detection efficiency and accuracy. Wu and Guan (2019) constructed an efficient face recognition model by combining face keypoint information with Long Short-Term Memory. Toshev and Szegedy (2014) conducted the first study on single-object two-dimensional (2D) human pose estimation and proposed a multistage, iterative, deep convolutional neural network based on coordinate regression. The network extracts keypoint features and regresses their coordinates, iterating to obtain keypoint locations and the 2D pose. However, because this method directly regresses the spatial locations of keypoints, model convergence is difficult; consequently, performance cannot be guaranteed, and the results are hard to transfer to other scenarios. Tompson et al. (2014) proposed a multistage convolutional neural network combined with a 2D pose estimation method and a Markov random field model. This model uses a multistage cascade to extract keypoint features and regresses the feature information to obtain keypoint locations, but it still generalizes poorly owing to coordinate regression.

The abovementioned 2D human pose estimation networks primarily use supervised training for keypoint feature extraction. A high-quality keypoint dataset is required because the spatial distribution and number of keypoints differ across objects, and learning cannot simply be transferred from other methods. When creating training samples, the keypoints of the objects must be identified and labeled according to the spatial requirements. Rigid objects, such as fixed-wing vehicles, exhibit a strong and stable correlation between keypoints and pose. Nonrigid objects, such as human bodies and fish, have more keypoints, which can move freely, so their pose structure is variable. Consequently, the requirements on dataset size and keypoint-labeling benchmarks when preparing training samples are highly demanding. Currently, keypoint training samples for posture predominantly target the human body. Papandreou et al. (2017) built a heat map-based regression model on the object bounding boxes output by Faster R-CNN and obtained more accurate results. Newell et al. (2016) proposed the stacked hourglass network, which extracts features at different resolutions through up- and down-sampling and outputs keypoint classification heat maps at the last level. The image-to-image fully convolutional network UNet (Ronneberger et al. 2015) can also be used for keypoint detection. HRNet, proposed by Microsoft Research Asia (Sun et al. 2019), changed the convolutional network from a series structure to a parallel structure for the first time and considerably improved the network’s ability to extract information from feature maps at different resolutions. Bulat et al. (2020) found that some layers in keypoint detection networks do not need residual connections and proposed adding learnable connection weights to decide whether a residual link is needed. Many methods and models exist for single-object detection, but fish in a breeding pond often cluster, producing many overlapping regions between objects, which makes detecting the keypoints of the main objects of interest difficult.

Considering the aforementioned problems, this study proposes a detection-regression scheme for locating fish keypoints. The approach combines an optimized object detection algorithm (YOLOv5) with a human pose estimation network (Lite-HRNet): improving the YOLOv5 neck network and loss function incorporates more features without increasing the computational effort, while YOLOv5 offers high accuracy and efficiency and Lite-HRNet keeps the model small without sacrificing either. By further improving Lite-HRNet with weight assignment and a distribution-aware strategy, we construct a deep learning model that detects fish keypoints efficiently and accurately.

2 Fish keypoint detection

This section introduces the deep learning-based framework for locating fish keypoints. Figure 1 shows the overall structure of the framework. We also introduce the two datasets used to train the two stages of our models.

Fig. 1 Framework of the study. After the optimized YOLOv5 object detection stage, an improved Lite-HRNet network locates the fish keypoints

2.1 Fish object detection

2.1.1 Fish detection dataset with augmentation

During the object detection phase, fish are the only objects to be detected. Our study was conducted in a marine aquaculture environment, and a single class label, fish, was used when creating the YOLO-format dataset. Given the complexity of the aquaculture environment, 3,000 fish images were collected through manual photography and web searches and used as the dataset, as shown in Fig. 2. These data were partitioned into training and test sets at a ratio of 4:1.

Fig. 2 Object detection dataset. Three thousand real fish images were acquired through manual photography and web searches

Deep learning methods for object detection place demanding requirements on datasets, including the number of samples and the diversity of objects; satisfactory training results can be achieved only with diverse datasets. The high cost of acquiring farmed fish data makes it difficult to build datasets as good as those produced by professional teams, particularly in terms of scene richness; here, only samples from different breeding pools of a single farm are used. Hence, given the size and quality of the dataset, data augmentation is necessary during training to increase the quantity and richness of the data and ensure that the model achieves good results.

As shown in Fig. 2, the environment within the breeding ponds is relatively simple, generally containing only a fairly uniform background and the objects (fish). Directly using images captured in the breeding pond as YOLOv5 input yields a model that performs well in this single scene but degrades severely when the test scene becomes complex. Furthermore, because there is only one class label, the model is not sensitive to the distinction between fish and other objects and can produce false positives when objects absent from the dataset appear.

To make YOLOv5 object detection robust and address this insensitivity to object distinction, the copy-paste strategy (Ghiasi et al. 2020) can be used for data augmentation to introduce objects from other scenarios while simultaneously expanding the size and richness of the dataset.

The core idea of copy-paste is to cut instances out of one image and paste them into another, deliberately creating overlaps to address the occlusion problem in semantic segmentation, and to train on the resulting composites.
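As a hedged illustration, a minimal copy-paste step might be implemented as follows with OpenCV and NumPy. The instance mask is assumed to be available (Ghiasi et al. 2020 obtain it from segmentation annotations); the function name and scale range are illustrative, not the authors’ implementation.

```python
import cv2
import numpy as np

def copy_paste(fish_img, fish_mask, background, max_scale=0.5):
    """Paste one fish instance onto a new background scene.

    fish_img:   HxWx3 crop containing the fish instance
    fish_mask:  HxW binary mask of the fish pixels (1 = fish)
    background: target scene image (Hb x Wb x 3)
    Returns the composite and the pasted box (x, y, w, h) for relabeling.
    """
    bh, bw = background.shape[:2]
    # Rescale the instance so its longest side stays well inside the scene.
    scale = np.random.uniform(0.2, max_scale) * min(bh, bw) / max(fish_img.shape[:2])
    fish = cv2.resize(fish_img, None, fx=scale, fy=scale)
    mask = cv2.resize(fish_mask.astype(np.uint8), (fish.shape[1], fish.shape[0]),
                      interpolation=cv2.INTER_NEAREST)
    h, w = fish.shape[:2]
    # Random placement; overlaps with existing objects are intentional.
    x = np.random.randint(0, bw - w)
    y = np.random.randint(0, bh - h)
    out = background.copy()
    region = out[y:y + h, x:x + w]
    region[mask > 0] = fish[mask > 0]   # overwrite only the fish pixels
    return out, (x, y, w, h)
```

The returned box can then be written out as the YOLO-format label of the pasted instance.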

Fig. 3 Partial scene diagram. One hundred images of various scene styles were collected as additional dataset backgrounds

In this study, 100 images of various scene styles were collected as additional dataset backgrounds, as shown in Fig. 3. Twenty percent of the fish samples were randomly selected, cropped, and superimposed onto these new scenes (Fig. 4) so that the fish stand in clear contrast to other objects. The algorithm can thus learn the difference between fish and other objects in complex scenes, improving the feature recognition capability of the model. In addition, we use image-quality augmentation methods, such as random flipping, sharpness change, contrast adjustment, saturation change, and color change, to improve model quality, as shown in Table 1.

Fig. 4 Scene graph processed via copy-paste. Randomly selected fish samples (20\(\%\)) are cropped and superimposed onto new scenes, placing the fish in clear contrast to other objects

Table 1 Data augmentation types and parameters
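As a rough sketch of the image-quality augmentations listed above, a torchvision pipeline might look as follows; the parameter ranges are assumptions, since Table 1’s exact values are not reproduced here.

```python
import torchvision.transforms as T

# Hypothetical parameter ranges; the actual values follow Table 1.
train_augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                        # random flip
    T.RandomAdjustSharpness(sharpness_factor=2, p=0.3),   # sharpness change
    T.ColorJitter(brightness=0.2, contrast=0.3,           # contrast adjustment
                  saturation=0.3, hue=0.05),              # saturation / color change
])
```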

2.1.2 Fish object detection based on improved YOLOv5

To accurately locate keypoints on fish bodies, we first detect individual fish and then find the keypoint locations. Deep learning-based object detection algorithms fall into two categories: one-stage and two-stage. Common one-stage detectors include YOLOv5, SSD, and RetinaNet; common two-stage detectors include R-CNN, Fast R-CNN, and Faster R-CNN. A real farming environment demands high detection speed. Although Faster R-CNN substantially improved on the original R-CNN and Fast R-CNN, its detection speed is still insufficient for real-time monitoring. The YOLO series improves detection speed while maintaining high accuracy, meeting the demand for real-time monitoring, and holds a substantial speed advantage over two-stage detectors. YOLOv5, a major version of the series, adaptively computes the best anchor box values for a given training set and improves image scaling at the input. In addition, the focus structure in the backbone network considerably reduces computation and memory usage. YOLOv5 is therefore suitable for large-scale deployment in scenarios with limited computational resources.

The YOLOv5 network comprises several components and structures. The CBL block includes a convolutional layer, a BN (batch normalization) layer, and a LeakyReLU activation function, while the Resunit comprises two CBLs and a shortcut link; together they form the basic structure of the network. YOLOv5 employs various CSP (cross-stage partial network) structures in the backbone and neck to enhance interaction between features. The focus structure uses slicing to reduce the resolution of the input image, thereby reducing the computational effort. The neck uses feature pyramid network (FPN) and path aggregation network (PAN) structures to achieve multiscale feature fusion, considerably improving small object detection. FPN is a network structure for solving multiscale problems in object detection; it effectively reduces computation and improves small object detection by adjusting the network hierarchy, as shown in Fig. 5. PAN improves the expressive ability of FPN by introducing a bottom-up feature fusion path; the structure of PAN is shown in Fig. 6.
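For reference, the focus slicing step can be sketched as follows: each 2 \(\times\) 2 spatial neighborhood is rearranged into the channel dimension before a CBL-style convolution, so the spatial resolution halves without discarding pixel information. This is a minimal PyTorch sketch rather than the exact YOLOv5 implementation.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into four pixel-interleaved sub-images, concatenate
    them along the channel axis, and fuse with a CBL-style convolution."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch * 4, out_ch, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),   # Conv + BN + LeakyReLU = CBL
        )

    def forward(self, x):
        # (B, C, H, W) -> (B, 4C, H/2, W/2): every 2x2 block becomes 4 channels,
        # so resolution halves without discarding any pixel information.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)
```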

Fig. 5 Structure of FPN. FPN is a network structure for solving multiscale problems in object detection

Fig. 6 Structure of PAN. PAN improves the expressive ability of FPN by introducing a bottom-up feature fusion path, improving small object detection without increasing the computational effort

PANet simply adds a bottom-to-top fusion path when performing feature fusion and does not consider the different contributions of different features. In a farming environment, fish species differ in color and size and favor different extracted features. To address this, EfficientDet’s BiFPN is used instead of PANet. BiFPN adds paths carrying contextual information on top of FPN, and each path carries a weight, achieving weighted multiscale feature fusion; its structure is shown in Fig. 7. BiFPN offers several improvements over PANet (a minimal sketch of one weighted fusion node follows the list below):

(1) BiFPN eliminates nodes with only one input edge, considerably simplifying the bidirectional network.

(2) When the input and output nodes are at the same level, BiFPN adds an extra edge so that more features can be incorporated.

(3) BiFPN treats each bidirectional pathway as a feature layer, achieving a higher level of feature fusion.
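As a minimal sketch under the EfficientDet formulation (fast normalized fusion), a single BiFPN fusion node can be written as follows; resizing the inputs to a common resolution is assumed to happen outside the module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion: each input feature map carries a learnable
    non-negative weight, normalized so that the contributions sum to one."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, features):
        # features: tensors already resized to a common resolution and width
        w = F.relu(self.weights)          # keep the weights non-negative
        w = w / (w.sum() + self.eps)      # normalize the contributions
        return sum(wi * fi for wi, fi in zip(w, features))
```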

Fig. 7 Structure of BiFPN. BiFPN adds paths with contextual information on top of FPN, and each path corresponds to a weight, realizing multiscale feature fusion

2.2 Locating fish keypoints

Once we have achieved individual fish object detection, we can locate the keypoints of each part of the fish. This section first introduces the definition of keypoints and then presents the implementation of fish keypoint detection.

2.2.1 Fish keypoints dataset

As shown in Fig. 8, most farmed fish have the following seven structures on which keypoints can be selected: mouth, eye, anterior end of the dorsal fin (dorsal fin1), posterior end of the dorsal fin (dorsal fin2), tail fin, anal fin, and pelvic fin. The parts represented by these 7 keypoints exist on almost all fish and span the whole body, making them representative of the fish’s posture when swimming and beneficial to the accuracy of subsequent reprojection calculations.
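For concreteness, the seven keypoints can be fixed as an ordered index set; the ordering below is an illustrative assumption, and any fixed ordering works as long as annotations and model outputs agree on it.

```python
# Hypothetical index order; the dataset annotations and the model output
# channels must use the same ordering.
FISH_KEYPOINTS = [
    "mouth",        # 0
    "eye",          # 1
    "dorsal_fin1",  # 2  anterior end of dorsal fin
    "dorsal_fin2",  # 3  posterior end of dorsal fin
    "tail_fin",     # 4
    "anal_fin",     # 5
    "pelvic_fin",   # 6
]
```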

Fig. 8 Keypoint definition. Seven keypoints are selected: mouth, eye, anterior end of the dorsal fin (dorsal fin1), posterior end of the dorsal fin (dorsal fin2), tail fin, anal fin, and pelvic fin

Fig. 9 Keypoint labeling. The object detection module outputs detection box coordinates; single fish images are cropped according to these coordinates and then annotated

We constructed a dataset of 2,000 images for this study, some taken at Hainan Chenhai Aquatic Co. and others from the Internet. First, the object detection module of the YOLOv5 network output the detection box coordinates. After cropping single fish images according to these coordinates, we used the open-source VGG Image Annotator to annotate the fish keypoints, obtaining 2,000 images with fish keypoint information. Representative labeled keypoint data are shown in Fig. 9.
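The detection-to-annotation handoff can be sketched as follows: each YOLOv5 box is padded slightly and used to crop a single-fish patch, which is resized to the keypoint network input. The padding ratio and function name are assumptions for illustration.

```python
import cv2

def crop_fish(image, box, pad=0.1, size=(288, 384)):
    """Crop a single fish from `image` given a detection box (x1, y1, x2, y2).
    A small padding margin keeps fins near the box edge inside the crop."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    dx, dy = (x2 - x1) * pad, (y2 - y1) * pad
    x1, y1 = max(int(x1 - dx), 0), max(int(y1 - dy), 0)
    x2, y2 = min(int(x2 + dx), w), min(int(y2 + dy), h)
    # cv2.resize takes (width, height): width 288, height 384.
    return cv2.resize(image[y1:y2, x1:x2], size)
```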

2.2.2 Locating fish keypoints

To accurately determine the locations of fish keypoints, improve detection efficiency, and reduce sensitivity to low-resolution images, we drew on the Lite-HRNet network, originally designed for locating keypoints on human bodies. We improve the network from two perspectives:

(1) Keypoints are assigned corresponding weights.

(2) More accurate keypoint regression is achieved.

Essentially, Lite-HRNet replaces the modules in HRNet with the Shuffle block from the lightweight backbone ShuffleNet (Zhang et al. 2018); the structure of the Shuffle block is shown in Fig. 10. Because HRNet exchanges information between parallel subnetworks using 1 \(\times\) 1 convolutions, traversing every point of the feature map becomes a computational bottleneck. Lite-HRNet therefore proposes conditional channel weighting (CCW) to replace the 1 \(\times\) 1 convolution, as shown in Fig. 10. In this study, Lite-HRNet-30 with an input scale of 384 \(\times\) 288 is used as the keypoint detection network to achieve good accuracy.
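The CCW idea can be sketched for a single branch as follows: channel weights are computed from globally pooled features with cheap pointwise operations and applied multiplicatively, avoiding a 1 \(\times\) 1 convolution over every spatial position. The actual Lite-HRNet conditions the weights on all parallel resolutions; this SE-style single-branch simplification is an assumption for brevity.

```python
import torch.nn as nn

class ChannelWeighting(nn.Module):
    """Compute per-channel weights from globally pooled features and rescale
    the map: the pointwise operations run on a 1x1 vector rather than on
    every spatial position of the feature map."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # (B, C, H, W) -> (B, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1),  # cheap bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # weights in (0, 1)
        )

    def forward(self, x):
        return x * self.fc(x)   # broadcast the (B, C, 1, 1) weights spatially
```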

Fig. 10 Lite-HRNet basic module structure. Lite-HRNet uses the Shuffle block from the lightweight backbone ShuffleNet instead of the module in HRNet

(1) Improved Lite-HRNet network based on weight distribution. The keypoint detection model must detect 7 keypoints on various parts of the fish body. In a real breeding environment, some fish have smaller pelvic fins and others narrower dorsal fins. If every keypoint carried the same weight, detection would be poor for parts with little pixel information, and during training the model would gradually favor points with richer pixel information. Therefore, this study treats the weight of each keypoint as a hyperparameter and assigns 1.5 times the normal weight to points with little pixel information, such as the eyes; when computing the loss, the loss of such keypoints is multiplied by an expansion coefficient greater than 1 (a minimal sketch of this weighted loss is given after this list). Farm managers can also adjust this parameter to the specific body morphology of a given fish species to improve the model’s detection performance for that species.

(2) Further improvement based on the distribution-aware coordinate representation of keypoints (DARK) (Zhang et al. 2019). When building the Lite-HRNet keypoint network, the single-fish images cropped by YOLOv5 in the previous section are uniformly resized to 384 \(\times\) 288; after the model’s fourfold downsampling, the feature map resolution is 96 \(\times\) 72. While downsampling lets the model learn deeper semantic features, it also introduces quantization error. As shown in Fig. 11, suppose a keypoint lies at (5, 5) in the original image; after fourfold downsampling its exact position is (1.25, 1.25), which is quantized to (1, 1) on the feature map. When the prediction is upsampled back to the original resolution, (1, 1) maps to (4, 4), a deviation of one pixel from the true position. This down-then-upsampling process thus biases the coordinates, and the magnitude of the bias grows with the sampling factor. The distribution-aware improvement is designed to solve this problem: the DARK method consists of an efficient Taylor-expansion-based coordinate decoding and an unbiased sub-pixel-centered coordinate encoding (Zhang et al. 2019).
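Returning to improvement (1), a minimal sketch of a per-keypoint weighted heatmap loss might look as follows; apart from the stated 1.5\(\times\) eye weight, the coefficients and the loss form (MSE over heatmaps) are assumptions.

```python
import torch

# Per-keypoint loss weights, ordered as in FISH_KEYPOINTS above. The eye
# receives the stated 1.5x expansion; all other coefficients are assumed.
KP_WEIGHTS = torch.tensor([1.0, 1.5, 1.0, 1.0, 1.0, 1.0, 1.0])

def weighted_heatmap_mse(pred, target, weights=KP_WEIGHTS):
    """pred, target: (B, K, H, W) heatmaps; weights: (K,) expansion factors."""
    per_kp = ((pred - target) ** 2).mean(dim=(0, 2, 3))   # per-keypoint MSE
    return (per_kp * weights.to(pred.device)).sum() / weights.sum()
```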

Fig. 11 Keypoint output error. The process of downsampling followed by upsampling causes a deviation in the coordinates whose magnitude is proportional to the sampling factor

DARK first assumes that the heatmap output by keypoint detection follows the two-dimensional Gaussian distribution shown below:

$$\begin{aligned} \mathcal {G}(x ; \mu , \Sigma )=\frac{1}{(2 \uppi )|\Sigma |^{\frac{1}{2}}} \exp \left( -\frac{1}{2}(x-\mu )^T \Sigma ^{-1}(x-\mu )\right) \end{aligned}$$
(1)

Taking the logarithm, we get:

$$\begin{aligned} \mathcal {P}(x ; \mu , \Sigma )=\ln (\mathcal {G})=-\ln (2 \uppi )-\frac{1}{2} \ln (|\Sigma |)-\frac{1}{2}(x-\mu )^T \Sigma ^{-1}(x-\mu ) \end{aligned}$$
(2)

Equation (2) is illustrated in Fig. 12. As the figure shows, the limited resolution shifts the detected keypoint from the true position \(\mu\) to the grid point m. The correct keypoint position is at the maximum \(\mu\), where the derivative is zero.

Differentiating Eq. (2) and evaluating at \(\mu\) yields Eq. (3).

$$\begin{aligned} \left. \mathcal {D}^{\prime }(x)\right| _{x=\mu }=\left. \frac{\partial \mathcal {P}^T}{\partial x}\right| _{x=\mu }=-\left. \Sigma ^{-1}(x-\mu )\right| _{x=\mu }=0 \end{aligned}$$
(3)
Fig. 12 Gaussian distribution of keypoints. The limited output resolution shifts the keypoint from the true position \(\mu\) to point m; the correct location is the extremum at \(\mu\), where the derivative is zero

Meanwhile, Eq. (2) is expanded as a second-order Taylor series around m:

$$\begin{aligned} \mathcal {P}(x)=\mathcal {P}(m)+\mathcal {D}^{\prime }(m)(x-m)+\frac{1}{2}(x-m)^T \mathcal {D}^{\prime \prime }(m)(x-m) \end{aligned}$$
(4)

Differentiating Eq. (4) gives:

$$\begin{aligned} \mathcal {P}^{\prime }(x)=\mathcal {D}^{\prime }(m)+\mathcal {D}^{\prime \prime }(m)(x-m) \end{aligned}$$
(5)

Substituting \(x=\mu\) into Eq. (5), where the derivative vanishes according to Eq. (3), we get:

$$\begin{aligned} \mathcal {P}^{\prime }(\mu )=\mathcal {D}^{\prime }(m)+\mathcal {D}^{\prime \prime }(m)(\mu -m)=0 \end{aligned}$$
(6)

Rearranging Eq. (6) gives:

$$\begin{aligned} \mu =m-\left( \mathcal {D}^{\prime \prime }(m)\right) ^{-1} \mathcal {D}^{\prime }(m) \end{aligned}$$
(7)

Equation (7) shows that \(\mu\) can be recovered from the first- and second-order derivatives of \(\mathcal {D}\) at m; from Eq. (2), the second-order derivative is given by Eq. (8).

$$\begin{aligned} \mathcal {D}^{\prime \prime }(m)=\left. \mathcal {D}^{\prime \prime }(x)\right| _{x=m}=-\Sigma ^{-1} \end{aligned}$$
(8)

Combining Eqs. (7) and (8), we obtain:

$$\begin{aligned} \mu =m-\left( \mathcal {D}^{\prime \prime }(m)\right) ^{-1} \mathcal {D}^{\prime }(m) \end{aligned}$$
(9)

Using Eq. (9), \(\mu\) can be obtained and the original keypoint coordinates corrected. Exploiting the distribution statistics of the heat map in this way yields more accurate final 2D keypoint coordinates. More importantly, the decoding requires only first- and second-order derivatives and thus incurs little computational cost. Because the distribution-aware method requires almost no structural changes to the original model, it can be migrated to most keypoint detection algorithms: at the prediction stage, the heat map output by the trained Lite-HRNet is simply passed to the decoder, which infers the most likely maximum activation position and thereby improves the accuracy of the model.
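As a hedged illustration of this decoding, the following sketch refines the integer argmax with central finite-difference estimates of the first- and second-order derivatives of the log heat map; Zhang et al. (2019) additionally smooth the heat map with a small Gaussian kernel before decoding, which is omitted here.

```python
import numpy as np

def dark_decode(heatmap):
    """Refine the integer argmax m of one keypoint heatmap to sub-pixel
    accuracy via Eq. (9): mu = m - D''(m)^{-1} D'(m), with the derivatives
    of the log heatmap estimated by central finite differences."""
    h, w = heatmap.shape
    logp = np.log(np.maximum(heatmap, 1e-10))
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    if not (1 <= x < w - 1 and 1 <= y < h - 1):
        return float(x), float(y)             # cannot refine at the border
    dx = (logp[y, x + 1] - logp[y, x - 1]) / 2.0             # first derivatives
    dy = (logp[y + 1, x] - logp[y - 1, x]) / 2.0
    dxx = logp[y, x + 1] - 2.0 * logp[y, x] + logp[y, x - 1]  # Hessian terms
    dyy = logp[y + 1, x] - 2.0 * logp[y, x] + logp[y - 1, x]
    dxy = (logp[y + 1, x + 1] - logp[y + 1, x - 1]
           - logp[y - 1, x + 1] + logp[y - 1, x - 1]) / 4.0
    hess = np.array([[dxx, dxy], [dxy, dyy]])
    if abs(np.linalg.det(hess)) < 1e-12:
        return float(x), float(y)             # degenerate curvature, keep m
    offset = -np.linalg.solve(hess, np.array([dx, dy]))      # -(D'')^{-1} D'
    return float(x + offset[0]), float(y + offset[1])
```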

3 Results and analysis

3.1 Fish object detection results and analysis

The model converged after 50 training epochs. Its performance was tested in the underwater aquaculture pond; the results are shown in Fig. 13. As the figure shows, the YOLOv5 model trained with copy-paste data augmentation achieves satisfactory fish detection and accurately locates the identifiable fish in the image.

Fig. 13 Object detection results. All panels show detection results after data augmentation; the model locates the fish bodies well

To verify that copy-paste augmentation improves detection in complex scenes, we further set up a control experiment with the same initialization parameters, training platform, training period, and base training dataset for the YOLOv5 model. As shown in Fig. 14, the control model, after converging and running on the test set, tends to overfit in complex scenes and misidentifies other objects as fish. After applying the copy-paste strategy, sensitivity to different categories in complex scenes improves.

Fig. 14 Comparison of the copy-paste data augmentation effect. The right panel shows the result after augmentation: object detection is more discriminative in complex scenes and less likely to misidentify other objects

In this study, mAP@0.5, mAP@0.75, and mAP@0.5:0.95 are used as indicators (Jiao et al. 2019) to evaluate the detection performance of the YOLOv5 model; the results are shown in Table 2. Baseline denotes training with only the official YOLOv5; +BC denotes training with the improved YOLOv5; +CP denotes adding copy-paste data augmentation to the baseline; and +BCP denotes adding the augmentation on top of +BC. The results in Table 2 show that the improved YOLOv5 and the copy-paste strategy considerably improve accuracy and are applicable to the farm environment.

Table 2 Improved YOLOv5 object detection performance evaluation. The detection performance of the YOLOv5 model was evaluated using mAP@0.5, mAP@0.75, and mAP@0.5:0.95 as metrics

3.2 Performance evaluation of fish keypoint detection model

The Lite-HRNet model is initialized with random values via He initialization (He et al. 2015) and converges after 50 training epochs. The test data are single-fish crops output by the improved YOLOv5. The prediction threshold is 0.5, and each keypoint is taken as the most confident pixel in its heatmap. As shown in Fig. 15, part of the anal fin of the crucian carp is occluded, yet it is still identified fairly accurately. As shown in Fig. 16, keypoint detection on images from real farm environments also achieves relatively good results.

Fig. 15 Examples of keypoint detection results. The figure shows recognition results for images in the dataset collected from the Internet

Fig. 16 Keypoint identification results for fish photographed in a real aquaculture environment

Some experimental results are not satisfactory, as shown in Fig. 17. When the fish body is of a single color, locating the keypoints of the eyes is difficult. When the fish body is distorted or occluded, the recognition rate of keypoints in various parts decreases. In scenes with too many fish, some keypoints are missed, resulting in substantial information loss. Future work could supplement missing keypoint information with statistical counting methods or combine results from adjacent frames for better estimates.

Fig. 17 Failure cases. Fish body color and occlusion of body parts both affect the accuracy of keypoint detection

The model framework in this paper is built on a high-resolution network and fuses features from multiple resolutions and layers. The evaluation metric is the object keypoint similarity (OKS) (Ronchi et al. 2017), defined as follows:

$$\begin{aligned} O K S_p=\frac{\sum _i \exp \left\{ -d_{p i}^2 / 2 S_p^2 \sigma _i^2\right\} \delta \left( v_{p i}=1\right) }{\sum _i \delta \left( v_{p i}=1\right) } \end{aligned}$$
(10)

In Eq. (10), \(d_{pi}\) is the Euclidean distance from a ground-truth keypoint to the predicted point, \(S_p\) denotes the object scale, \(\sigma _i\) is the normalization factor of the i-th keypoint, \(\delta\) is a function used to filter visible points, \(v_{pi}\) indicates whether the i-th keypoint of the p-th fish is visible, i is the keypoint id, and p is the id of the fish object instance.
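Equation (10) can be evaluated directly. A minimal NumPy sketch follows; the per-keypoint normalization factors \(\sigma _i\) are left as inputs, since their values for fish are not listed here.

```python
import numpy as np

def oks(pred, gt, visible, scale, sigmas):
    """Object keypoint similarity for one fish instance, per Eq. (10).

    pred, gt: (K, 2) predicted / ground-truth keypoint coordinates
    visible:  (K,) boolean mask, True where v_pi = 1
    scale:    object scale S_p (e.g., sqrt of the bounding-box area)
    sigmas:   (K,) per-keypoint normalization factors sigma_i
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)                  # squared distances
    sim = np.exp(-d2 / (2.0 * scale ** 2 * sigmas ** 2))   # per-keypoint terms
    return float(sim[visible].mean()) if visible.any() else 0.0
```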

Taking the average OKS as the keypoint detection indicator, the performance of different strategies is shown in Table 3. Both the weight assignment and DARK improve model accuracy, and the indicator increases clearly compared with the original network.

Table 3 Keypoint detection performance evaluation. Using the average OKS as the keypoint detection indicator, the performance of different strategies is shown in the table

4 Conclusions

This study addresses the problem of fish keypoint detection and locates seven predefined fish keypoints using a detection-regression scheme. Results show that the algorithm maintains good performance in complex underwater environments. The main conclusions of this study are as follows:

(1) The accuracy and recall rate of this method in identifying fish objects and their keypoints are 70.1\(\%\) and 73.4\(\%\), respectively, and the detection speed is 20 frames per second. The keypoints of fish can thus be acquired in real time from fish video streams.

(2) A comparison with current fish keypoint detection methods shows that the proposed method has clear advantages in efficiency and accuracy. It can provide technical support and insights for subsequent analyses of fish behavioral patterns and related studies.

Although the method presented here performs well on the datasets in this study, it can be improved further. The object detection and keypoint detection modules are separate stages of the proposed pipeline, and they are trained on distinct datasets. Future work could integrate the two modules into a single pipeline for end-to-end detection, improving the overall training efficiency of the model.