1 Introduction

Fish are a crucial part of people’s diets owing to their tender flesh, high-quality protein, and other nutrients. However, fish aquaculture still suffers from a low level of automation and makes little use of artificial intelligence. Key aspects of fish aquaculture, such as feeding and health monitoring, constantly require substantial manual intervention. Manual methods cannot monitor fish growth in real time, and the information they provide lags behind events, making timely adjustment of aquaculture procedures difficult. Computer vision technology has been widely adopted in recent years. Applying intelligent vision methods to fish aquaculture not only saves human and material resources but also provides more timely, accurate, and stable data, considerably improving production efficiency and management. Capturing images in the aquaculture pond in real time through underwater cameras enables contactless collection of fish data, reducing the stress and physical damage to the fish population caused by traditional data collection methods. Over an aquaculture period that can span several years, continuous automated monitoring allows fish farms to save considerably on labor costs and achieve better economic returns.

Keypoints on fish bodies, such as points on the mouth, eye, anterior end of the dorsal fin, posterior end of the dorsal fin, and tail fin, are important biological indicators and can be used to calculate individual size, mass, and behavior. Developing accurate keypoint detection methods for fish bodies is therefore necessary. With the development of deep learning and deeper neural network models, the accuracy and efficiency of object detection have improved considerably, enabling more convenient and accurate services and a better understanding of morphological features through the recognition, localization, and classification of objects in images or videos. However, locating fish keypoints remains a challenging problem, and few related studies are available in the literature. Recent studies have mainly focused on detecting keypoints for stereo matching (Suo et al. 2020; Lin et al. 2021). The large variability in fish morphology and the influence of factors such as light and occlusion make accurately locating fish keypoints difficult. Solving this problem can bring considerable benefits to fish production and conservation.

Currently, studies on fish keypoint detection remain scarce. Keypoint detection is more commonly applied in facial recognition and human pose estimation, so ideas from human pose and keypoint estimation methods can usefully be borrowed for fish keypoint detection. Traditional keypoint detection relies on feature-operator filtering, template matching (Li et al. 2021), and corner detection (Zhang et al. 2007). Yang et al. (2020) used machine learning to achieve fast detection of three-dimensional surface keypoints, substantially improving detection efficiency and accuracy. Wu and Guan (2019) constructed an efficient face recognition model by combining face keypoint information with Long Short-Term Memory. Toshev and Szegedy (2014) conducted the first study on single-object two-dimensional (2D) human pose estimation and proposed a multistage, iterative, deep convolutional neural network based on coordinate regression. The network extracts keypoint features and regresses their coordinates, iterating to obtain keypoint locations and the 2D pose. However, because this method directly regresses the spatial locations of keypoints, model convergence is difficult; consequently, performance cannot be guaranteed, and the results are hard to transfer to other scenarios. Tompson et al. (2014) proposed a multistage convolutional neural network combined with a 2D pose estimation method and a Markov random field model. This model uses a multistage cascade to extract keypoint features and regresses the feature information to obtain keypoint locations, but it still generalizes poorly owing to coordinate regression.

The abovementioned 2D human pose estimation networks primarily use supervised training for keypoint feature extraction. A high-quality keypoint dataset is required because the spatial distribution and number of keypoints differ across objects, and learning cannot simply be transferred from other methods. When creating training samples, the keypoints of the objects must be identified and labeled according to the spatial requirements. Rigid objects, such as fixed-wing vehicles, exhibit a strong and stable correlation between keypoints and pose. Nonrigid objects, such as human bodies and fish, have more keypoints, which can move freely, so their pose structure is variable. Consequently, the requirements on dataset size and keypoint-labeling benchmarks when preparing training samples are highly demanding. Currently, keypoint training samples for posture predominantly target the human body. Papandreou et al. (2017) built a heat map-based regression model on the object bounding boxes output by Faster R-CNN and obtained more accurate results. Newell et al. (2016) proposed the stacked hourglass network, which extracts features at different resolutions through up- and down-sampling and outputs keypoint classification heat maps at the last level. The image-to-image fully convolutional network UNet (Ronneberger et al. 2015) can also be used for keypoint detection. HRNet, proposed by Microsoft Research Asia (Sun et al. 2019), changed the convolutional network from a series structure to a parallel structure for the first time and considerably improved the network’s ability to extract information from feature maps at different resolutions. Bulat et al. (2020) found that some layers in keypoint detection networks do not need residual connections and proposed adding learnable connection weights to decide whether a residual link is needed. Many methods and models exist for single-object detection, but fish in a breeding pond often cluster, producing many overlapping regions between objects, which makes detecting the keypoints of the main objects of interest difficult.

Considering the aforementioned problems, this study proposes a detection-regression scheme for locating fish keypoints. The approach combines an optimized object detection algorithm (YOLOv5) with a human pose estimation network (Lite-HRNet): improving the YOLOv5 neck network and loss function incorporates more features without increasing the computational effort, while YOLOv5 offers high accuracy and efficiency and Lite-HRNet keeps the model small without sacrificing either. By further improving Lite-HRNet with weight assignment and a distribution-aware strategy, we construct a deep learning model that detects fish keypoints efficiently and accurately.

2 Fish keypoint detection

This section introduces the deep learning-based framework for locating fish keypoints. Figure 1 shows the overall structure of the framework. We also introduce the two datasets used to train the two stages of our models.

Fig. 1 Framework of the study. After the optimized YOLOv5 object detection stage, an improved Lite-HRNet network locates the fish keypoints

2.1 Fish object detection

2.1.1 Fish detection dataset with augmentation

During the object detection phase, fish are the only objects to be detected. Our study was conducted in a marine aquaculture environment, and a single class label, fish, was used when creating the YOLO-format dataset. Given the complexity of the aquaculture environment, 3,000 fish images were collected through manual photography and web searches and used as the dataset, as shown in Fig. 2. These data were partitioned into training and test sets at a ratio of 4:1.

Fig. 2 Object detection dataset. Three thousand real fish images were acquired through manual photography and web searches

Deep learning methods for object detection place demanding requirements on datasets, including the number of samples and the diversity of objects; satisfactory training results can be achieved only with diverse datasets. The high cost of acquiring farmed fish data makes it difficult to build datasets as good as those produced by professional teams, particularly in terms of scene richness; here, only samples from different breeding pools of a single farm are used. Hence, given the size and quality of the dataset, data augmentation is necessary during training to increase the quantity and richness of the data and ensure that the model achieves good results.

As shown in Fig. 2, the environment within the breeding ponds is relatively simple, generally containing only a fairly uniform background and the objects (fish). Directly using images captured in the breeding pond as YOLOv5 input yields a model that performs well in this single scene but degrades severely when the test scene becomes complex. Furthermore, because there is only one class label, the model is not sensitive to the distinction between fish and other objects and can produce false positives when objects absent from the dataset appear.

To make YOLOv5 object detection robust and address this insensitivity to object distinction, the copy-paste strategy (Ghiasi et al. 2020) can be used for data augmentation to introduce objects from other scenarios while simultaneously expanding the size and richness of the dataset.

The core idea of copy-paste is to cut instances out of one image and paste them into another, deliberately creating overlaps to address the occlusion problem in semantic segmentation, and to train on the resulting composites.
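As a hedged illustration, a minimal copy-paste step might be implemented as follows with OpenCV and NumPy. The instance mask is assumed to be available (Ghiasi et al. 2020 obtain it from segmentation annotations); the function name and scale range are illustrative, not the authors’ implementation.

```python
import cv2
import numpy as np

def copy_paste(fish_img, fish_mask, background, max_scale=0.5):
    """Paste one fish instance onto a new background scene.

    fish_img:   HxWx3 crop containing the fish instance
    fish_mask:  HxW binary mask of the fish pixels (1 = fish)
    background: target scene image (Hb x Wb x 3)
    Returns the composite and the pasted box (x, y, w, h) for relabeling.
    """
    bh, bw = background.shape[:2]
    # Rescale the instance so its longest side stays well inside the scene.
    scale = np.random.uniform(0.2, max_scale) * min(bh, bw) / max(fish_img.shape[:2])
    fish = cv2.resize(fish_img, None, fx=scale, fy=scale)
    mask = cv2.resize(fish_mask.astype(np.uint8), (fish.shape[1], fish.shape[0]),
                      interpolation=cv2.INTER_NEAREST)
    h, w = fish.shape[:2]
    # Random placement; overlaps with existing objects are intentional.
    x = np.random.randint(0, bw - w)
    y = np.random.randint(0, bh - h)
    out = background.copy()
    region = out[y:y + h, x:x + w]
    region[mask > 0] = fish[mask > 0]   # overwrite only the fish pixels
    return out, (x, y, w, h)
```

The returned box can then be written out as the YOLO-format label of the pasted instance.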

Fig. 3 Partial scene diagram. One hundred images of various scene styles were collected as additional dataset backgrounds

In this study, 100 images of various scene styles were collected as additional dataset backgrounds, as shown in Fig. 3. Twenty percent of the fish samples were randomly selected, cropped, and superimposed onto these new scenes (Fig. 4) so that the fish stand in clear contrast to other objects. The algorithm can thus learn the difference between fish and other objects in complex scenes, improving the feature recognition capability of the model. In addition, we use image-quality augmentation methods, such as random flipping, sharpness change, contrast adjustment, saturation change, and color change, to improve model quality, as shown in Table 1.

Fig. 4 Scene graph processed via copy-paste. Randomly selected fish samples (20\(\%\)) are cropped and superimposed onto new scenes, placing the fish in clear contrast to other objects

Table 1 Data augmentation types and parameters
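As a rough sketch of the image-quality augmentations listed above, a torchvision pipeline might look as follows; the parameter ranges are assumptions, since Table 1’s exact values are not reproduced here.

```python
import torchvision.transforms as T

# Hypothetical parameter ranges; the actual values follow Table 1.
train_augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                        # random flip
    T.RandomAdjustSharpness(sharpness_factor=2, p=0.3),   # sharpness change
    T.ColorJitter(brightness=0.2, contrast=0.3,           # contrast adjustment
                  saturation=0.3, hue=0.05),              # saturation / color change
])
```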

2.1.2 Fish object detection based on improved YOLOv5

To accurately locate keypoints on fish bodies, we first detect individual fish and then find the keypoint locations. Deep learning-based object detection algorithms fall into two categories: one-stage and two-stage. Common one-stage detectors include YOLOv5, SSD, and RetinaNet; common two-stage detectors include R-CNN, Fast R-CNN, and Faster R-CNN. A real farming environment demands high detection speed. Although Faster R-CNN substantially improved on the original R-CNN and Fast R-CNN, its detection speed is still insufficient for real-time monitoring. The YOLO series improves detection speed while maintaining high accuracy, meeting the demand for real-time monitoring, and holds a substantial speed advantage over two-stage detectors. YOLOv5, a major version of the series, adaptively computes the best anchor box values for a given training set and improves image scaling at the input. In addition, the focus structure in the backbone network considerably reduces computation and memory usage. YOLOv5 is therefore suitable for large-scale deployment in scenarios with limited computational resources.

The YOLOv5 network comprises several components and structures. The CBL block includes a convolutional layer, a BN (batch normalization) layer, and a LeakyReLU activation function, while the Resunit comprises two CBLs and a shortcut link; together they form the basic structure of the network. YOLOv5 employs various CSP (cross-stage partial network) structures in the backbone and neck to enhance interaction between features. The focus structure uses slicing to reduce the resolution of the input image, thereby reducing the computational effort. The neck uses feature pyramid network (FPN) and path aggregation network (PAN) structures to achieve multiscale feature fusion, considerably improving small object detection. FPN is a network structure for solving multiscale problems in object detection; it effectively reduces computation and improves small object detection by adjusting the network hierarchy, as shown in Fig. 5. PAN improves the expressive ability of FPN by introducing a bottom-up feature fusion path; the structure of PAN is shown in Fig. 6.
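For reference, the focus slicing step can be sketched as follows: each 2 \(\times\) 2 spatial neighborhood is rearranged into the channel dimension before a CBL-style convolution, so the spatial resolution halves without discarding pixel information. This is a minimal PyTorch sketch rather than the exact YOLOv5 implementation.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into four pixel-interleaved sub-images, concatenate
    them along the channel axis, and fuse with a CBL-style convolution."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch * 4, out_ch, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),   # Conv + BN + LeakyReLU = CBL
        )

    def forward(self, x):
        # (B, C, H, W) -> (B, 4C, H/2, W/2): every 2x2 block becomes 4 channels,
        # so resolution halves without discarding any pixel information.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)
```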

Fig. 5 Structure of FPN. FPN is a network structure for solving multiscale problems in object detection

Fig. 6 Structure of PAN. PAN improves the expressive ability of FPN by introducing a bottom-up feature fusion path, improving small object detection without increasing the computational effort

PANet simply adds a bottom-to-top fusion path when performing feature fusion and does not consider the different contributions of different features. In a farming environment, fish species differ in color and size and favor different extracted features. To address this, EfficientDet’s BiFPN is used instead of PANet. BiFPN adds paths carrying contextual information on top of FPN, and each path carries a weight, achieving weighted multiscale feature fusion; its structure is shown in Fig. 7. BiFPN offers several improvements over PANet (a minimal sketch of one weighted fusion node follows the list below):

(1) BiFPN eliminates nodes with only one input edge, considerably simplifying the bidirectional network.

(2) When the input and output nodes are at the same level, BiFPN adds an extra edge so that more features can be incorporated.

(3) BiFPN treats each bidirectional pathway as a feature layer, achieving a higher level of feature fusion.
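As a minimal sketch under the EfficientDet formulation (fast normalized fusion), a single BiFPN fusion node can be written as follows; resizing the inputs to a common resolution is assumed to happen outside the module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion: each input feature map carries a learnable
    non-negative weight, normalized so that the contributions sum to one."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, features):
        # features: tensors already resized to a common resolution and width
        w = F.relu(self.weights)          # keep the weights non-negative
        w = w / (w.sum() + self.eps)      # normalize the contributions
        return sum(wi * fi for wi, fi in zip(w, features))
```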

Fig. 7 Structure of BiFPN. BiFPN adds paths with contextual information on top of FPN, and each path corresponds to a weight, realizing multiscale feature fusion

2.2 Locating fish keypoints

Once we have achieved individual fish object detection, we can locate the keypoints of each part of the fish. This section first introduces the definition of keypoints and then presents the implementation of fish keypoint detection.

2.2.1 Fish keypoints dataset

As shown in Fig. 8, most farmed fish have the following seven structures on which keypoints can be selected: mouth, eye, anterior end of the dorsal fin (dorsal fin1), posterior end of the dorsal fin (dorsal fin2), tail fin, anal fin, and pelvic fin. The parts represented by these 7 keypoints exist on almost all fish and span the whole body, making them representative of the fish’s posture when swimming and beneficial to the accuracy of subsequent reprojection calculations.
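For concreteness, the seven keypoints can be fixed as an ordered index set; the ordering below is an illustrative assumption, and any fixed ordering works as long as annotations and model outputs agree on it.

```python
# Hypothetical index order; the dataset annotations and the model output
# channels must use the same ordering.
FISH_KEYPOINTS = [
    "mouth",        # 0
    "eye",          # 1
    "dorsal_fin1",  # 2  anterior end of dorsal fin
    "dorsal_fin2",  # 3  posterior end of dorsal fin
    "tail_fin",     # 4
    "anal_fin",     # 5
    "pelvic_fin",   # 6
]
```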

Fig. 8 Keypoint definition. Seven keypoints are selected: mouth, eye, anterior end of the dorsal fin (dorsal fin1), posterior end of the dorsal fin (dorsal fin2), tail fin, anal fin, and pelvic fin

Fig. 9 Keypoint labeling. The object detection module outputs detection box coordinates; single fish images are cropped according to these coordinates and then annotated

We constructed a dataset of 2,000 images for this study, some taken at Hainan Chenhai Aquatic Co. and others from the Internet. First, the object detection module of the YOLOv5 network output the detection box coordinates. After cropping single fish images according to these coordinates, we used the open-source VGG Image Annotator to annotate the fish keypoints, obtaining 2,000 images with fish keypoint information. Representative labeled keypoint data are shown in Fig. 9.
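The detection-to-annotation handoff can be sketched as follows: each YOLOv5 box is padded slightly and used to crop a single-fish patch, which is resized to the keypoint network input. The padding ratio and function name are assumptions for illustration.

```python
import cv2

def crop_fish(image, box, pad=0.1, size=(288, 384)):
    """Crop a single fish from `image` given a detection box (x1, y1, x2, y2).
    A small padding margin keeps fins near the box edge inside the crop."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    dx, dy = (x2 - x1) * pad, (y2 - y1) * pad
    x1, y1 = max(int(x1 - dx), 0), max(int(y1 - dy), 0)
    x2, y2 = min(int(x2 + dx), w), min(int(y2 + dy), h)
    # cv2.resize takes (width, height): width 288, height 384.
    return cv2.resize(image[y1:y2, x1:x2], size)
```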

2.2.2 Locating fish keypoints

To accurately determine the locations of fish keypoints, improve detection efficiency, and reduce sensitivity to low-resolution images, we drew on the Lite-HRNet network, originally designed for locating keypoints on human bodies. We improve the network from two perspectives:

(1) Keypoints are assigned corresponding weights.

(2) More accurate keypoint regression is achieved.

Essentially, Lite-HRNet replaces the modules in HRNet with the Shuffle block from the lightweight backbone ShuffleNet (Zhang et al. 2018); the structure of the Shuffle block is shown in Fig. 10. Because HRNet exchanges information between parallel subnetworks using 1 \(\times\) 1 convolutions, traversing every point of the feature map becomes a computational bottleneck. Lite-HRNet therefore proposes conditional channel weighting (CCW) to replace the 1 \(\times\) 1 convolution, as shown in Fig. 10. In this study, Lite-HRNet-30 with an input scale of 384 \(\times\) 288 is used as the keypoint detection network to achieve good accuracy.
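The CCW idea can be sketched for a single branch as follows: channel weights are computed from globally pooled features with cheap pointwise operations and applied multiplicatively, avoiding a 1 \(\times\) 1 convolution over every spatial position. The actual Lite-HRNet conditions the weights on all parallel resolutions; this SE-style single-branch simplification is an assumption for brevity.

```python
import torch.nn as nn

class ChannelWeighting(nn.Module):
    """Compute per-channel weights from globally pooled features and rescale
    the map: the pointwise operations run on a 1x1 vector rather than on
    every spatial position of the feature map."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # (B, C, H, W) -> (B, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1),  # cheap bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # weights in (0, 1)
        )

    def forward(self, x):
        return x * self.fc(x)   # broadcast the (B, C, 1, 1) weights spatially
```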

Fig. 10 Lite-HRNet basic module structure. Lite-HRNet uses the Shuffle block from the lightweight backbone ShuffleNet instead of the module in HRNet

(1) Improved Lite-HRNet network based on weight distribution. The keypoint detection model must detect 7 keypoints on various parts of the fish body. In a real breeding environment, some fish have smaller pelvic fins and others narrower dorsal fins. If every keypoint carried the same weight, detection would be poor for parts with little pixel information, and during training the model would gradually favor points with richer pixel information. Therefore, this study treats the weight of each keypoint as a hyperparameter and assigns 1.5 times the normal weight to points with little pixel information, such as the eyes; when computing the loss, the loss of such keypoints is multiplied by an expansion coefficient greater than 1 (a minimal sketch of this weighted loss is given after this list). Farm managers can also adjust this parameter to the specific body morphology of a given fish species to improve the model’s detection performance for that species.

(2) Further improvement based on the distribution-aware coordinate representation of keypoints (DARK) (Zhang et al. 2019). When building the Lite-HRNet keypoint network, the single-fish images cropped by YOLOv5 in the previous section are uniformly resized to 384 \(\times\) 288; after the model’s fourfold downsampling, the feature map resolution is 96 \(\times\) 72. While downsampling lets the model learn deeper semantic features, it also introduces quantization error. As shown in Fig. 11, suppose a keypoint lies at (5, 5) in the original image; after fourfold downsampling its exact position is (1.25, 1.25), which is quantized to (1, 1) on the feature map. When the prediction is upsampled back to the original resolution, (1, 1) maps to (4, 4), a deviation of one pixel from the true position. This down-then-upsampling process thus biases the coordinates, and the magnitude of the bias grows with the sampling factor. The distribution-aware improvement is designed to solve this problem: the DARK method consists of an efficient Taylor-expansion-based coordinate decoding and an unbiased sub-pixel-centered coordinate encoding (Zhang et al. 2019).
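Returning to improvement (1), a minimal sketch of a per-keypoint weighted heatmap loss might look as follows; apart from the stated 1.5\(\times\) eye weight, the coefficients and the loss form (MSE over heatmaps) are assumptions.

```python
import torch

# Per-keypoint loss weights, ordered as in FISH_KEYPOINTS above. The eye
# receives the stated 1.5x expansion; all other coefficients are assumed.
KP_WEIGHTS = torch.tensor([1.0, 1.5, 1.0, 1.0, 1.0, 1.0, 1.0])

def weighted_heatmap_mse(pred, target, weights=KP_WEIGHTS):
    """pred, target: (B, K, H, W) heatmaps; weights: (K,) expansion factors."""
    per_kp = ((pred - target) ** 2).mean(dim=(0, 2, 3))   # per-keypoint MSE
    return (per_kp * weights.to(pred.device)).sum() / weights.sum()
```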

Fig. 11 Keypoint output error. The process of downsampling followed by upsampling causes a deviation in the coordinates whose magnitude is proportional to the sampling factor

DARK first assumes that the heatmap output by keypoint detection follows the two-dimensional Gaussian distribution shown below:

$$\begin{aligned} \mathcal {G}(x ; \mu , \Sigma )=\frac{1}{(2 \uppi )|\Sigma |^{\frac{1}{2}}} \exp \left( -\frac{1}{2}(x-\mu )^T \Sigma ^{-1}(x-\mu )\right) \end{aligned}$$
(1)

Taking the logarithm, we get:

$$\begin{aligned} \mathcal {P}(x ; \mu , \Sigma )=\ln (\mathcal {G})=-\ln (2 \uppi )-\frac{1}{2} \ln (|\Sigma |)-\frac{1}{2}(x-\mu )^T \Sigma ^{-1}(x-\mu ) \end{aligned}$$
(2)

Equation (2) is illustrated in Fig. 12. As the figure shows, the limited resolution shifts the detected keypoint from the true position \(\mu\) to the grid point m. The correct keypoint position is at the maximum \(\mu\), where the derivative is zero.

Differentiating Eq. (2) and evaluating at \(\mu\) yields Eq. (3).

$$\begin{aligned} \left. \mathcal {D}^{\prime }(x)\right| _{x=\mu }=\left. \frac{\partial \mathcal {P}^T}{\partial x}\right| _{x=\mu }=-\left. \Sigma ^{-1}(x-\mu )\right| _{x=\mu }=0 \end{aligned}$$
(3)
Fig. 12 Gaussian distribution of keypoints. The limited output resolution shifts the keypoint from the true position \(\mu\) to point m; the correct location is the extremum at \(\mu\), where the derivative is zero

Meanwhile, Eq. (2) is expanded as a second-order Taylor series around m:

$$\begin{aligned} \mathcal {P}(x)=\mathcal {P}(m)+\mathcal {D}^{\prime }(m)(x-m)+\frac{1}{2}(x-m)^T \mathcal {D}^{\prime \prime }(m)(x-m) \end{aligned}$$
(4)

Differentiating Eq. (4) gives:

$$\begin{aligned} \mathcal {P}^{\prime }(x)=\mathcal {D}^{\prime }(m)+\mathcal {D}^{\prime \prime }(m)(x-m) \end{aligned}$$
(5)

Substituting \(x=\mu\) into Eq. (5), where the derivative vanishes according to Eq. (3), we get:

$$\begin{aligned} \mathcal {P}^{\prime }(\mu )=\mathcal {D}^{\prime }(m)+\mathcal {D}^{\prime \prime }(m)(\mu -m)=0 \end{aligned}$$
(6)

Rearranging Eq. (6) gives:

$$\begin{aligned} \mu =m-\left( \mathcal {D}^{\prime \prime }(m)\right) ^{-1} \mathcal {D}^{\prime }(m) \end{aligned}$$
(7)

Equation (7) shows that \(\mu\) can be recovered from the first- and second-order derivatives of \(\mathcal {D}\) at m; from Eq. (2), the second-order derivative is given by Eq. (8).

$$\begin{aligned} \mathcal {D}^{\prime \prime }(m)=\left. \mathcal {D}^{\prime \prime }(x)\right| _{x=m}=-\Sigma ^{-1} \end{aligned}$$
(8)

Combining Eqs. (7) and (8), we obtain:

$$\begin{aligned} \mu =m-\left( \mathcal {D}^{\prime \prime }(m)\right) ^{-1} \mathcal {D}^{\prime }(m) \end{aligned}$$
(9)

Using Eq. (9), \(\mu\) can be obtained and the original keypoint coordinates corrected. Exploiting the distribution statistics of the heat map in this way yields more accurate final 2D keypoint coordinates. More importantly, the decoding requires only first- and second-order derivatives and thus incurs little computational cost. Because the distribution-aware method requires almost no structural changes to the original model, it can be migrated to most keypoint detection algorithms: at the prediction stage, the heat map output by the trained Lite-HRNet is simply passed to the decoder, which infers the most likely maximum activation position and thereby improves the accuracy of the model.
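As a hedged illustration of this decoding, the following sketch refines the integer argmax with central finite-difference estimates of the first- and second-order derivatives of the log heat map; Zhang et al. (2019) additionally smooth the heat map with a small Gaussian kernel before decoding, which is omitted here.

```python
import numpy as np

def dark_decode(heatmap):
    """Refine the integer argmax m of one keypoint heatmap to sub-pixel
    accuracy via Eq. (9): mu = m - D''(m)^{-1} D'(m), with the derivatives
    of the log heatmap estimated by central finite differences."""
    h, w = heatmap.shape
    logp = np.log(np.maximum(heatmap, 1e-10))
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    if not (1 <= x < w - 1 and 1 <= y < h - 1):
        return float(x), float(y)             # cannot refine at the border
    dx = (logp[y, x + 1] - logp[y, x - 1]) / 2.0             # first derivatives
    dy = (logp[y + 1, x] - logp[y - 1, x]) / 2.0
    dxx = logp[y, x + 1] - 2.0 * logp[y, x] + logp[y, x - 1]  # Hessian terms
    dyy = logp[y + 1, x] - 2.0 * logp[y, x] + logp[y - 1, x]
    dxy = (logp[y + 1, x + 1] - logp[y + 1, x - 1]
           - logp[y - 1, x + 1] + logp[y - 1, x - 1]) / 4.0
    hess = np.array([[dxx, dxy], [dxy, dyy]])
    if abs(np.linalg.det(hess)) < 1e-12:
        return float(x), float(y)             # degenerate curvature, keep m
    offset = -np.linalg.solve(hess, np.array([dx, dy]))      # -(D'')^{-1} D'
    return float(x + offset[0]), float(y + offset[1])
```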

3 Results and analysis

3.1 Fish object detection results and analysis

The model converged after 50 training epochs. Its performance was tested in the underwater aquaculture pond; the results are shown in Fig. 13. As the figure shows, the YOLOv5 model trained with copy-paste data augmentation achieves satisfactory fish detection and accurately locates the identifiable fish in the image.

Fig. 13 Object detection results. All panels show detection results after data augmentation; the model locates the fish bodies well

To verify that copy-paste augmentation improves detection in complex scenes, we further set up a control experiment with the same initialization parameters, training platform, training period, and base training dataset for the YOLOv5 model. As shown in Fig. 14, the control model, after converging and running on the test set, tends to overfit in complex scenes and misidentifies other objects as fish. After applying the copy-paste strategy, sensitivity to different categories in complex scenes improves.

Fig. 14 Comparison of the copy-paste data augmentation effect. The right panel shows the result after augmentation: object detection is more discriminative in complex scenes and less likely to misidentify other objects

In this study, mAP@0.5, mAP@0.75, and mAP@0.5:0.95 are used as indicators (Jiao et al. 2019) to evaluate the detection performance of the YOLOv5 model; the results are shown in Table 2. Baseline denotes training with only the official YOLOv5; +BC denotes training with the improved YOLOv5; +CP denotes adding copy-paste data augmentation to the baseline; and +BCP denotes adding the augmentation on top of +BC. The results in Table 2 show that the improved YOLOv5 and the copy-paste strategy considerably improve accuracy and are applicable to the farm environment.

Table 2 Improved YOLOv5 object detection performance evaluation. The detection performance of the YOLOv5 model was evaluated using mAP@0.5, mAP@0.75, and mAP@0.5:0.95 as metrics

3.2 Performance evaluation of fish keypoint detection model

The Lite-HRNet model is initialized with random values via He initialization (He et al. 2015) and converges after 50 training epochs. The test data are single-fish crops output by the improved YOLOv5. The prediction threshold is 0.5, and each keypoint is taken as the most confident pixel in its heatmap. As shown in Fig. 15, part of the anal fin of the crucian carp is occluded, yet it is still identified fairly accurately. As shown in Fig. 16, keypoint detection on images from real farm environments also achieves relatively good results.

Fig. 15 Examples of keypoint detection results. The figure shows recognition results for images in the dataset collected from the Internet

Fig. 16 Keypoint identification results for fish photographed in a real aquaculture environment

Some experimental results are not satisfactory, as shown in Fig. 17. When the fish body is of a single color, locating the keypoints of the eyes is difficult. When the fish body is distorted or occluded, the recognition rate of keypoints in various parts decreases. In scenes with too many fish, some keypoints are missed, resulting in substantial information loss. Future work could supplement missing keypoint information with statistical counting methods or combine results from adjacent frames for better estimates.

Fig. 17 Failure cases. Fish body color and occlusion of body parts both affect the accuracy of keypoint detection

The model framework in this paper is built on a high-resolution network and fuses features from multiple resolutions and layers. The evaluation metric is the object keypoint similarity (OKS) (Ronchi et al. 2017), defined as follows:

$$\begin{aligned} O K S_p=\frac{\sum _i \exp \left\{ -d_{p i}^2 / 2 S_p^2 \sigma _i^2\right\} \delta \left( v_{p i}=1\right) }{\sum _i \delta \left( v_{p i}=1\right) } \end{aligned}$$
(10)

In Eq. (10), \(d_{pi}\) is the Euclidean distance from a ground-truth keypoint to the predicted point, \(S_p\) denotes the object scale, \(\sigma _i\) is the normalization factor of the i-th keypoint, \(\delta\) is a function used to filter visible points, \(v_{pi}\) indicates whether the i-th keypoint of the p-th fish is visible, i is the keypoint id, and p is the id of the fish object instance.
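Equation (10) can be evaluated directly. A minimal NumPy sketch follows; the per-keypoint normalization factors \(\sigma _i\) are left as inputs, since their values for fish are not listed here.

```python
import numpy as np

def oks(pred, gt, visible, scale, sigmas):
    """Object keypoint similarity for one fish instance, per Eq. (10).

    pred, gt: (K, 2) predicted / ground-truth keypoint coordinates
    visible:  (K,) boolean mask, True where v_pi = 1
    scale:    object scale S_p (e.g., sqrt of the bounding-box area)
    sigmas:   (K,) per-keypoint normalization factors sigma_i
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)                  # squared distances
    sim = np.exp(-d2 / (2.0 * scale ** 2 * sigmas ** 2))   # per-keypoint terms
    return float(sim[visible].mean()) if visible.any() else 0.0
```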

Taking the average OKS as the keypoint detection indicator, the performance of different strategies is shown in Table 3. Both the weight assignment and DARK improve model accuracy, and the indicator increases clearly compared with the original network.

Table 3 Keypoint detection performance evaluation. Using the average OKS as the keypoint detection indicator, the performance of different strategies is shown in the table

4 Conclusions

This study addresses the problem of fish keypoint detection and locates seven predefined fish keypoints using a detection-regression scheme. Results show that the algorithm maintains good performance in complex underwater environments. The main conclusions of this study are as follows:

(1) The accuracy and recall rate of this method in identifying fish objects and their keypoints are 70.1\(\%\) and 73.4\(\%\), respectively, and the detection speed is 20 frames per second. The keypoints of fish can thus be acquired in real time from fish video streams.

(2) A comparison with current fish keypoint detection methods shows that the proposed method has clear advantages in efficiency and accuracy. It can provide technical support and insights for subsequent analyses of fish behavioral patterns and related studies.

Although the method presented here performs well on the datasets in this study, it can be improved further. The object detection and keypoint detection modules are separate stages of the proposed pipeline, and they are trained on distinct datasets. Future work could integrate the two modules into a single pipeline for end-to-end detection, improving the overall training efficiency of the model.