A detection-regression based framework for fish keypoint detection

Applying computer vision technology in aquaculture can improve the efficiency of fish detection and health monitoring as well as optimize aquaculture management and profit. Keypoints on fish bodies are important biological indicators that can be used to calculate the individual size, mass, and behavior. However, only a few relevant studies have been conducted in this regard, and they mainly focus on detecting keypoints for stereo matching. Traditional keypoint detection methods exhibit low efficiency, poor accuracy, and weak robustness in underwater environments. Accordingly, this study proposes a new method based on object detection and point regression models to locate fish keypoints. First, individual fish are detected by employing a commonly used object detection model, YOLOv5. The detection accuracy is further improved by enhancing the network neck. In the second stage, a deep learning model for locating fish keypoints is constructed by implementing weight allocation and distribution-aware strategy in the matched left and right bounding boxes to improve on the previous work of Lite-HRNet, which was originally designed for capturing human body keypoints. The experimental results show that the proposed method can effectively detect individual underwater fish and accurately estimate the keypoints. The source code and the labeled datasets for fish detection and keypoint location are provided. The code is available at https://github.com/oucvisionlabsanya/fish_keypoint_detection.git.


Introduction
Fish are a crucial part of people's diets owing to their tender flesh, high-quality protein, and other nutrients. However, fish aquaculture suffers from a low level of automation and makes little use of artificial intelligence. A substantial amount of manual intervention is constantly required in key aspects of fish aquaculture such as feeding and health monitoring. This manual approach cannot monitor fish growth in real time, and the information obtained lags behind, making timely adjustments to aquaculture procedures difficult. Computer vision technology has been widely used in recent years. The use of intelligent vision methods in fish aquaculture not only saves human and material resources but also provides more timely, accurate, and stable data, which considerably improves production efficiency and management. Capturing images in the aquaculture pond in real time through underwater cameras enables contactless collection of fish data, which can reduce the stress and physical damage to the fish population caused by traditional methods of data collection. Over an aquaculture period that can span several years of monitoring, fish farms can thus save considerably on labor costs and achieve better economic benefits.
Keypoints on fish bodies, such as certain points on the mouth, eye, anterior end of dorsal fin, posterior end of dorsal fin, and tail fin, are important biological indicators and can be used to calculate individual size, mass, and behavior. Therefore, developing accurate keypoint detection methods for fish bodies is necessary. With the development of deep learning technology and the deepening of neural network models, the accuracy and efficiency of object detection have been considerably improved. This enables more convenient and accurate services and facilitates a better understanding of morphological features via recognition, location, and classification of objects in images or videos. However, locating fish keypoints has been a challenging problem, and few related studies are available in the literature. Recent studies have mainly focused on detecting keypoints for stereo matching (Suo et al. 2020; Lin et al. 2021). The large variability in fish morphology and the influence of factors such as light and shading make accurately locating fish keypoints difficult. Solving this problem can provide considerable benefits for fish production and conservation.
Currently, studies on fish keypoint detection remain scarce. Keypoint detection is more commonly applied in the fields of facial recognition and human pose estimation. Therefore, borrowing ideas from human pose or keypoint estimation methods can be useful with respect to fish keypoint detection. Traditional keypoint detection is achieved via feature operator filtering, template matching (Li et al. 2021), and corner point detection (Zhang et al. 2007). Yang et al. (2020) used machine learning technology to achieve fast detection of three-dimensional surface keypoints, which substantially improved the detection efficiency and accuracy. Wu and Guan (2019) successfully constructed an efficient face recognition model using face keypoint information combined with Long Short-Term Memory. Toshev and Szegedy (2014) conducted the first study on single-object two-dimensional (2D) human pose and proposed a multistage, iterative, deep convolutional neural network structure based on coordinate regression. Using the network for feature extraction of keypoints followed by coordinate regression, iterations were performed to obtain the locations of keypoints and 2D pose estimation. However, because this method directly regresses the spatial locations of keypoints, model convergence becomes difficult. Consequently, the performance cannot be guaranteed, and the obtained results are difficult to migrate to other scenarios. Tompson et al. (2014) proposed a multistage convolutional neural network combined with a 2D pose estimation method and a Markov random-field model. This model uses a multistage cascade structure to extract features from keypoints and regress the feature information to obtain the location of keypoints. However, it still exhibits poor generalization performance due to coordinate regression.
The abovementioned two-dimensional human pose estimation algorithms primarily use supervised training for feature extraction of keypoints. A high-quality keypoint dataset is required owing to differences in the spatial distribution and number of keypoints for different objects. Moreover, migrating the learning by directly using other methods is difficult. When relevant training samples are created, the keypoints of the objects should be identified and labeled according to the spatial requirements. For rigid objects such as fixed-wing vehicles, a strong and stable correlation between the keypoints and pose is observed. For nonrigid objects, such as human bodies and fish, there are more keypoints, which can move freely. Therefore, the pose structure of nonrigid objects is variable. Consequently, the requirements for the size of the dataset and the benchmark of keypoint labeling when making training samples are highly demanding. Currently, keypoint training samples for posture predominantly target the human body. Papandreou et al. (2017) constructed a heat map-based regression model based on the object bounding box output from Faster R-CNN and obtained more accurate results. Newell et al. (2016) proposed a stacked hourglass network, in which feature extraction is achieved at different resolutions through up- and down-sampling, and keypoint classification heat maps are output at the last level. The image-to-image fully convolutional network of UNet (Ronneberger et al. 2015) can also be used for keypoint detection. HRNet was proposed by Microsoft Research Asia (Sun et al. 2019); it changed the convolutional network from a series structure to a parallel structure for the first time and considerably improved the ability of the network to extract information from feature maps with different resolutions. Bulat et al. (2020) found that some layers in a keypoint detection network do not need to be residual-linked and proposed adding connection weights to the network layers to determine, via learning, whether residual linking is needed. There are many methods and models for single-object detection, but fish in a breeding pond often cluster, resulting in many overlapping regions between objects. Therefore, detecting the keypoints of the main objects of interest becomes difficult.
Considering the aforementioned problems, this study proposes a detection-regression-based scheme for locating fish keypoints. The proposed approach employs an optimized object detection algorithm (YOLOv5) and a human pose estimation network (Lite-HRNet) for locating fish keypoints; by improving the YOLOv5 neck network and loss function, more features are incorporated without increasing the computational effort. The YOLOv5 model has the advantages of high accuracy and efficiency, while the Lite-HRNet model can effectively reduce the model size while maintaining high accuracy and efficiency. By improving Lite-HRNet with weight assignment and a distribution-aware strategy, a deep learning model for fish keypoint detection is constructed, which enables the model to identify keypoints efficiently and accurately.

Fish keypoint detection
This section introduces the deep learning-based framework for locating fish keypoints. Figure 1 shows the overall structure of the framework. We also introduce the two datasets used for training the two stages of our models.

Fish object detection

Fish detection dataset with augmentation
During the object detection phase, fish are the only objects to be detected. Our study was conducted in a marine aquaculture environment, and a single label, fish, was used when creating the YOLO-format dataset. Owing to the complex aquaculture environment, 3,000 fish images were obtained through manual photography and web search and used as the dataset, as shown in Fig. 2. These data were partitioned into a training set and test set with a ratio of 4:1.
Using deep learning methods for object detection, the proposed scheme has demanding requirements for datasets, including the number of data samples and the diversity of objects.We can achieve more satisfactory training results only by constructing diverse datasets.The high cost of acquiring farmed fish datasets makes it difficult to acquire datasets that are as good as those produced by professional teams, particularly in terms of the richness of the scenes.Therefore, only samples from different breeding pools of a single farm are used.Hence, considering the size and quality of the dataset, it is necessary to use data augmentation techniques to increase the quantity and richness of the data during the training phase to ensure that the model can achieve better results.
As shown in Fig. 2, the environment within the breeding ponds is relatively simple and generally contains only a near-uniform background and a single class of objects (fish). Directly using the images captured in the breeding pond as the input to YOLOv5 results in a model that performs well in a single scene but degrades severely when the testing scene becomes complex. Furthermore, because there is only one type of label, the model is not sensitive to the distinction between other objects and fish. Consequently, it can produce false positives once objects that are not in the dataset appear.
To ensure the robustness of YOLOv5 object detection and to address the challenge of object distinction insensitivity, the copy-paste strategy (Ghiasi et al. 2020) can be used for data enhancement to introduce objects from other scenarios.Simultaneously, the number and richness of the dataset can be expanded.
The core idea of the copy-paste method is to take instances from the original image and paste them into another image to address the occlusion problem in semantic segmentation by artificially and intentionally creating overlaps and using such sets for training.
In this study, 100 images of various scene styles are collected in addition to moving the object to a certain extent, as shown in Fig. 3.As an additional dataset background, 20% of the samples are randomly selected for cutting and superimposing fish images into new scenes (Fig. 4) so that they are contrasted against other objects in the scene.Thus, the algorithm can learn the difference between the fish and other objects in complex scenes and improve the feature recognition capability of the model.In addition to the traditional methods, we use some image quality augmentation methods, such as random flip, sharpness change, contrast adjustment, saturation change, and color change to improve the quality of modeling, as shown in Table 1.
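The copy-paste augmentation described above can be sketched as follows. The function name, the boolean-mask format, and the placement logic are illustrative assumptions, not the exact implementation of Ghiasi et al. (2020):

```python
import numpy as np

def copy_paste(fish_img, fish_mask, background, x, y):
    """Paste a masked fish crop onto a background scene at (x, y).

    fish_img: HxWx3 uint8 crop of a fish; fish_mask: HxW boolean mask
    marking the fish pixels inside the crop; background: larger
    HxWx3 uint8 scene image. Returns a new augmented image.
    """
    h, w = fish_img.shape[:2]
    out = background.copy()
    region = out[y:y + h, x:x + w]          # view into the paste region
    region[fish_mask] = fish_img[fish_mask]  # overwrite only fish pixels
    return out
```

In a full pipeline the pasted fish would also be randomly scaled, flipped, and jittered in position, and the corresponding bounding-box label would be added to the target image's annotations.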

Fish object detection based on improved YOLOv5
To accurately locate keypoints on fish bodies, we first need to detect individual fish and then find the keypoint locations; YOLOv5 is used for this first stage. Its focus structure uses slicing techniques to reduce the resolution of the input image, thus reducing the computational effort. In addition, the neck part uses feature pyramid network (FPN) and path aggregation network (PAN) structures to achieve multiscale feature fusion, considerably improving the efficiency of small object detection. FPN is a network structure for solving multiscale problems in object detection. It can effectively reduce the computational effort and improve small object detection performance by adjusting the network hierarchy, as shown in Fig. 5. PAN improves the expressive ability of FPN by introducing a bottom-up feature fusion path; the structure of PAN is shown in Fig. 6.
PANet simply adds a bottom-up fusion path when performing feature fusion and does not consider the different contributions of different features. In the farming environment, there are color and size differences among fish species and different preferences for the extracted features. To address this problem, the BiFPN of EfficientDet is used instead of PANet. BiFPN adds paths carrying contextual information on top of FPN, and each path is assigned a weight to achieve multiscale feature fusion. The structure diagram is shown in Fig. 7. BiFPN exhibits considerable improvements and advantages over PANet: (1) BiFPN eliminates nodes with only one input edge, considerably simplifying the bidirectional network. (2) When the input and output nodes are at the same level, BiFPN can incorporate more features by adding an extra edge. (3) BiFPN treats each bidirectional pathway as a feature layer, which achieves a higher level of feature fusion.
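The per-path weighting at the heart of BiFPN can be illustrated with EfficientDet's "fast normalized fusion": each input feature map gets a learnable non-negative weight, and the weights are normalized so they sum to approximately one. A minimal NumPy sketch for same-shape inputs (real BiFPN layers also resize feature maps and apply a convolution after fusing, which is omitted here):

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shape feature maps with learnable non-negative weights.

    features: list of arrays with identical shape; weights: one scalar
    per input. Weights pass through ReLU so each contribution stays
    non-negative, then are normalized: O = sum_i (w_i / (eps + sum_j w_j)) * I_i.
    """
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)  # ReLU
    norm = w / (eps + w.sum())                                  # ~sum to 1
    return sum(wi * f for wi, f in zip(norm, features))
```

The small eps keeps the division stable when all weights are near zero; in training, the weights are parameters updated by backpropagation, letting the network learn how much each resolution contributes.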

Locating fish keypoints
Once we have achieved individual fish object detection, we can locate the keypoints of each part of the fish.This section first introduces the definition of keypoints and then presents the implementation of fish keypoint detection.

Fish keypoints dataset
As shown in Fig. 8, most farmed fish have the following seven structures on which keypoints can be selected, including mouth, eye, anterior end of dorsal fin (dorsal fin1), posterior end of dorsal fin (dorsal fin2), tail fin, anal fin, and pelvic fin.The parts represented by these 7 keypoints exist on almost all fish, and they can encompass the whole body of the fish, which is more representative of the fish's posture when swimming and is also beneficial to the accuracy of the subsequent reprojection calculations.
We constructed a dataset containing 2,000 images for this study, some of which were taken at Hainan Chenhai Aquatic Co. and others from the Internet. First, the coordinates of the bounding boxes were output by the object detection module of the YOLOv5 network. After cropping according to these coordinates to obtain single images of fish bodies, we used the open-source annotation tool VGG Image Annotator to annotate the fish keypoints, obtaining 2,000 images containing keypoint information for the fish body. Representative labeled keypoint data are shown in Fig. 9.
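The cropping step between the two stages can be sketched as below. This is a hypothetical helper, not the authors' code; it assumes YOLO-format boxes (center x, center y, width, height, all normalized to [0, 1]) and adds a small margin so fins near the box edge are not cut off:

```python
def crop_detection(image, box, pad=0.1):
    """Crop one detected fish from a frame.

    image: HxWxC array; box: YOLO-format (cx, cy, w, h) normalized to
    [0, 1]; pad: fractional margin added around the box.
    """
    H, W = image.shape[:2]
    cx, cy, w, h = box
    w, h = w * (1 + pad), h * (1 + pad)
    x0 = max(round((cx - w / 2) * W), 0)   # clamp to image bounds
    y0 = max(round((cy - h / 2) * H), 0)
    x1 = min(round((cx + w / 2) * W), W)
    y1 = min(round((cy + h / 2) * H), H)
    return image[y0:y1, x0:x1]
```

Each crop is then resized to the keypoint network's input scale (384 × 288 in this study) before annotation or inference.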

Locating fish keypoints
To accurately determine the locations of fish keypoints and improve the efficiency of keypoint detection while reducing sensitivity to low-resolution images, we drew inspiration from the Lite-HRNet network, which was originally designed for locating keypoints on human bodies. However, we improve the network from two perspectives: (1) keypoints are assigned corresponding weights, and (2) a distribution-aware decoding strategy (DARK) is adopted.
Essentially, Lite-HRNet replaces the modules in HRNet with the Shuffle block from the lightweight backbone network ShuffleNet (Zhang et al. 2018). The structure of the Shuffle block is shown in Fig. 10. Since HRNet performs information exchange between parallel subnetworks using 1 × 1 convolutions, the traversal calculation over each point of the feature map becomes a computational bottleneck of the network. Therefore, Lite-HRNet proposes conditional channel weighting (CCW) to replace the 1 × 1 convolution, as shown in Fig. 10. In this study, Lite-HRNet-30, based on an input scale of 384 × 288, is used as the keypoint detection network to achieve good accuracy.
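The efficiency argument behind CCW can be illustrated with a simplified, SE-style per-channel gate: weights are predicted from globally pooled channel statistics and applied elementwise, costing O(C) per pixel instead of the O(C²) of a 1 × 1 convolution. This is a deliberate simplification; Lite-HRNet's actual CCW additionally conditions the weights on all parallel resolutions, which is omitted here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_weighting(feature, w1, w2):
    """Element-wise channel weighting in place of a 1x1 convolution.

    feature: CxHxW map. Per-channel gates are predicted from globally
    pooled statistics by a tiny two-layer bottleneck (w1: C -> C//r,
    w2: C//r -> C) and then multiplied onto the channels.
    """
    pooled = feature.mean(axis=(1, 2))        # global average pool: (C,)
    hidden = np.maximum(w1 @ pooled, 0.0)     # ReLU bottleneck
    gate = sigmoid(w2 @ hidden)               # per-channel weight in (0, 1)
    return feature * gate[:, None, None]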
(1) Improved Lite-HRNet network based on weight distribution. The keypoint detection model in this paper needs to detect 7 keypoints on various parts of the fish body. In a real breeding environment, some fish may have smaller pelvic fins, and some may have narrower dorsal fins. Consequently, if the keypoints of each part had the same weight, detection performance would be poor for the parts with less pixel information. Therefore, keypoints are assigned corresponding weights so that parts with less pixel information are emphasized during training.

(2) Improved keypoint decoding based on the distribution-aware strategy (DARK). First of all, DARK assumes that the heat map output for a keypoint conforms to a two-dimensional Gaussian distribution:

G(x; \mu, \Sigma) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right)    (1)

Taking the logarithm, we get:

D(x) = \ln G(x; \mu, \Sigma) = -\ln(2\pi) - \frac{1}{2}\ln|\Sigma| - \frac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)    (2)

Drawing Eq. (2) as in Fig. 12, it can be seen that limiting the resolution shifts the observed maximum from the true keypoint position \mu to the integer position m. The correct keypoint position is at the maximum \mu, where the derivative is 0:

D'(x)\big|_{x=\mu} = 0    (3)

At the same time, Eq. (2) is expanded around m using the Taylor series to get:

D(\mu) \approx D(m) + D'(m)^{\top}(\mu - m) + \frac{1}{2}(\mu - m)^{\top} D''(m)(\mu - m)    (4)

Eq. (4) is differentiated to obtain:

D'(\mu) = D'(m) + D''(m)(\mu - m)    (5)

Substituting Eq. (3) into Eq. (5), we get:

D'(m) + D''(m)(\mu - m) = 0    (6)

Transforming Eq. (6), we get:

\mu = m - \left(D''(m)\right)^{-1} D'(m)    (7)

It can be seen from Eq. (7) that \mu can be obtained from the first-order and second-order derivatives of D at m, and the second-order derivative follows directly from Eq. (2):

D''(m) = -\Sigma^{-1}    (8)

Arranging the above formulas, we get:

\mu = m + \Sigma\, D'(m)    (9)

\mu can be obtained using the above formula, and the original keypoint coordinates can be corrected. Exploring the distribution statistics in the heat map in this way yields more accurate final 2D keypoint coordinates. More importantly, this decoding method only requires the calculation of the first-order and second-order derivatives and, thus, does not involve a particularly high computational effort. As this improved distribution-aware method requires almost no structural changes to the original model, it can also be migrated to most keypoint detection algorithms. In particular, it is only necessary to input the heat map predicted by the trained Lite-HRNet into the decoding method at the keypoint prediction stage, and the potential maximum activation position can be inferred, thus improving the accuracy of the model.
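The distribution-aware decoding step can be sketched as follows: take the log of the heat map, then shift the integer argmax m by the Newton step derived above, with the first- and second-order derivatives estimated by finite differences. This is a minimal sketch of the DARK idea, not the reference implementation (which additionally smooths the heat map first):

```python
import numpy as np

def dark_decode(heatmap):
    """Refine the argmax keypoint coordinate to sub-pixel accuracy.

    Applies mu = m - D''(m)^{-1} D'(m) on the log heat map, where the
    gradient and Hessian at the integer maximum m are estimated with
    central finite differences. Returns (x, y) in heat-map pixels.
    """
    h = np.log(np.maximum(heatmap, 1e-10))
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    if 1 < x < h.shape[1] - 2 and 1 < y < h.shape[0] - 2:
        dx = 0.5 * (h[y, x + 1] - h[y, x - 1])
        dy = 0.5 * (h[y + 1, x] - h[y - 1, x])
        dxx = h[y, x + 1] - 2 * h[y, x] + h[y, x - 1]
        dyy = h[y + 1, x] - 2 * h[y, x] + h[y - 1, x]
        dxy = 0.25 * (h[y + 1, x + 1] - h[y + 1, x - 1]
                      - h[y - 1, x + 1] + h[y - 1, x - 1])
        hess = np.array([[dxx, dxy], [dxy, dyy]])
        if np.linalg.det(hess) != 0:
            offset = -np.linalg.solve(hess, np.array([dx, dy]))
            return float(x + offset[0]), float(y + offset[1])
    return float(x), float(y)
```

For a heat map that is exactly Gaussian, the log is quadratic, so this single Newton step recovers the true sub-pixel maximum; the same correction is applied independently to each of the seven fish keypoints.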

Fish object detection results and analysis
The model converged after 50 training epochs.
The performance of the model was tested in the underwater aquaculture pond; the results are shown in Fig. 13. As can be seen in Fig. 13, the YOLOv5 model trained with copy-paste data augmentation achieves satisfactory fish detection results and can accurately locate the identifiable fish in the image. After enhancement, object detection is more sensitive to different types of fish in complex scenes and less likely to misidentify other objects. We further set up a control-group experiment with the same initialization parameters, training platform, training period, and training dataset for the YOLOv5 model, to verify that copy-paste-enhanced training data improve the detection ability of the model in complex scenes. As shown in Fig. 14, the control group overfits in complex scenes, misidentifying other objects as fish after training reaches convergence and the model runs on the test set. After applying the copy-paste strategy, the sensitivity against different categories in complex scenes is improved.
In this study, mAP@0.5, mAP@0.75, and mAP@0.5:0.95 are used as indicators (Jiao et al. 2019) to evaluate the detection performance of the YOLOv5 model. The results presented in Table 2 show that the improved YOLOv5 and copy-paste strategies can considerably improve the accuracy and are applicable to the farm environment.

Performance evaluation of fish keypoint detection model
The Lite-HRNet model is assigned random initial values via He initialization (He et al. 2015). It converges after 50 rounds of training. The test data consist of local images of single fish, output and cropped by the improved YOLOv5. The prediction threshold is 0.5, and the result for each keypoint is the most confident pixel in the heat map. As shown in Fig. 15, part of the anal fin of the crucian carp above is covered, but it can still be identified relatively accurately. As can be seen in Fig. 16, keypoint detection on images from real farm environments achieved relatively good results. Some experimental results are not satisfactory, as shown in Fig. 17. When the fish body is of a single color, determining the distribution of keypoints around the eyes is difficult. When the identified fish body is distorted or obstructed, the recognition rate of keypoints in various parts decreases. In a scene with too many fish, some fish keypoints are missing, resulting in substantial information loss. For future work, we can consider supplementing the missing keypoint information with statistical methods or combining the results of preceding and subsequent frames to make better estimations.
The model framework in this paper is built on the basis of a high-resolution network. It integrates multiple features of different resolutions and layers within the network. The model evaluation metric uses object keypoint similarity (OKS) (Ronchi et al. 2017), defined as follows:

\mathrm{OKS}_p = \frac{\sum_i \exp\!\left(-d_{pi}^2 / (2 S_p^2 \sigma_i^2)\right)\, \delta(v_{pi} = 1)}{\sum_i \delta(v_{pi} = 1)}    (10)

In Eq. (10), d_{pi} represents the Euclidean distance from a ground-truth keypoint to the predicted point, and S_p denotes the object scale. \sigma_i represents the normalization factor of the i-th keypoint, \delta is a function used to filter visible points, v_{pi} indicates whether the i-th keypoint of the p-th fish is visible, i is the id of the keypoint, and p is the id of the fish object instance.
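The metric in Eq. (10) can be computed directly from its definition. A minimal sketch for one fish instance (the choice of per-keypoint sigmas and of the scale convention, e.g. the square root of the bounding-box area, is up to the evaluator):

```python
import numpy as np

def oks(pred, gt, visible, scale, sigmas):
    """Object keypoint similarity for one fish instance.

    pred, gt: (K, 2) arrays of predicted / ground-truth keypoints;
    visible: (K,) boolean mask (the delta term in Eq. (10));
    scale: object scale S_p; sigmas: (K,) per-keypoint normalization
    factors sigma_i.
    """
    d2 = ((pred - gt) ** 2).sum(axis=1)  # squared distances d_pi^2
    sim = np.exp(-d2 / (2 * scale ** 2 * np.asarray(sigmas) ** 2))
    v = np.asarray(visible, dtype=bool)
    if not v.any():
        return 0.0
    return float(sim[v].mean())          # sum over visible / count visible
```

OKS equals 1 for a perfect prediction and decays toward 0 as keypoints drift, with the sigmas controlling how strictly each keypoint is penalized relative to the object scale.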
Taking the average OKS as the keypoint detection index, the performance of the different strategies is shown in Table 3. It can be seen that both the weight distribution and DARK improve the accuracy of the model, with an obvious increase in the indicator value as compared to the original network.

Conclusions
This study addresses the problem of fish keypoint detection and locates seven predefined fish keypoints based on a detection-regression scheme. Results show that this algorithm maintains good performance in complex underwater environments. The main conclusions of this study are as follows: (1) The accuracy and recall rate of this method in identifying fish objects and their keypoints are 70.1% and 73.4%, respectively, and the model runs at 20 frames per second. Thus, fish keypoints can be acquired in real time from fish video feeds. (2) A comparison with current fish keypoint detection methods shows that the proposed method has clear advantages in terms of efficiency and accuracy. This can provide technical support and insights for subsequent analysis of fish behavioral patterns and related studies.
Although the method presented in this paper performs well on the datasets in this study, it can be improved further. The object detection module and the keypoint detection module are separate in the proposed framework, and the datasets for training the two modules are distinct. Future endeavors could consider integrating the two modules into a single pipeline to realize end-to-end detection, thus improving the overall training efficiency of the model.

Fig. 1 Framework of the study. After applying the optimized YOLOv5 object detection algorithm, an improved Lite-HRNet network is applied to locate the fish keypoints

Fig. 3 Partial scene diagram. Hundreds of images of various types of scene styles were collected as additional dataset backgrounds

Fig. 4 Scene graph processed via copy-paste. A randomly selected 20% sample of fish is cropped and superimposed onto new scenes, placing the fish in different scenes so that they are in clear contrast to other objects in the scene

Fig. 5 Structure of FPN. FPN is a network structure for solving multiscale problems in object detection

Fig. 10 Lite-HRNet basic module structure. Lite-HRNet uses the Shuffle block from the lightweight backbone network ShuffleNet instead of the module in HRNet. The structure of the Shuffle block is shown above

Fig. 13 Object detection results. The figures all show the detection results after data enhancement. The model obtains better detection results and can better locate the fish body

Fig. 15 Examples of keypoint detection results. The figure shows the recognition results for images from the network in the dataset

Fig. 16 Results of keypoint identification for fish taken from a real aquaculture environment

Table 1
Data enhancement types and parameters

Table 2
Improved YOLOv5 object detection performance evaluation. mAP@0.5, mAP@0.75, and mAP@0.5:0.95 are used to evaluate the detection performance of the YOLOv5 model. Baseline indicates that only the official version of YOLOv5 is used for training; +BC indicates training with the improved YOLOv5; +CP indicates data-enhancement training added to the baseline; +BCP indicates data-enhancement training added to +BC

Table 3
Keypoint detection performance evaluation. Using the average OKS as the keypoint detection indicator, the performance of the different strategies is shown in the table