Research on fish identification in tropical waters under unconstrained environment based on transfer learning

Video of the seafloor in an unconstrained environment is shot in uncontrolled open sea areas. Multiple backgrounds, complex illumination and weather changes, and the rapid growth of algae attached to the lens all degrade the stability of video quality and make image recognition difficult. At present, no single algorithm is generally superior to the others, so a model must be built for the specific scene and application. In this paper, a transfer-learning-based method for identifying fish in tropical waters under an unconstrained environment is proposed. First, the images are pre-processed by affine transformation to achieve data enhancement. Then, a ResNet50 deep convolutional neural network is constructed based on transfer learning, and fish recognition performance is compared before and after transfer learning. The results show that the accuracy and loss indicators are better than those obtained without transfer learning when a model pre-trained on ImageNet provides the initial weights of the network. After the model has been trained for 150 epochs, the indicators begin to converge, and the model can better complete the fish identification task in tropical waters under an unconstrained environment.

China is one of the major marine countries in the world, with a sea area of more than 3 million square kilometers and rich marine resources. However, compared with developed countries, the development and utilization of marine resources in China still needs to be improved. Fish identification is the first step in the investigation of marine fish resources and an important basis for the effective development and utilization of marine biological resources. Tropical fish are usually brightly colored and are among the most active marine organisms. Their growth status, species richness, and movement trajectories can directly or indirectly reflect the biodiversity of the waters. By analyzing fish resources, scientists can also understand the underwater environment, the behavior of different marine animals and their mutual influence. Thus, the classification and identification of fish in tropical waters has important strategic significance for the effective development and utilization of fishery resources and for marine environmental protection in China.
In the past, marine fish resources were often monitored by field sampling (Li et al. 2011), fishing (Hodgson 2001) and hydroacoustic telemetry. These methods have shortcomings such as dependence on expert experience, destruction of biological resources and the inability to determine fish species (Yang 2015). Thanks to improvements in data processing technology and in evaluation models and methods, the monitoring and evaluation of marine biological resources has entered an era of precise, non-destructive assessment (Zhang and Zhu 2017). The establishment of an underwater video deep learning system will be the most effective method for intelligent monitoring of marine fish resources.
As a non-destructive technique, computer vision capture and analysis can maintain the integrity of marine biological resources and cause the least environmental disturbance during monitoring, and it is widely used in marine biological monitoring research. Seabed images are acquired by photography and video, and the characteristics of marine organisms are studied by image processing (Evans 2003; Hsiao et al. 2013; Zhang and Zhu 2018; Zhang et al. 2019, 2020). Compared with traditional methods, using computer technology for marine biological monitoring is automatic, non-invasive, economical and effective. Although image-based target detection has been widely studied in recent years, image-based target detection in an unconstrained seabed environment still faces great challenges. Because fish move quickly and freely, and weather, illumination and other external factors change frequently (Toh et al. 2009), the video images have low contrast and blurred texture details, which makes it harder to train a network to extract image features.

Fish image recognition algorithm
The basic process of fish image recognition is to input the fish image, select fish features, build a classifier, and perform classification and recognition. Early methods relied too heavily on hand-crafted features, which depend mainly on expert prior knowledge and detect fish through contour, shape, color and other features (Zion et al. 2008; Chomtip et al. 2016; Alsmadi et al. 2016). The number of parameters allowed in feature design is fairly limited, and it is difficult to obtain high-level features accurately (Hsiao et al. 2013), which affects the accuracy of the classification and recognition results. With the development of artificial intelligence technology, deep learning has achieved great success in the field of computer vision, and traditional machine learning methods are gradually being replaced by methods based on deep learning. Deep learning can automatically learn feature representations from big data, using models that contain thousands of parameters. A deep neural network maps data from pixel-level, low-dimensional features to semantic-level, high-dimensional features, which gives it prominent advantages in extracting the global features and context information of an image (Yu et al. 2013) and brings new ideas for solving traditional computer vision problems (such as image segmentation and key point detection).
Since 2012, neural network models, especially those represented by the Convolutional Neural Network (CNN), have been widely used (Frederic et al. 2015; Xu et al. 2015; Li et al. 2019). The CNN is a popular deep learning architecture composed of alternating convolution and sub-sampling layers. Because it can extract complex, high-dimensional image features, it has proved effective in many computer vision applications, such as segmentation (Girshick et al. 2014; Ciresan et al. 2012), object recognition and classification (Szegedy et al. 2013), and target tracking (Zhou et al. 2015). In 2017, Google and NASA used CNN technology to find a new planet in the Kepler-90 system from massive amounts of observation data. Frederic et al. (2015) also used a CNN to find endangered manatees in hundreds of thousands of satellite images. Spampinato et al. (2012) proposed an algorithm for tracking fish in an unconstrained environment, in which the detected object was described by a covariance model and the covariance matrix of the current object was compared with that of the candidate object. The trajectory was generated by template matching, and the average accuracy was about 80%, which leaves room for improvement. To address these problems, Chuang et al. (2015) proposed a tracking algorithm based on deformable multiple kernels. Using a mean shift algorithm on color and texture features, the motion of the kernels was estimated effectively and fish tracking was achieved. However, specific test results were not given in that paper.
On this basis, Jager et al. (2017) realized fish tracking in an unconstrained environment by combining CNN activations with a two-stage graph method, which has high tracking accuracy and efficiency. However, the 14 hyperparameters of the model cannot be adjusted automatically, which reduces the portability of the model. Wang et al. (2017) customized a CNN for the head image of each fish in the video frames and adaptively updated the CNN parameters during tracking to adapt to slight changes in illumination and fish appearance. This method has been shown to recognize and track zebrafish well, but the experiment is still limited to an aquarium environment.

Data sources
Because artificial fish reefs have had a good restorative effect on the marine resources of Wuzhizhou Island, the fish population in this area is relatively dense and the resources are relatively abundant. We therefore selected for monitoring the sea areas at Wuzhizhou Island where artificial reefs had been placed at an early stage, in order to identify more tropical fish in the unconstrained marine environment. Since 2017, the project team has selected five artificial reefs, as presented in Fig. 1, to film and observe marine fish around Wuzhizhou Island in Sanya City, Hainan Province. The location, water depth and artificial reef type of these five monitoring points are shown in Table 1.
An underwater electronic cabin is deployed at each of the five observation points, and each cabin holds two photoelectric HB-HM-02 underwater cameras that acquire data from different angles. The video resolution is 648 × 480 (width × height) and the frame rate is 50. At present, all five sets of equipment have been deployed and are operating well, providing continuous, uninterrupted recording and transmission of video data and water quality monitoring data.
The ten cameras generate more than 150 MB of video data per minute (the cameras do not record during poor night-time illumination). With 12 h of monitoring, more than 100 GB of video data are generated every day and more than 35 TB every year. This area has rich fish resources, which provides a solid guarantee for the data source of this paper. In addition, the Fish4Knowledge large underwater video library derived from LifeCLEF 2015 fish task 1 (Fisher et al. 2016), referred to below as LCF-15, is also used as experimental data. The video library contains more than 700,000 submarine videos, mainly fish images taken from the underwater landscape platforms of Taiwan's Nanwan Strait, Lantau Island and Hubi Lake. The videos were taken from September 30, 2013 to October 1, 2019.
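The daily and yearly volumes quoted above follow from the per-minute figure. The short calculation below is only a consistency check; it assumes decimal units (1 GB = 1000 MB) and recording on all 365 days of the year, neither of which is stated in the text.

```python
# Back-of-the-envelope check of the video data volumes quoted in the text.
MB_PER_MINUTE = 150   # lower bound from the text, all ten cameras combined
HOURS_PER_DAY = 12    # daily monitoring window from the text
DAYS_PER_YEAR = 365   # assumption: recording runs every day of the year

mb_per_day = MB_PER_MINUTE * 60 * HOURS_PER_DAY
gb_per_day = mb_per_day / 1000                   # assumption: 1 GB = 1000 MB
tb_per_year = gb_per_day * DAYS_PER_YEAR / 1000  # assumption: 1 TB = 1000 GB

print(f"{gb_per_day:.0f} GB per day")    # 108 GB per day
print(f"{tb_per_year:.1f} TB per year")  # 39.4 TB per year
```

Both results are consistent with the stated "more than 100 GB" per day and "more than 35 TB" per year.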
Technical information on the above two datasets is presented in Table 1 below. The LCF-15 dataset has been labeled by experts, with a total of 42,778 labeled video frames. From the Wuzhizhou Island database, 24 video clips from different time periods were selected, giving a total of 3,100 labeled video frames. In total, 45,878 video frames were labeled, of which 32,290 form the training set and 13,588 form the test set (Table 2). Seabed video reflects the surrounding marine environment as well as the color, shape, type and other characteristics of marine fish. Figure 2 shows a video frame from the LCF-15 database.
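As a quick check of the frame counts above, the snippet below confirms that the two sources add up to the stated total and shows the resulting split proportions. The variable names are illustrative only.

```python
# Frame counts quoted in the text.
lcf15_frames = 42_778      # expert-labeled LCF-15 frames
wuzhizhou_frames = 3_100   # labeled Wuzhizhou Island frames
train_frames = 32_290
test_frames = 13_588

total = lcf15_frames + wuzhizhou_frames
assert total == train_frames + test_frames == 45_878

train_share = train_frames / total
print(f"train: {train_share:.1%}, test: {1 - train_share:.1%}")  # train: 70.4%, test: 29.6%
```

The split is therefore roughly 70/30 between training and test data.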
Video frames from the Wuzhizhou Island dataset, for which 3,100 video frames were labeled, are shown in Fig. 3. Table 3 shows the information and video images of some species appearing in the above datasets. The comparison shows that image quality is low in the unconstrained seabed environment, and that it is difficult to obtain high performance in image detection and classification tasks.
Data pre-processing

Convert video to picture
Traditional remote video surveillance systems require substantial hardware and software resources, need high bandwidth and suffer long delays in data transmission (Xiao and Cai 2015). The seabed video of Wuzhizhou Island is compressed using the H.264 coding standard. On the basis of retaining the advantages of the original video coding, it has the advantages of

Data enhancement
In order to obtain more image samples, expand the sample size, and improve the accuracy and generalization ability of the model, this paper mainly uses ImageDataGenerator, the Keras image processing class, to enhance the original training data. This method expands the dataset by rotating the image between −40° and 40°, zooming in the proportion of 0.2-1 / 0.2, shifting or shearing it by a fraction of 0 to 0.2 along the horizontal or vertical direction, and filling newly exposed pixels with their nearest-neighbor values. The processing flow is shown in Fig. 5 and the data enhancement effect is shown in Fig. 6.
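To make the affine transformations described above concrete, the following sketch samples one random set of parameters and composes them into a 2 × 3 affine matrix. It is written in plain Python and is not the Keras ImageDataGenerator implementation. The function names are hypothetical, the zoom interval [0.8, 1.2] and the signed shift range are assumptions about how the value 0.2 is interpreted, and the transform is taken about the image origin for simplicity.

```python
import math
import random


def sample_affine_params(rng):
    """Draw one random set of augmentation parameters."""
    return {
        "rotation_deg": rng.uniform(-40.0, 40.0),  # rotation range from the text
        "zoom": rng.uniform(0.8, 1.2),             # assumption: 0.2 means [1 - 0.2, 1 + 0.2]
        "shift_x": rng.uniform(-0.2, 0.2),         # assumption: signed fraction of image width
        "shift_y": rng.uniform(-0.2, 0.2),         # assumption: signed fraction of image height
        "shear": rng.uniform(0.0, 0.2),            # assumption: dimensionless shear factor
    }


def affine_matrix(params, width, height):
    """Compose zoom, rotation and shear plus a shift into a 2x3 matrix."""
    theta = math.radians(params["rotation_deg"])
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    k, z = params["shear"], params["zoom"]
    # Linear part: z * [[cos, -sin], [sin, cos]] @ [[1, k], [0, 1]]
    return [
        [z * cos_t, z * (cos_t * k - sin_t), params["shift_x"] * width],
        [z * sin_t, z * (sin_t * k + cos_t), params["shift_y"] * height],
    ]


def transform_point(matrix, x, y):
    """Map a pixel coordinate (x, y) through the affine matrix."""
    (a, b, tx), (c, d, ty) = matrix
    return a * x + b * y + tx, c * x + d * y + ty


rng = random.Random(0)  # fixed seed so the sketch is repeatable
params = sample_affine_params(rng)
matrix = affine_matrix(params, width=648, height=480)
print(transform_point(matrix, 324, 240))  # where the image centre ends up
```

Each call to `sample_affine_params` yields a different transform, so one labeled frame can produce many distinct training samples. Pixels that fall outside the original frame after the transform are the ones that nearest-neighbor filling has to supply.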

Transfer learning
Data is the key to deep learning, and having enough training samples is the premise for a computer to learn a good classifier. Most of the available data are designed for specific subjects, such as the handwritten digit database MNIST, the image dataset ImageNet, and the image semantic understanding dataset COCO. These open source data cannot fully meet the training data requirements of this study, so it is necessary to supplement the training data for specific fish. However, manual image annotation is very time-consuming: it takes an average of 1.5 h to label a single image at the fine pixel level in the driverless dataset Cityscapes. In addition, mislabeling is prone to occur due to subjective factors. Therefore, the project introduces transfer learning to address the problem that only a small amount of labeled sample data is available in the current target domain. Transfer learning is a machine learning method that uses previously learned knowledge to solve problems in different but related fields. In the past, deep learning models needed to be retrained for each field, which takes a lot of time and computing resources. In the field of image recognition, however, the low-level features of images are highly similar, and they are combined into different, more complex features for image classification and recognition. Accordingly, a network model pre-trained on another large training set can be used as the low-level image feature extractor (Dai et al. 2007; Lawrence and Plat 2004; Lee et al. 2007), so as to transfer general image feature extraction knowledge to the field of tropical fish classification and recognition and reduce the dependence on labeled data.
The ImageNet image dataset, begun in 2009, is currently the world's largest image recognition database. It was built by computer scientists at Stanford to simulate the human recognition system, and contains about 15 million images in 22,000 categories, each of which has been rigorously screened and manually labeled. Since 2010, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has been held annually based on ImageNet. Research teams from around the world evaluate their algorithms on a given dataset and compete for higher accuracy in visual recognition tasks. Although ResNet50 is deeper than VGG16 and VGG19, its weights are smaller. In this paper, the ResNet50 network in the Keras library is used, and the parameters of the target model are initialized with the weights trained on ImageNet, so that the learned knowledge is shared with the new model. Then, the output layer is adjusted according to the target task, and the network model is trained on the tropical fish image samples to improve the accuracy and validity of the detection algorithm. Finally, the fish classification

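The weight-sharing step of transfer learning described in this section can be illustrated with a deliberately small sketch. It is a schematic in plain Python, not the Keras ResNet50 code used in the paper: the layer names and sizes are hypothetical, and weights are stored as nested lists. The only facts taken from the setting are the 1000 classes of the ILSVRC source task and the 9 fish categories of the target task.

```python
import random


def random_layer(n_in, n_out, rng):
    """Create an n_in x n_out weight matrix with small random values."""
    return [[rng.gauss(0.0, 0.01) for _ in range(n_out)] for _ in range(n_in)]


def init_from_pretrained(pretrained, num_classes, rng, head="output"):
    """Copy every pre-trained layer; re-create only the output layer for the new task."""
    model = {}
    for name, weights in pretrained.items():
        if name == head:
            # The output layer depends on the number of classes, so it is rebuilt.
            model[name] = random_layer(len(weights), num_classes, rng)
        else:
            # Feature-extraction layers start from the pre-trained values.
            model[name] = [row[:] for row in weights]
    return model


rng = random.Random(0)
# Hypothetical stand-in for a network pre-trained on a 1000-class source task.
pretrained = {
    "low_level_features": random_layer(8, 16, rng),
    "high_level_features": random_layer(16, 4, rng),
    "output": random_layer(4, 1000, rng),
}
fish_model = init_from_pretrained(pretrained, num_classes=9, rng=rng)
print(len(fish_model["output"][0]))  # 9
```

After this initialization, training on the fish images only has to adapt weights that already encode general image features, rather than learn them from scratch.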
Model architecture
The entire ResNet deep convolutional neural network consists of an initial convolution (Conv) layer, several Residual Blocks and a final fully connected (FC) layer. The first Conv layer does not require normalization or ReLU nonlinear activation. The Conv layers in a Residual Block normalize images according to their mean and standard deviation. To reduce the training time of gradient descent, ReLU nonlinear activation, which is several times faster than the equivalent tanh unit, is used as the activation function (Hinton et al. 2012). The first 3 × 3 convolution is then performed, and the convolution results are batch normalized again. After ReLU nonlinear activation, the second 3 × 3 convolution is performed. A long skip connection is implemented so that the activation of the first layer is passed quickly to the next Residual Block after the sixth layer. The total number of training iterations (train steps) of the model is 300. The image is convolved by a series of 3 × 3 filters, the convolution stride is fixed to 1, and the algorithm performance is optimized by mini-batch training. Considering the limited memory, the batch_size of both the training set and the test set is set to 24, and "same" padding is used so that the image size is preserved. The total number of parameters of the model is 23,587,712, of which 53,120 are fixed parameters and 13,534,592 are trainable parameters. The model structure and parameter settings are shown in Fig. 8.
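The order of operations inside a Residual Block (normalize, ReLU, convolve, normalize, ReLU, convolve, then add the skip connection) can be sketched on a one-dimensional signal. This is a simplified illustration under stated assumptions rather than the ResNet50 implementation: a width-3 kernel stands in for a 3 × 3 filter, normalization over the single signal stands in for batch normalization, and all function names are hypothetical.

```python
import math


def normalize(xs, eps=1e-5):
    """Shift to zero mean and scale to unit variance (stand-in for batch normalization)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


def relu(xs):
    """ReLU activation: negative values become zero."""
    return [max(0.0, x) for x in xs]


def conv_same(xs, kernel):
    """Stride-1 convolution with zero 'same' padding: output length equals input length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(xs) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(xs))
    ]


def residual_block(xs, kernel1, kernel2):
    """Two normalize-ReLU-convolve stages followed by the skip connection."""
    out = conv_same(relu(normalize(xs)), kernel1)
    out = conv_same(relu(normalize(out)), kernel2)
    return [o + x for o, x in zip(out, xs)]  # add the block input back


signal = [0.5, -1.0, 2.0, 0.0, 1.5]
output = residual_block(signal, [0.25, 0.5, 0.25], [-1.0, 0.0, 1.0])
print(len(output) == len(signal))  # True
```

Because the stride is 1 and the padding preserves the size, the block output has the same shape as its input, which is what allows the skip connection to add the two element by element.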

Experiments and results analysis
The undersea video data in the LCF-15 dataset were acquired from coral reef areas in southern Taiwan. Hainan and Taiwan both have a pronounced marine climate, and both have tropical climates. Due to its proximity to the equator and its gentle terrain, Hainan has a tropical monsoon climate with long summers and short winters.
Taiwan lies at a higher latitude and has complex terrain with large differences in elevation, but its south has a tropical monsoon climate similar to that of Hainan Province. In order to avoid differences in fishery resources caused by differences in natural conditions between the two places, this study selected the dominant fish species in Taiwan and Hainan to build the dataset. There are mainly 9 categories as follows: Abudefduf vaigiensis, Coris  In order to evaluate transfer learning, we conducted experiments on the LCF-15 and Wuzhizhou Island datasets, and compared the learning indicators before and after transfer learning with the network structure, learning rate and loss function unchanged. An Intel(R) Xeon(R) X5675 3.06 GHz processor, 32 GB of memory and an NVIDIA GeForce GTX 1050 Ti GPU were used in the experiment.
Train loss is the loss on the training data. Cross entropy is used as the loss function to measure how well the model fits the training set: the lower the train loss, the better the fit. Validation loss is the loss on the validation set. Cross entropy is again used as the loss function to measure how well the model fits unseen data, that is, the generalization ability of the model: the lower the validation loss, the better the generalization. It can be seen from Fig. 9 that when the model pre-trained on ImageNet provides the initial weights of the network, the train loss and the validation loss decrease rapidly, and both continue to decrease as training proceeds. After 150 epochs of training, the validation loss oscillates between 0.37 and 0.41. With further training, the loss continues to oscillate downward, indicating that this method has good fitting ability and generalization ability.
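For readers who want the loss definition in concrete form, the sketch below computes the mean cross-entropy loss of a classifier from its raw output scores. It is a generic illustration of the standard formula, not the training code used in the paper, and the example scores are made up.

```python
import math


def softmax(logits):
    """Turn raw scores into class probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])


def mean_loss(batch_logits, labels):
    """Average cross-entropy over a batch of samples."""
    losses = [cross_entropy(softmax(z), y) for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


# Made-up scores for two samples over three classes, with their true labels.
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
labels = [0, 2]
print(mean_loss(batch_logits, labels))

# A model that guesses uniformly over 9 classes has loss ln(9), about 2.197.
print(mean_loss([[0.0] * 9], [3]))
```

As a rough guide to the scale, a per-sample loss of 0.4 corresponds to assigning the correct class a probability of exp(−0.4) ≈ 0.67.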
Train accuracy is the accuracy on the training data, and validation accuracy is the accuracy on the validation set. The higher the accuracy, the better the model classifies and recognizes tropical fish. It can be seen from Fig. 10 that when the model pre-trained on ImageNet provides the initial weights of the network, the train accuracy and validation accuracy increase rapidly, and both keep increasing as training proceeds. After 150 epochs of training, the validation accuracy oscillates between 0.85 and 0.90. With further training, the accuracy continues to oscillate upward, indicating that this method has good recognition accuracy.
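Accuracy here is the fraction of samples whose highest-scoring class matches the label. A minimal sketch with made-up scores (the function names are illustrative):

```python
def predict(scores):
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(batch_scores, labels):
    """Fraction of samples whose predicted class equals the true label."""
    correct = sum(predict(s) == y for s, y in zip(batch_scores, labels))
    return correct / len(labels)


# Made-up class scores for four samples over three classes.
batch_scores = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
labels = [1, 0, 2, 1]
print(accuracy(batch_scores, labels))  # 0.75
```

The last sample is misclassified (class 0 predicted, class 1 expected), so three of four predictions are correct.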
With the network structure and other hyperparameters kept the same, this paper compares the train loss and validation loss before and after using transfer learning. From Fig. 11, it can be seen that both with and without transfer learning, the train loss decreases to about 0.2 to 0.3 as training proceeds. At the initial stage of training, the loss decreases significantly. After about 100 epochs, the train loss has essentially dropped below 0.3, and the validation loss oscillates between 0.25 and 0.45. The loss with transfer learning is significantly lower than the loss without it.
With the network structure and other hyperparameters kept the same, this paper also compares the train accuracy and validation accuracy before and after using transfer learning. It can be seen from Fig. 12 that both with and without transfer learning, the train accuracy increases to about 0.9 as training proceeds. In the initial stage of training, the accuracy increases significantly. After about 50 epochs, the train accuracy is essentially stable above 0.8. The accuracy with transfer learning is clearly higher than the accuracy without it.
This paper takes tropical fish in marine video surveillance data under unconstrained conditions as the research object and constructs a tropical fish recognition model based on transfer learning, with ResNet50 as the base network. The results show that transfer learning works well for the computer recognition of tropical fish in an unconstrained environment, helping to improve recognition accuracy and shorten training time. In the future, we will obtain ultrahigh-definition video data and optimize the algorithm to address the problem of convergence to a local optimum rather than the global optimum. At the same time, we will improve the hardware conditions to further increase processing accuracy.