Keywords

1 Introduction

In the last few years, underwater video cameras are extensively used in scientific, industrial and military fields for exploring and studying underwater environments. Marine biologists are interested in using underwater video analysis to study fish populations as species richness and size measurement [1,2,3,4,5], abundance [6] or animal behavior [1]. Automatic processing is an advantage compared to manual processing which is relatively off putting task, subjective, time consuming and costly. Automatic fish classification can be divided into two parts. (1) Fish detection which aims to detect and separate the subject from the background. (2) Fish recognition which aims to identify the species of the detected fish. The underwater environment presents a lot of difficulties and poses great challenges for computer vision. The luminosity changes frequently, the visibility is limited and the background can change rapidly due to moving aquatic plants. There are some attempts to improve image contrast and resolution for underwater images [7, 8]. In addition, in fish recognition task, the fish can move in three dimensions, it can also hide behind rocks and algae. We also encounter the problems of fish overlapping and of the similarity in shape and patterns among fish of different species. In this paper, we will focus on fish recognition in underwater video images.

Convolutional neural networks (ConvNets) [9] consist of L learned layers. The first layer is the input layer and represents a raw image. The hidden layers typically consist of convolutional layers, pooling layers, normalization layers and fully connected layers. The output layer consists of N-dimensional vector where N is the number of classes, this layer uses Softmax function to predict a single class of N mutually exclusive classes. ConvNets are trained using a standard error-backpropagation algorithm.

Li et al. [2] applied Fast R-CNN (Regions with Convolutional Neural Networks) on underwater images to detect and recognize fish species. They achieved an accuracy of 81.4% on LifeCLEF 2014 dataset that contains 24277 fish images of 12 species. Choi [11] participated in LifeCLEF 2015 task for detecting and identifying fish in underwater videos and achieved the best performance of 81% in this task [12]. He detected fish by using background subtraction and a selective search strategy [13]. Then, he used the GoogleNet [14] based on convolutional neural networks to classify fish species. Qin et al. [3] used convolutional neural networks on the Fish Recognition Ground-Truth dataset consisting of a total of 27370 fish images of 23 species and they reached an accuracy of 98.57%. Qin et al. [4] used also deep architecture to extract the features of fish images. In their architecture, two convolutional layers use Principal Component Analysis (PCA), the non-linear layer uses a binary hashing and the feature pooling layer uses a block-wise histograms. Then, information invariant to large poses are extracted by using spatial pyramid pooling (SPP). Finally, they use a linear SVM classifier for the classification. Despite they introduced hand-crafted layers they have improved marginally the accuracy by 0.07%. Sun et al. [5] extracted features from underwater images by applying two deep learning architectures, PCANet [15] and Network In Network (NIN) [16]. For classification, they used a linear SVM classifier. They tested their model on a database of 15 species and obtained an accuracy of 69.84% with the NIN architecture and 77.27% with the PCANet architecture. Salman et al. [17] created a deep architecture of three convolutional layers to extract features, then they combined the features from multiple layers of the network to feed standard classifiers like SVM and KNN. They achieved an accuracy of 96.75% on test set of 7500 fish images issue from LifeCLEF 2015 Fish dataset.

Fig. 1.
figure 1

The proposed approach based on the pretrained AlexNet network with transfer learning technique for fish recognition.

Learning deep architectures from scratch necessitates a large dataset because of the huge number of weights to be trained. Available underwater datasets are of small size for learning ConvNets for underwater fish recognition. To overcome this problem, we introduce transfer learning framework [18] to train ConvNets from pretrained networks that could be trained on large datasets. AlexNet [10], GoogleNet [14], VGG [19] and ResNet [20] are some examples of pretrained models that have emerged in this field last few years. In this paper, we propose to transfer the learned weights from AlexNet model to a deep ConvNet for fish recognition in the open sea. We extract the fish features from images of the underwater dataset before and after fine-tuning the model and classify the input images with a linear SVM classifier on the extracted features. We choose AlexNet because this model needs less resources, is faster and has a simple architecture than others networks like GoogleNet (22 layers deep) and VGG (at least 16 convolutional layers) that make fine-tuning the transferred weights difficult especially with limited training data.

The paper is organized as follows: in the next Sect. 2, we describe the proposed approach for live fish recognition based on pretrained model. Then, the Sect. 3 details the experimental scheme and we give a comparative study evaluating the proposed approach on the Fish Recognition Ground-Truth dataset (c.f. Fig. 2). Finally, conclusions and perspectives are given in Sect. 4.

2 Proposed Approach

Available underwater datasets for fish recognition are too small for training deep ConvNets from scratch with random initialization. Moreover, deep learning requires immense resources of memories and processors. To overcome the difficulties imposed by limited training data, we use trained weights of AlexNet to extract fish features from images by removing some layers of the model and then using the rest of the network as a fixed feature extractor for our data. In order to demonstrate the effectiveness of fine-tuning approach, we extract features before and after fine-tuning. Fine-tuning algorithm consists of retraining the classifier on top of the network on the underwater image set and fine-tune the weights of the AlexNet via back-propagation. Finally, we will propose three schemes for classification. These schemes will be detailed below.

Fig. 2.
figure 2

Sample images of 23 fish species in Fish Recognition Ground-Truth dataset.

2.1 Architecture of AlexNet

As shown in the Fig. 1, AlexNet [10] has five convolutional layers. The number of filters and their size in these layers are 96 filters of size \(11 \times 11 \times 3\), 256 filters of size \(5 \times 5 \times 48\), 384 filters of size \(3 \times 3 \times 256\), 384 filters of size \(3 \times 3 \times 192\) and 256 filters of size \(3 \times 3 \times 192\) respectively. It has also three fully-connected layers with 4096 neurons in the two first layers and the last one has 1000 neurons. AlexNet has been trained over 1.2 million images from the ImageNet dataset [21]. It can classify images into 1000 categories of objects (such as keyboard, mouse, coffee cup, pen and many animals).

2.2 Input Images

First, we eliminate the background of images using the fish masks given in the dataset (c.f. Fig. 3). These masks are generated by Qin et al. [22] who proposed a foreground extraction method for underwater videos based on sparse and low-rank matrix decomposition. Then, we use foreground fish images as input training images after a resizing to the same size \(227 \times 227 \times 3\).

2.3 Feature Extraction and Classification

We first employ the AlexNet model without any fine-tuning to extract learned features by removing the output layer fc8 and using the value outputs of fc7’s layer as feature descriptor for each fish image, we denote this scheme by Alex-SVM.

Fig. 3.
figure 3

Example of foreground extraction. (a): original image. (b): fish mask. (c): fish foreground.

It is recommended to fine-tune the model, especially, when data similarity is very low between the original data and the new data. As the dataset used in this work is totally different from ImageNet, we will retrain the AlexNet model on the underwater dataset. We initialize a new fully-connected layer to replace the old one fc8 with a random values of 23 outputs corresponding to the 23 species in the considered dataset. Then, we retrain only this layer and we keep the weights of the lower layers. This is because the features captured in the lower layers are universal like edges and curves that are also pertinent to our task. We get a new AlexNet model retrained, we denote this scheme by Alex-FT-Soft.

Finally, we use the retrained AlexNet model to re-extract fish features from images as in the first one. We denote this last scheme by Alex-FT-SVM.

Table 1. The fish species distribution in the Fish Recognition Ground-Truth dataset.

3 Experimental Results

The performances evaluation of the proposed system is carried out on the Fish Recognition Ground-Truth datasetFootnote 1. We implement the algorithms in Matlab and use MatConvNetFootnote 2 library for training the deep ConvNet.

The Fish Recognition Ground-Truth dataset is an underwater live fish image dataset acquired from a live video dataset made by the European project Fish4-KnowledgeFootnote 3 (F4K) [23]. The dataset contains 27370 fish images and their fish masks of 23 different species. The fish species are manually labeled by following instructions from marine biologists. The dataset is imbalanced in the number of different fish species where the number of the most frequent species is about 1000 times more than the least one. The Fig. 2 shows examples of the 23 fish species and Table 1 shows the distribution of the fish species in the dataset.

Table 2. Comparison of fish recognition performances of various methods on the Fish Recognition Ground-Truth dataset.

We use 7-Fold Cross-Validation in order to estimate the performance of the proposed approach [4]. The results of classification using the three proposed schemes are given in the Table 2 in terms of accuracy and precision.

Fig. 4.
figure 4

Visualization of the first convolutional layer filters of AlexNet [10] (96 filters of size \(11\times 11\times 3\)), DeepFish [4] (32 filters of size \(5\times 5\times 3\)) and DeepCNN [3] (64 out of 72 filters of size \(5\times 5\times 3\)) (Color figure online)

As shown in Table 2, proposed schemes are efficient and give promising results. Alex-FT-SVM performs better than Alex-SVM, this is because in Alex-SVM the higher layers of the network are more precise to the details of the objects contained in ImageNet dataset. However, after fine-tuning, these layers become more precise to the details of the fish species contained in our dataset, therefore, we achieve the best accuracy of 99.45%. We conclude that the fine-tuning improves the performance of the system. We can also see that the Softmax classifier is less robust than SVM classifier, especially, for species with fewer samples like ‘Zebrasoma scopas’, ‘Hemigymnus melapterus’, ‘Scolopsis bilineata’, ‘Scaridae’, ‘Pempheris vanicolensis’, ‘Zanclus cornutus’, ‘Balistapus undulates’, ‘Neoglyphidodon nigroris’ and ‘Siganus fuscescens’.

Table 2 shows also the comparison of our approach with state-of-the-art methods on the Fish Recognition Ground-Truth dataset. In this work, we want to test a purely deep learning-based system without any layers that contain hand-crafted methods like PCA, block-wise histograms, spatial pyramid pooling (SSP) [4], nor any method to improve the performance like data augmentation or scale images as in DeepFish-SVM-aug and DeepFish-SVM-aug-scale [4]. We can observe that Alex-FT-SVM outperforms the state-of-the-art methods even those with hand-crafted layers. In Deep-CNN [3], the authors created a ConvNet with three convolutional layers and trained the network from scratch. As we can see the network trained with transfer learning gives better results than networks trained from scratch.

The Fig. 4 visualizes weights of the 96 filters, 32 filters and 64 filters in the first convolutional layer of the adopted AlexNet, DeepFish and DeppCNN respectively. The color filters extract low-frequency features and the grayscale filters extract high-frequency features. As we can see, the AlexNet has more filters than DeepFish and DeepCNN which means more variety of selective filters extracting more features at different scales and different orientations. We note that additional filters are all very nice, smooth, well-formed and without noisy patterns that make AlexNet richer by feature representations.

4 Conclusion and Future Work

Underwater live fish recognition will become a necessary tool to assist marine ecologists in studying the biodiversity in underwater areas because traditional techniques are destructive, affect fish behavior, demand time and labor costs. Proposed convolutional neural networks for fish identification require large datasets due to the huge number of parameters to be trained, especially in deeper networks. Transfer learning is a solution for training this kind of networks by using pretrained models which have been trained on a large dataset.

In this paper, we proposed to transfer learned weights of the pretrained network AlexNet which has been trained on ImageNet dataset to recognize fish species in underwater images. We have extracted fish features from images by AlexNet to feed a linear SVM before and after fine-tuning.

Experiments on the Fish Recognition Ground-Truth dataset demonstrate that the proposed approach outperforms various other approaches employed for fish species identification.

In future work, we plan to extend the method proposed here for larger underwater video datasets with more classes.