Monitoring of fish stocks across a wide range of environments is a critical task for effective management. Fisheries scientists and managers monitor fish stocks by collecting data on population abundance, biomass and densities (Egerton et al. 2018; Smith et al. 2021), schooling behaviours (Trenkel et al. 2011), predator–prey relationships (Becker and Suthers 2014; Boswell et al. 2019), and movement via key passageways (Bennett et al. 2020). Common direct sampling methods including nets (e.g. seine, gill, fyke or trawl), traps, and line fishing can be invasive and introduce sampling bias (Kuriyama et al. 2019; French et al. 2021). Indirect sampling techniques such as visual census and baited or unbaited remote underwater video (BRUV/RUV) are alternatives to direct, invasive methods, but are ineffective when visibility is poor [e.g. in turbid waters or deep offshore habitats, and at night (Becker et al. 2011; Benoit-Bird and Lawson 2016; Sheaves et al. 2020; Kimball et al. 2021)]. Acoustic camera monitoring (which uses sound reflectance, instead of light reflectance) offers a non-invasive survey method in underwater environments to overcome the obstacle of sampling where standard video imagery or visual census is unfeasible (Horne 2000).

Fisheries scientists have used acoustic cameras to monitor fish (and other animals) by detecting their direct acoustic image and/or their acoustic shadow (Horne 2000; Trenkel et al. 2011; Martignac et al. 2015). For example, acoustic cameras have been used in saltmarsh habitats to analyse predator–prey interactions (Boswell et al. 2019) and fish movement in tidal passageways (Kimball et al. 2010; Bennett et al. 2020); in areas of high turbidity caused by sedimentation to estimate size and abundance of key demersal fish (Artero et al. 2021); and, in intermittently closed estuaries to determine the abundance and the direction of fish movement, and the distribution of different sized fish (Becker et al. 2016, 2017). Coupling of direct acoustic images and acoustic shadows has enabled identification of different species (Able et al. 2014; Artero et al. 2021). Furthermore, different size classes of fish have been determined with high accuracy through the direct analysis of acoustic shadows (Langkau et al. 2012). While sampling fish using direct acoustic images or shadows is helpful when visibility is poor, as for normal video imaging, acoustic sampling produces vast amounts of data that require laborious and costly processing and analysis.

Automation techniques to overcome the challenges and costs of manually processing video footage are revolutionising monitoring in aquatic environments. For instance, a type of machine learning called deep learning (DL) uses convolutional neural networks (CNN) to analyse standard video footage to detect and classify fish (Mandal et al. 2018; Villon et al. 2018; Salman et al. 2020). Automatic detection and classification of fish increases the efficiency of monitoring the abundance of fish populations (Marini et al. 2018; Ditria et al. 2020a, 2020b), tracking movement of fish (Lopez-Marcano et al. 2021), measuring fish sizes (Álvarez-Ellacuría et al. 2020; Coro and Walsh 2021), and monitoring behaviour patterns (Saberioon et al. 2017; Ditria et al. 2021). Similar approaches to automate the processing and analysis of acoustic data have been used to detect fish aggregations (Shahrestani et al. 2017; Vatnehol et al. 2018; Tarling et al. 2021), track the speed and direction of fish in trawls (Handegard and Williams 2008), track the direction, abundance and size of salmonids (Kulits et al. 2020), detect the presence/absence of tuna (Uranga et al. 2017), and to identify and track marine mammals such as seals (Hastie et al. 2019). Although useful, these studies rely solely on the direct acoustic detection of the species of interest. Deep learning algorithms that simultaneously evaluate the direct and shadow detections might improve the accuracy of automation and, ultimately, provide a valuable tool for automated analysis of acoustic data.

Our goal was to train CNNs to detect fish from direct acoustic images, acoustic shadows, and the combination of direct images and shadows. This is a step towards using automated direct and shadow detections from acoustic data for continuous monitoring of a wide range of metrics. We predicted that model performance would be enhanced by training on both the direct and shadow detections rather than the direct detections alone.

Materials and methods

To achieve our aim of using acoustic camera data to automatically detect fish using deep learning (DL) models, we first acquired a dataset of labelled fish species sampled using DIDSON (dual-frequency identification sonar). DIDSON is a multi-beam high-frequency (1.1 or 1.8 MHz) sonar device that transmits acoustic pulses through the water to detect objects. The pulses are reflected when the sound wave meets an object of a different density to the medium through which it propagates. DIDSON displays video-like images of the reflected acoustic echoes on an echogram, using colours and colour intensity to represent the objects and strength of the signal (Martignac et al. 2015).

We annotated and trained a model to detect both the fish and shadow of the fish. To obtain abundance data the model detected fish and shadows separately in each image. We then used an automated post-processing step selecting the maximum count of either fish or shadow detections (not both) in each image, generating the “combined” count. We then analysed the accuracy of the model using common performance metrics (Fig. 1). We focussed on the most widely used measure of abundance in videos, MaxN, the maximum number of fish visible in a video in any one frame (Ellis and DeMartini 1995; Whitmarsh et al. 2016; Langlois et al. 2020).
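The post-processing step described above can be sketched in a few lines of Python. This is a minimal illustration and not the study's code: the function names, the `(label, score)` detection format, the use of `>=` for the confidence threshold, and the example scores are all assumptions made for the sketch.

```python
def count_detections(detections, threshold):
    """Count direct ('fish') and 'shadow' detections at or above a threshold.

    detections: list of (label, score) pairs for one image (assumed format).
    """
    n_direct = sum(1 for label, score in detections
                   if label == "fish" and score >= threshold)
    n_shadow = sum(1 for label, score in detections
                   if label == "shadow" and score >= threshold)
    return n_direct, n_shadow


def combined_count(n_direct, n_shadow):
    """Combined count for one image: the larger of the two counts, not their sum."""
    return max(n_direct, n_shadow)


def max_n(frames, threshold):
    """MaxN per video: the highest per-image count across all frames.

    frames: list of per-image detection lists.
    Returns MaxN for direct, shadow and combined counts.
    """
    best = {"direct": 0, "shadow": 0, "combined": 0}
    for detections in frames:
        n_direct, n_shadow = count_detections(detections, threshold)
        best["direct"] = max(best["direct"], n_direct)
        best["shadow"] = max(best["shadow"], n_shadow)
        best["combined"] = max(best["combined"],
                               combined_count(n_direct, n_shadow))
    return best


# Invented scores: one direct detection and two shadows in the first image.
example_video = [
    [("fish", 0.90), ("shadow", 0.92), ("shadow", 0.75)],
    [("fish", 0.60)],
]
print(max_n(example_video, 0.70))
print(max_n(example_video, 0.85))
```

With these invented scores the combined MaxN is 2 at a 70% threshold and 1 at 85%, which illustrates how a shadow detection can recover a fish that the direct detection missed, and how a stricter threshold can remove that gain.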

Fig. 1 Flow diagram of deep learning models trained on acoustic imagery to detect fish and fish shadows. During post-processing, count and MaxN were calculated for direct and shadow detections separately and then merged to calculate the combined count and MaxN of direct or shadow detections


The data used for the DL model were sourced from a professionally labelled acoustic dataset that is publicly available under a creative commons licence that permits use with acknowledgements (McCann et al. 2018). This dataset contains acoustic observation data sampled using DIDSON in the Ocqueoc River in northern Michigan, USA, between 2013 and 2016. Each video is sorted and labelled with the fish species known to be present. All videos used a window length of 2.5–12.5 m. To obtain enough training data, videos containing the two most common species, walleye (Sander vitreus) and common carp (Cyprinus carpio), were selected for training and testing of the DL model.

Although two species were selected from the dataset for training and testing, all data were pooled and labelled as ‘fish’ so the model could be trained and tested against different backgrounds. The two species grow to a similar size, ranging up to 80 cm in length. The carp to walleye ratio in this pooled dataset was approximately 2:1 for both direct and shadow annotations. Because the two species did not occur in the same frame and each was recorded against a different background, a model trained and tested on ‘species’ might have learned to associate the background with species identity, introducing bias. Therefore, the goal was to test automatic identification of ‘fish’, rather than to identify the different species. From the total pool of walleye and carp videos, 157 segments of video from 21 days of DIDSON deployments (stratified by day) were allocated to one of two datasets: training (115 segments) and testing (42 segments). Raw video resolution was either 1002 × 564 (47 segments) or 1920 × 1080 (110 segments), and all videos were standardised to 1002 × 564 prior to processing. Each segment was unique to either training or testing, and segments from each day were randomly allocated to each of the datasets.
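One way to implement an allocation of this kind is sketched below. The exact procedure is not given in the text, so the function name, the `(segment_id, day)` input format, the fixed random seed, and the per-day rounding rule are assumptions; the default test fraction simply mirrors the reported 42 of 157 segments.

```python
import random
from collections import defaultdict


def split_by_day(segments, test_fraction=42 / 157, seed=0):
    """Randomly allocate segments to training or testing within each day.

    segments: list of (segment_id, day) pairs (assumed format).
    Returns (train_ids, test_ids); no segment appears in both lists.
    """
    by_day = defaultdict(list)
    for segment_id, day in segments:
        by_day[day].append(segment_id)

    rng = random.Random(seed)  # seeded so the split is reproducible
    train_ids, test_ids = [], []
    for day in sorted(by_day):
        ids = sorted(by_day[day])
        rng.shuffle(ids)
        # Per-day rounding: an assumption, not the study's stated rule.
        n_test = round(len(ids) * test_fraction)
        test_ids.extend(ids[:n_test])
        train_ids.extend(ids[n_test:])
    return train_ids, test_ids


# Hypothetical example: 3 days with 8 segments each.
example = [(f"day{d}_seg{i}", d) for d in range(3) for i in range(8)]
train, test = split_by_day(example)
print(len(train), len(test))
```

Because rounding is applied within each day, a sketch like this will not necessarily reproduce the exact 115/42 split reported here.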

Manual annotation of imagery provided the ‘ground-truth’ fish counts both to train the model and to evaluate performance. Images were extracted at 5 frames per second. Using bounding boxes around the fish and shadows, both the direct and shadow detections of the fish were manually annotated in each of the extracted images (Table 1). To assist with identification, the annotator was able to play videos back and forth to increase confidence that the object in the video was moving and could be correctly identified as either ‘fish’ or ‘shadow’ (see animation in Online Resource 1 for an example of how movement was used to annotate our dataset).
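For readers reproducing the frame extraction, the choice of which frames to keep at 5 frames per second can be sketched as follows. The native frame rate of the recordings is not stated in the text, so `native_fps` is a hypothetical parameter, as is the function name; decoding the video itself is left out of the sketch.

```python
def sample_frame_indices(n_frames, native_fps, target_fps=5.0):
    """Indices of frames to keep so that about target_fps frames remain per second.

    If the recording is already at or below target_fps, every frame is kept.
    """
    if n_frames < 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("n_frames must be >= 0 and frame rates must be > 0")
    step = max(native_fps / target_fps, 1.0)  # never upsample
    indices = []
    k = 0
    while True:
        index = int(k * step)
        if index >= n_frames:
            break
        indices.append(index)
        k += 1
    return indices


# Hypothetical 3 s clip recorded at 10 frames per second: keep every second frame.
print(sample_frame_indices(30, native_fps=10))
```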

Table 1 Numbers of annotations of acoustic imagery for the direct detections of fish and shadows of fish used in deep learning models

Object detection model and performance metrics

We used a convolutional neural network (CNN) for object detection. Specifically, our model was trained using Faster-RCNN with a ResNet50 configuration, pre-trained using the ImageNet1k dataset (Massa and Girshick 2018). Model training, testing, and prediction tasks were conducted on a Microsoft Azure Data Science Virtual Machine powered by an NVIDIA V100 GPU. Overfitting was mitigated by using the early-stopping technique (Prechelt 2012).
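The text does not state which early-stopping criterion was applied. As an illustration only, a common patience-based variant is sketched below: training stops once the validation loss has failed to improve for a set number of epochs, and the weights from the best epoch are kept. The function name, the `patience` and `min_delta` parameters, and the loss values are assumptions.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops at the first epoch where the loss has not improved on the
    best value by more than min_delta for `patience` consecutive epochs.
    If that never happens, the last epoch is returned as the stop epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Invented validation losses: the minimum is at epoch 2, then the loss plateaus.
losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.69]
print(early_stopping_epoch(losses, patience=3))
```

In this invented run, training would stop at epoch 5 and the model from epoch 2 would be retained, even though a slightly lower loss appears later; that trade-off is controlled by the patience setting.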

We tested the performance of our model using two key metrics of fish abundance: count-per-image and MaxN per video (for direct, shadow and combined detections). Count-per-image was calculated over a total of 1287 images, and was used to assess performance on an image-by-image basis. MaxN was calculated for 42 video segments, and used to assess performance in an application context of providing a typical metric of abundance. For each of these metrics two performance criteria, precision (P) and recall (R), were determined for confidence thresholds between 5 and 95% in 5% increments. The confidence threshold is the level of prediction certainty required to state a detection. Precision measures the fraction of fish detections that were correct, and recall measures the fraction of fish actually present that were detected.

$${\text{Precision}} = \frac{{\text{True positive}}}{{{\text{True positive}} + {\text{False positive}}}}$$
$${\text{Recall}} = \frac{{\text{True positive}}}{{{\text{True positive}} + {\text{False negative}}}}$$

Overall performance for count-per-image and MaxN was determined by the F1 score, which represents the balance between precision and recall. F1 is calculated as follows:

$$F1 = 2 \times \frac{P \times R}{{P + R}}$$
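The three formulas above can be computed as in the following sketch. How true positives, false positives and false negatives were tallied from counts is not specified in the text; the count-based tally in `count_errors` is therefore an assumption, as are the function names and the example counts. The threshold list reproduces the 5% to 95% sweep in 5% increments.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive and false negative totals."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def count_errors(predicted, truth):
    """Assumed tally from paired counts (per image, or per video for MaxN).

    Matched fish are true positives, surplus predictions are false positives,
    and fish that were present but not predicted are false negatives.
    """
    tp = sum(min(p, t) for p, t in zip(predicted, truth))
    fp = sum(max(p - t, 0) for p, t in zip(predicted, truth))
    fn = sum(max(t - p, 0) for p, t in zip(predicted, truth))
    return tp, fp, fn


# Confidence thresholds from 5% to 95% in 5% increments.
THRESHOLDS = [t / 100 for t in range(5, 100, 5)]

# Invented counts for four images at a single threshold.
predicted_counts = [2, 1, 0, 3]
true_counts = [2, 2, 1, 2]
tp, fp, fn = count_errors(predicted_counts, true_counts)
print(tp, fp, fn, precision_recall_f1(tp, fp, fn))
```

Repeating the tally at each value in `THRESHOLDS` gives the precision and recall curves from which the 70% and 85% operating points are read.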

We checked for any systematic biases in false detections by examining the size and distance from camera of objects, and comparing these for false positives and false negatives against true positives. Distances from camera were extracted directly via DIDSON software, and sizes were calculated as the area of bounding boxes as a percentage of the total image size, from predicted detections for true positives and false positives, and from manually annotated boxes for false negatives. For both distances and sizes, the frequency distributions of the three categories were compared using Kolmogorov–Smirnov two-sample tests on non-standardised frequency data, pairwise among the three detection types.
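The two quantities used in this comparison can be illustrated with a short sketch: the bounding-box area as a percentage of the image, and the two-sample Kolmogorov–Smirnov statistic D (the largest gap between the two empirical cumulative distributions). The box format `(x_min, y_min, x_max, y_max)` in pixels and the function names are assumptions, and the sketch returns only D; p-values would normally come from a statistics library routine such as SciPy's `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right


def box_area_percent(box, image_width=1002, image_height=564):
    """Bounding-box area as a percentage of total image area.

    box: (x_min, y_min, x_max, y_max) in pixels (assumed format).
    The default image size is the standardised resolution of 1002 x 564.
    """
    x_min, y_min, x_max, y_max = box
    box_area = (x_max - x_min) * (y_max - y_min)
    return 100.0 * box_area / (image_width * image_height)


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the maximum absolute difference between the empirical cumulative
    distribution functions of the two samples, evaluated at every data point.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Invented box areas (% of image) for two detection types.
true_positive_areas = [1, 2, 3, 4]
false_positive_areas = [3, 4, 5, 6]
print(ks_statistic(true_positive_areas, false_positive_areas))
```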

Results

The model was successful in automatically counting fish in acoustic imagery using either the direct detection, shadows, or a combination of both (Fig. 1). At a confidence threshold of 85%, shadows improved the direct F1-score from 0.79 to 0.90 for counts, and from 0.90 to 0.91 for MaxN. Performance of the model increased because shadow detections sometimes occurred when a direct detection was missed (see the example in Fig. 2).

Fig. 2 Example DIDSON image with ground-truthed fish count = 2. Detections shown in green (direct) and yellow (shadow) with probabilities, and counts given for 70% and 85% confidence thresholds (CT). Panel a direct detection only, which underestimates fish count, and Panel b direct and shadow detections combined, which correctly estimates fish count at 70% CT, and underestimates by one at 85% CT

For the count-per-image results, at both a lower (70%) and higher (85%) confidence threshold, our model performed best for the shadow detections alone and combined detections (Table 2; Fig. 3a). F1 scores were lowest for direct detections alone (Table 2).

Table 2 Count per image results of a deep learning model trained on acoustic imagery of direct and shadow detections of fish, at confidence thresholds of 70% and 85%

Fig. 3 Precision and recall scores for the combined detection of fish for a Count per image, and b MaxN per video. Confidence thresholds are in 5% increments, and confidence thresholds of 70% and 85% are indicated for comparison of performance (these are the two CTs reported in Tables 2 and 3)

For the MaxN per video results, at a lower (70%) confidence threshold, the model performed slightly better for shadow detections alone and combined detections than for direct detections alone (Table 3; Fig. 3b). However, at a higher confidence threshold (85%), the model performed nearly as well for all three methods of detecting fish, with the combined detections only marginally higher than the direct or shadow detections alone (Table 3; Fig. 3b).

Table 3 MaxN per video results of a deep learning model trained on acoustic imagery of direct and shadow detections of fish, at confidence thresholds of 70% and 85%

In comparing the distances from camera and sizes of false detections against true positives, we found no pattern for distance from camera, but object detection size varied significantly among these detection types (Kolmogorov–Smirnov tests, all p values < 0.01; Fig. 4, noting that for visual interpretation, frequencies are displayed standardised by total counts, whereas KS tests were on non-standardised data). Both types of false detections had a higher proportion of very small detections than for true positives, with marginally smaller images for false negatives than false positives. Most false detections were around 10% or less of the total image area. These are small images as observable on screen, and do not necessarily reflect fish sizes, which vary with distance from cameras.

Fig. 4 Frequency distribution of detection area for different detection types. Frequencies for true positive (TP), false positive (FP) and false negative (FN) detections are shown standardised as a proportion of the total number of detections for that detection type. Standardisation simplifies visual analysis since the total counts for true positives were much greater than either false category. Detection area is reported as area of detection box as a percentage of total image area, a proxy for the size of the fish or shadow as actually observed in the frame

Discussion

We have presented a successful method for automatically detecting fish from acoustic imagery. The CNN reliably detected fish using either direct or shadow detections, or in combination, achieving high F1 scores for all three methods of detection. This automated method has the potential to reduce the time and cost of manually counting fish using acoustic data, and particularly so when MaxN is the desired measure of fish abundance. The level of accuracy achieved is equal to or above that reported previously for CNNs on sonar imagery. Using a CNN model to detect eels swimming through a weir in Canada, Zang et al. (2021) reported high accuracy (0.89), although on a relatively small number of videos. These authors had previously achieved higher accuracy (0.99) using a similar model in a controlled laboratory environment (Zang et al. 2021), but found that the model performed poorly on field data (0.5). Automated detection of salmon in sonar imagery using a CNN in conjunction with optical flow to detect pixel changes between sequential frames yielded accuracy of 0.8 (Kulits et al. 2020). The presence of large schools of mullet swimming along the coast has also been detected with accuracy of 0.89 (Tarling et al. 2021). All of these methods used only direct detection, not shadows. The current paper adds substantially to the view that CNNs will be very useful for automatically detecting fish versus no-fish in sonar imagery. Reliable detection of fish using DL techniques such as CNNs is clearly possible, and as the field develops, we encourage others to consider the inclusion of shadow detections.

Our results indicate that shadows can be a useful addition to include in model training and predicting when using CNNs, and probably for any other automation technique where shadows are present in the acoustic data being analysed. Previous studies using semi-autonomous fish counting methods have suggested that acoustic shadows are an impediment that reduced the accuracy of software solutions (Eggleston et al. 2020; Perivolioti et al. 2021). We have shown convincingly that if shadow information is included in training of detection algorithms, shadow detection can improve performance. We suspect that previous reports of difficulties with shadows adversely affecting fish counts might have resulted from a lack of shadow input in model training, or perhaps from shadows being unusable. Although we have demonstrated the usefulness of shadows in the imagery analysed, further investigation will be required to test how generalisable this finding is to DIDSON imagery more broadly. Shadow formation is affected by factors such as the angle of the acoustic camera to the substrate, substrate complexity, and fish orientation and behaviour. We largely used imagery in which fish were migrating up and down stream, swimming perpendicularly to cameras, throwing relatively large and easily detected shadows. When fish milled around, perhaps for foraging, orientation and direction changed frequently and shadows were often small or thin as a smaller body profile was exposed to the sonar, with poorer detectability. The usefulness of shadows for identification of different species (or morphospecies) in manual analysis of DIDSON imagery has been pointed out by Langkau et al. (2012), although they suggest that accuracy is poor for smaller sized individuals. Further experimentation into the usefulness of shadows is warranted, to distinguish the roles of camera position and orientation relative to fish, and substrate type. Where the morphology of background substrate is known, the relationships between camera position and the distance between fish and shadow detection can potentially provide a metric of fish position within the water column.

Automatic detection of species (or morphospecies) using multi-class models will be an important future step in improving the value of automating acoustic monitoring. At this stage, however, both manual and automatic species identification have proven problematic due to the nature of acoustic data (Martignac et al. 2015). High accuracy of manual species identification can occur when species have distinct morphological features (Martignac et al. 2015; Jones et al. 2021), and automation should also be successful for species that show clear morphological differences. Automatic species identification has been partially successful for eels (Zang et al. 2021), but attempts for other types of fish have had limited success (Rogers et al. 2004; Jones et al. 2021). Automatic species identification could be improved by analysing behavioural characteristics, such as tailbeat frequencies, which have been used successfully in manual analysis of acoustic data for species identification (Kang 2011; Helminen et al. 2021). Other behaviours such as swimming speed and feeding activities could also be investigated to improve automation and combined with length data where only particular species are known to attain sizes above certain limits. Sequential non-maximum suppression (SeqNMS), an object tracking method where the model examines neighbouring images in a video to improve the accuracy of detection, has been used to automatically detect the direction and speed of fish in underwater videos (Lopez-Marcano et al. 2021). SeqNMS could also prove useful for acoustic data for species identification. The unique grazing behaviour exhibited by fish in seagrass has also been automatically tracked (Ditria et al. 2021), and with some refinement to the model, this may also be a useful method to detect feeding behaviours to differentiate among species in acoustic data, so long as seagrass does not adversely affect quality of shadows or the acoustic imagery overall.

Apart from DL methods, automating and semi-automating the analysis of acoustic data has been performed using classic machine learning techniques. Commercially available software called Echoview allows users to semi-automate acoustic data analysis through training of predefined algorithms (Boswell et al. 2008). Some applications that demonstrated a reduction in analysis time or successful semi-automation of the process using Echoview include the tracking of migrating fish (Kang 2011; Helminen and Linnansaari 2021), counting fish (Kang 2011; Eggleston et al. 2020), and monitoring behaviour such as tailbeat frequencies (Mueller et al. 2010; Kang 2011; Helminen et al. 2021). Other studies have demonstrated the usefulness of more ‘traditional’ ML techniques using statistics and/or advanced algorithms for classifying, counting, and sizing of fish using acoustic data (Han et al. 2009; Bothmann et al. 2016; Jing et al. 2017; Lawson et al. 2019). We suggest that because all of these methods require specialised statistical skills for each new application, a DL model, once evaluated and performing reliably, will be an easier method for scientists to apply. Even the problematic detections of very small objects in the current study might in future be overcome as resolution of imagery from acoustic cameras continues to improve. The DIDSON is already being superseded, for example, by imaging sonar that can operate at 3 MHz, improving resolution at shorter ranges. In terms of the amount of effort required for training, in the order of several thousand annotations typically will be required to achieve suitable model performance (Sheaves et al. 2020). While the accuracy required for applied automation varies with study objectives, generally F1 scores above 0.8 are considered useful, and above 0.9 very good.

Our model performed well using direct, shadow, and the combination of these detections; however, we acknowledge that our study used a limited database with low densities of fish present in each image (typically 3 or fewer). Even for manual counting of acoustic imagery, higher densities can render counts unreliable (Horne 2000), and dense schooling behaviour makes automated tracking of individual fish difficult (Handegard and Williams 2008; Lopez-Marcano et al. 2021). Despite the challenges, high densities of fish are common in any form of video imagery and higher densities of fish in acoustic images should be included in training and testing to improve the usability of the model. Post-processing steps such as varying confidence thresholds, the use of SeqNMS, and statistical adjustment equations could assist in overcoming the issue of individuals obscuring other individuals in acoustic imagery, as has been demonstrated in underwater videos (Connolly et al. 2021).

We have shown that using a DL technique such as CNN can automatically detect fish in acoustic data and has the potential to substantially improve the efficiency of acoustic data analysis. For the short videos analysed here, with relatively low fish abundances, manual extraction of MaxN data took a fish expert on average 1.8 min per min of video (SE 0.16). Computer estimates of MaxN took about half the time in the current study, at 0.95 min per min of video (no variation, so no SE). Much faster computer speeds are possible, however, using parallelisation; e.g. using two servers in parallel doubles the processing speed. We have also highlighted the usefulness of acoustic shadow detections in DIDSON data to improve model accuracy for counting of fish. This method is suitable for fisheries-independent monitoring of exploited species to inform fisheries stock assessments, and quantifying use of fish passageways when fish densities are low. The approach needs further investigation at higher fish densities and for species identification.