Background

Understanding the movement and behaviour of animals in their natural habitats is the ultimate goal of behavioural and movement ecology. By situating our studies in the natural world, we have the potential to uncover processes of selection acting on behaviour in natural populations. The ongoing advance of animal tracking and biologging brings the opportunity to revolutionize not only the scale of data collected from wild systems, but also the types of questions that can subsequently be answered. Incorporating geographical data has already given insights, for example, into the homing behaviour of reef fish, the migratory patterns of birds, and the breeding site specificity of sea turtles [1–3]. Great advances in systems biology have further been made through the study of movement ecology, for example in understanding the decision-making processes at play within primate groups manoeuvring through difficult terrain, or the collective sensing of birds traversing their physical environment [4, 5]. Unravelling these aspects of animal movement can vastly improve management strategies [6, 7], for example through the creation of protected areas that incorporate bird migratory routes [8] or the reduction of by-catch with dynamic habitat-usage models for marine turtles [9].

Yet the application of techniques that meet the challenges of working in naturally complex environments is not straightforward, with practical, financial, and analytical issues often limiting the resolution or coverage of the data gathered. This is especially true in aquatic ecosystems, where approaches such as Global Positioning System (GPS) tags allow only sparse positioning of animals that surface intermittently, or Pop-up Satellite Archival Tags (PSATs), which integrate surface positions with logged gyroscope and accelerometer data to estimate the movement of larger aquatic animals [10, 11]. Not only does the spatial resolution of the respective tracking systems (currently e.g. 4.9 m for GPS) limit the possibilities of fine-scale behavioural analyses, but these systems also exclude almost all animals below a certain size class [12]. They further require animals to be captured and equipped with tags that should not exceed 5% of the animal's weight [13], restricting current-generation GPS tags and PSATs to larger animals. This is problematic because in aquatic ecosystems, as in terrestrial systems, life is numerically dominated by small animals [14]. In contrast, ultrasonic acoustic telemetry is one methodology useful for the underwater tracking of smaller animals and those in larger groups [11, 15]. This approach, however, is limited to a stationary site by the positioning of the acoustic receivers, and the costs, maintenance, and installation of these systems preclude their effective use in the majority of systems and for many users. While acoustic tags are small enough for injection, even into smaller animals such as fish, the increased handling time associated with these invasive measures can cause additional stress for the animals, and the tag itself may disturb their natural behaviour [16]. Further, acoustic telemetry systems face accuracy problems, with average positional errors in the range of multiple meters, and depend strongly on the environment in which they are deployed (requiring e.g. low ambient noise or sufficient water depth) [17, 18]. Hence, approaches that facilitate the collection of behavioural data from smaller animals, from those in large groups, and from those in varied aquatic habitats are still lacking.

A lack of data becomes a fundamental problem if certain ecosystems, species, or habitat types are underrepresented in terms of adequate research, management, or discovery. Although the oceans constitute up to 90% of habitable ecosystems worldwide, as little as 5% has been explored [19–21]. Within the oceans, coastal inshore areas harbour the greatest species diversity, with approximately 80% of fish species (the most speciose group of vertebrates) inhabiting the shallow waters of the littoral zone [22]; these areas also provide over 75% of commercial seafood landings [23]. Coastal regions in both marine and freshwater environments are likewise those of greatest interest for eco-tourism, community fisheries, and industry, while simultaneously being most affected by habitat degradation, exploitation, and anthropogenic pollution [24–26]. Knowledge of coastal regions is essential for establishing sanctuaries and sustainable concepts of ocean preservation [27], and movement data plays a vital role in this process, insofar as it gives detailed information about the location, preferred habitat and temporal distribution of organisms [1–3]. Yet for reasons of animal size, species abundance, and habitat complexity, most available tracking methods are poorly suited to these inshore regions.

Application of appropriate tracking and behavioural analysis techniques in a flexible, accessible, and broadly applicable manner would alleviate these limitations in systems and species coverage, improving capacity for conservation, management, and scientific understanding of natural systems across scales and conditions. In pure research terms, the application of quantitative behavioural and movement analyses in natural settings would also help bridge the gap between quantitative lab-based research and often qualitative field-based research. Recent advances in computational analysis of behaviour [28, 29] may then be employed in field settings, vastly improving our understanding of behaviour and movement in aquatic ecosystems.

Here we present an open-source, low-cost approach based on consumer-grade cameras to quantify the movement and behaviour of animals of various sizes in coastal marine and freshwater ecosystems. Our approach integrates two methodologies from the field of computer vision: object detection with deep neural networks and Structure-from-Motion (SfM). Object detection has been successfully employed in terrestrial systems for animal localization, yielding high-resolution movement data from e.g. drone-based videos across broad environmental contexts [30]. While such aerial approaches may also be used in some aquatic systems, they are limited to extremely shallow water and large animals [31]. The approach we advocate allows data to be collected on any animal that can be visualized with cameras, enabling its application to smaller fish, invertebrates, and other aquatic animals. In addition to providing animal trajectories, video-based observations also incorporate environmental data, adding the possibility to study interactions of mobile animals with their natural habitat [4]. Our approach synthesizes object detection and SfM into a coherent framework that can be deployed in a variety of systems without domain-specific expertise. SfM is commonly used for 3D environmental reconstruction, photogrammetry, and camera tracking for visual effects in video editing [32, 33]; here it allows the reconstruction of 3D models of the terrain through which the animals move and with which they interact. Our open-source analysis pathway enables the subsequent calculation of movement, interactions, and postures of animals. Set-up costs can be as low as the price of two commonly available action cameras, and the proposed method can be taken into habitats that are otherwise explored by snorkeling, diving, or with the use of remotely operated underwater vehicles (ROVs). Analysis can be performed on local GPU-accelerated machines or on widely accessible computing services (e.g. Google Colaboratory). Overall, this method provides a low-cost approach for measuring the movement and behaviour of aquatic animals that can be implemented across scales and contexts.

Methods

Three datasets of varying complexity were used to demonstrate the versatility of the proposed method. These were chosen to range from single animals (Conger conger) and small heterospecific groups (Mullus surmuletus, Diplodus vulgaris) to schools of conspecific individuals (Lamprologus callipterus) under simple and complex environmental conditions, resulting in the datasets ’single’, ’mixed’ and ’school’, respectively. Moreover, we used a dataset of repeated trials (N = 4, ’accuracy’) to validate the accuracy of our tracking approach. This dataset was used to reconstruct the trajectories of a calibration wand of 0.5 m length and to examine the resulting tracking errors. The ’single’ and ’mixed’ datasets were created while snorkeling at the surface, using a stereo-camera set-up at STARESO, Corsica (Station de Recherches Sous-marines et Océanographiques). The remaining datasets were collected by SCUBA diving (5–8 m) with either multi- or stereo-camera set-ups in Lake Tanganyika (Tanganyika Science Lodge, Mpulungu, Zambia) or at STARESO. While the ’single’ and ’mixed’ datasets were recorded with untagged fish, for the ’school’ dataset we attached tags made of waterproof paper (8 × 8 mm) anterior to the dorsal fin of the individuals to facilitate detection and individual identification, although the latter was not implemented. See Table 1 for a summary of the collected datasets and the respective environmental conditions. For general guidelines and comments on the practical implementation of our method, refer to Additional file 7.

Table 1 Summary of acquired datasets

Automated animal detection and tracking

Since all data was collected in the form of videos, image-based animal detection was required for subsequent trajectory reconstruction and analyses. First, the videos from the stereo- or multi-camera set-ups were synchronized using a convolution of the Fourier-transformed audio signals to determine the video offsets. Second, the synchronized videos were tracked independently using an implementation of a Mask Region-based Convolutional Neural Network (Mask R-CNN) for precise object detection at a temporal resolution of either 30 Hz (’single’, ’mixed’ and ’accuracy’) or 60 Hz (’school’) [34, 35]. To this end, we annotated the contours of the fish (or of the tags in the case of the ’accuracy’ dataset) in a small subset of video frames to generate custom training datasets for each of the detection tasks. These subsets needed to be sufficiently diverse to represent the full videos for effective training and therefore had to cover most of the variation in contrast, lighting and animal poses. Our training sets contained 171 (’single’), 80 (’mixed’), and 160 (’school’) labeled images. For the ’accuracy’ dataset, we annotated a total of 73 images. We then trained Mask R-CNN models on these training sets using transfer learning from a model that was pre-trained on the COCO dataset (’Common Objects in Context’), which comprises more than 200K labeled images and 80 object classes [35, 36]. Here, transfer learning refers to a machine learning concept in which information gained from learning one task is applied to a different, yet related, problem [37]. Accordingly, the state of Mask R-CNN, previously trained on COCO, was fine-tuned to our specific problems of identifying fish or tags. The original image resolutions of 2704 × 1520 px (’single’ and ’school’) and 3840 × 2160 px (’mixed’ and ’accuracy’) were downsampled to a maximum width of 1024 px during training and prediction to improve performance. After training, the models were able to accurately detect and segment the observed animals, which was visually confirmed with predictions on validation datasets.
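As a concrete illustration of the synchronization step, the offset between two recordings can be estimated from their audio tracks via FFT-based cross-correlation. The sketch below is a minimal example, assuming the audio has been extracted to WAV files beforehand (e.g. with ffmpeg); the file names and helper functions are illustrative, not part of the published pipeline.

```python
# Minimal sketch: estimate the offset between two videos from their audio.
# Assumes WAV files extracted beforehand, e.g. with ffmpeg.
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

def load_audio(path):
    rate, data = wavfile.read(path)
    data = data.astype(np.float64)
    if data.ndim > 1:                    # keep one channel of stereo tracks
        data = data[:, 0]
    return rate, data - data.mean()      # remove DC offset before correlating

def estimate_offset_samples(a, b):
    """Lag maximizing the cross-correlation, so that a[n + lag] ~ b[n]."""
    xcorr = fftconvolve(a, b[::-1], mode="full")   # correlation via FFT
    return int(np.argmax(xcorr)) - (len(b) - 1)

rate_a, a = load_audio("cam_left.wav")
rate_b, b = load_audio("cam_right.wav")
assert rate_a == rate_b, "audio tracks must share one sample rate"
lag = estimate_offset_samples(a, b)
# A positive lag means trimming `lag` audio samples (lag / rate_a seconds
# of video) from the start of the first recording to align the two.
print(f"offset: {lag / rate_a:+.3f} s")
```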

The predicted masks were either used to estimate entire poses of the tracked animals (’single’, ’mixed’) or to calculate centroids of the tags or calibration wand ends in the case of the ’school’ and ’accuracy’ datasets. Established morphological image processing was used to skeletonize the Mask R-CNN predictions, producing a 1 px midline for each of the detected binary masks. A fixed number of points was equidistantly distributed along these midlines as an estimate of the animals’ spine poses. Both the spine points and the tag centroids represent pixel coordinates of detected animals in further data processing. Partitioned trajectories were generated from the detections using either simple nearest-neighbour matching between subsequent frames or a cost-minimization algorithm (the Hungarian method [38]), combined with filtering for linear motion over a short time window, reducing subsequent quality control and manual track identification for continuous trajectories to a minimum. For video and image annotation, trajectory and pose visualization, manual track corrections and other trajectory utility functions, we developed an in-house GUI based on Python and Qt5 (’TrackUtil’, Additional file 4). The code for Mask R-CNN training and inference, video synchronization, fish pose estimation and automatic trajectory assignment is also available (Additional files 5 and 6). The training and tracking details are summarized in Table 2.
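To make the pose-estimation step concrete, the following is a minimal sketch of reducing one predicted binary mask to equidistant spine points. The greedy nearest-neighbour ordering of skeleton pixels is a simplifying assumption for unbranched midlines (as produced by elongated fish masks), not necessarily the exact published implementation.

```python
# Minimal sketch: 1 px skeleton of a binary mask -> n equidistant spine points.
import numpy as np
from skimage.morphology import skeletonize

def spine_points(mask, n_points=5):
    skel = skeletonize(mask.astype(bool))
    ys, xs = np.nonzero(skel)
    coords = np.column_stack([xs, ys]).astype(float)
    # Greedy nearest-neighbour walk to order the midline pixels, starting
    # from the pixel farthest from the skeleton's centroid (an end point).
    start = int(np.argmax(np.linalg.norm(coords - coords.mean(axis=0), axis=1)))
    path = [coords[start]]
    remaining = [p for i, p in enumerate(coords) if i != start]
    while remaining:
        dists = [np.linalg.norm(p - path[-1]) for p in remaining]
        path.append(remaining.pop(int(np.argmin(dists))))
    path = np.array(path)
    # Cumulative arc length along the midline, then n equidistant samples.
    seg = np.linalg.norm(np.diff(path, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])
    targets = np.linspace(0.0, arc[-1], n_points)
    return np.column_stack([np.interp(targets, arc, path[:, i]) for i in (0, 1)])
```

The returned n × 2 array holds (u, v) pixel coordinates along the midline; head-to-tail orientation would still need to be resolved separately, e.g. from the mask shape.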

Table 2 Dataset parameters and accuracy metrics

Structure from motion

The field of computer vision has developed powerful techniques that have found applications in vastly different fields of science [39–41]. Structure-from-Motion (SfM) is one such method; it addresses the large-scale optimization problem of retrieving three-dimensional information from planar images [42]. This approach relies on a static background scene, in which stationary features can be matched by observing them from different perspectives. This results in a set of images in which feature-rich key points are first detected and subsequently used to compute a 3D reconstruction of the scene and the corresponding viewpoint positions. As shown in Eqs. (1) and (2), a real-world 3D point M (consisting of x, y, z) can be projected to the image plane of an observing camera by multiplying the camera's intrinsic matrix K (consisting of focal lengths fx, fy and principal point cx, cy) with the camera's joint rotation-translation matrix [R|t] and M, resulting in the corresponding image point m (consisting of pixel coordinates u, v, scaled by s) [43]. By extension, this can be used to resolve the ray casting from a camera position towards the actual 3D coordinates of a point, given the 2D image projection of that point and known camera parameters. Due to this projective geometry, it is not possible to infer from a single perspective at which depth a point is positioned along its ray. SfM circumvents this problem by tracking mutually observed image points (m) across images from multiple camera viewpoints. As a result, the points can be triangulated in 3D space (M), representing the optimal intersections of their respective rays cast from the camera positions towards them. By minimizing reprojection errors, which are the pixel distances between the 3D points' reprojections onto the image planes and their original image coordinates (u, v), SfM is also able to numerically solve the multi-view system of the cameras' relative rotation (R), translation (t) and intrinsic (K) matrices and to retrieve the optimal camera distortion parameters (d).

$$ m' = K[R|t] M' $$
(1)
$$ s \left[\begin{array}{c} u \\ v \\ 1 \\ \end{array}\right] = \left[\begin{array}{ccc} f_{x} & 0 & c_{x} \\ 0 & f_{y} & c_{y} \\ 0 & 0 & 1 \\ \end{array}\right] \left[\begin{array}{cccc} r_{11} & r_{12} & r_{13} & t_{1} \\ r_{21} & r_{22} & r_{23} & t_{2} \\ r_{31} & r_{32} & r_{33} & t_{3} \\ \end{array}\right] \left[\begin{array}{c} x \\ y \\ z \\ 1 \\ \end{array}\right] $$
(2)
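For intuition, Eqs. (1) and (2) can be evaluated numerically in a few lines; the camera parameters below are made-up values for illustration only.

```python
# Toy example of Eqs. (1) and (2): projecting one world point M to pixels m.
import numpy as np

K = np.array([[1200.0, 0.0, 960.0],     # fx, 0, cx
              [0.0, 1200.0, 540.0],     # 0, fy, cy
              [0.0, 0.0, 1.0]])
R = np.eye(3)                           # camera aligned with the world axes
t = np.array([[0.0], [0.0], [2.0]])     # camera translation
Rt = np.hstack([R, t])                  # joint rotation-translation [R|t]

M = np.array([0.3, 0.1, 1.0, 1.0])      # homogeneous world point (x, y, z, 1)
m = K @ Rt @ M                          # s * (u, v, 1)
u, v = m[:2] / m[2]                     # divide out the scale s
print(f"u = {u:.1f} px, v = {v:.1f} px")  # u = 1080.0 px, v = 580.0 px
```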

Here, SfM was incorporated into the data processing in order to obtain exact camera positions, using the general-purpose, open-source pipeline COLMAP [44, 45]. The synchronized videos were resampled as image sequences at a rate of 3 Hz. In the case of the ’mixed’ dataset, we removed frames that were recorded while the cameras were stationary. The resulting image sequences served as input to the reconstruction process, during which the cameras were calibrated (K, d) and the relative extrinsic parameters (R, t) computed, so that all camera projections relate to a shared coordinate system. Every input image resulted in a corresponding position along the reconstructed 3D camera path of the recording, with the number of images determining the temporal resolution of the resolved camera motion. Since only a subset of all video frames was used for the reconstructions, SfM optimized a smaller number of parameters, resulting in a reduced computational load. Additionally, this could improve reconstruction accuracy, as the images still had sufficient visual overlap but larger angles between viewpoints. Finally, the retrieved camera parameters were interpolated (the translations t linearly, the rotations R using Slerp, spherical linear interpolation [46]) to match the acquisition rate of the animal tracking, ensuring that reference camera parameters exist for each recorded data point by simulating a continuous camera path.
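This interpolation step can be sketched, for example, with SciPy's Slerp implementation; the pose arrays below are random stand-ins for parsed COLMAP output.

```python
# Minimal sketch: upsample a 3 Hz SfM camera path to the 30 Hz tracking rate.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

t_sfm = np.arange(0.0, 10.0, 1.0 / 3.0)      # 3 Hz reconstruction timestamps
rots = Rotation.random(len(t_sfm))           # stand-ins for SfM rotations R
trans = np.random.rand(len(t_sfm), 3)        # stand-ins for translations t

t_track = np.arange(t_sfm[0], t_sfm[-1], 1.0 / 30.0)   # 30 Hz query times
rots_interp = Slerp(t_sfm, rots)(t_track)    # spherical linear interpolation
trans_interp = np.column_stack(
    [np.interp(t_track, t_sfm, trans[:, i]) for i in range(3)]  # linear
)
# rots_interp.as_matrix() and trans_interp now provide per-frame (R, t).
```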

Reconstruction of animal trajectories

When tracking moving animals with non-stationary cameras, it is necessary to resolve the camera motion, since this motion is also represented in the pixel-coordinate trajectories of the animals. With the camera information (K, d) and relative perspective transformations (R, t) for entire camera paths retrieved from SfM, as well as the multi-view animal trajectories from the Mask R-CNN detection pipeline, a triangulation approach similar to SfM can be used to compute 3D animal trajectories. Here, an animal's pixel coordinates represent m (consisting of u and v) observed from more than one known viewpoint (R, t), so the animal's 3D positions M (x, y, z) can be triangulated. Positions of animals observed in exactly two cameras were triangulated using an OpenCV implementation of the direct linear transformation algorithm, while positions of animals observed in more than two cameras were triangulated using singular value decomposition, following an OpenCV implementation [39, 43]. Additionally, positions of animals temporarily observed in only one camera were projected to the world coordinate frame by estimating the depth component as an interpolation of previous triangulation results. Through the recovered camera positions, the camera motion is nullified in the resulting 3D trajectories. Thus, they provide the same information as trajectories recorded with a fixed camera set-up (Fig. 1). The animal trajectories and the corresponding reconstructions were scaled so that the distances between the reconstructed camera locations equal the actual distances within the multi-view camera set-up. As a result, all observations are represented on a real-world scale. The code for trajectory triangulation, camera path interpolation and visualization is bundled in a Python module (’multiviewtracks’), accessible on GitHub [47].
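For the two-camera case, a minimal sketch of this triangulation using OpenCV's DLT implementation (cv2.triangulatePoints) is given below. The camera poses and pixel coordinates are synthetic placeholders; in practice, the tracked pixel coordinates would first be undistorted (e.g. with cv2.undistortPoints) using the distortion parameters d recovered by SfM.

```python
# Minimal sketch: triangulate tracked pixel coordinates from two views.
import cv2
import numpy as np

K = np.array([[1200.0, 0.0, 960.0], [0.0, 1200.0, 540.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])           # camera 1 at origin
R2 = cv2.Rodrigues(np.array([0.0, 0.1, 0.0]))[0]            # slight y-rotation
P2 = K @ np.hstack([R2, np.array([[-0.6], [0.0], [0.0]])])  # 0.6 m baseline

# Tracked, undistorted pixel coordinates of one animal in both views (2xN).
pts1 = np.array([[980.0], [520.0]])
pts2 = np.array([[1702.0], [519.0]])

M_h = cv2.triangulatePoints(P1, P2, pts1, pts2)  # 4xN homogeneous coordinates
M = (M_h[:3] / M_h[3]).T                         # Nx3 world coordinates (x, y, z)
print(M)
```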

Fig. 1

Schematic workflow. Data processing starts with the acquisition of synchronized, multi-view videos, which serve as input to the SfM reconstruction pipeline to recover camera positions and movement. In addition, Mask R-CNN predictions, after training the detection model on a subset of images, yield segmented masks for each video frame, from which animal poses can be estimated. These serve as locations of multi-view animal trajectories in the pixel coordinate system. Subsequently, the trajectories can be triangulated using the known camera parameters and positions from the SfM pipeline, yielding 3D animal trajectories and poses. Integrated with the environmental information from the scene reconstruction, these data can be used for in-depth downstream analyses

Accuracy estimation

Given that the proposed method incorporates out-of-domain and novel approaches from computer vision, reliable accuracy measures are required. Therefore, a ground-truth experiment (’accuracy’ dataset) was conducted in which two points at a fixed distance from each other, the colored end points of a clear, rigid calibration wand (0.5 m), were filmed underwater over various backgrounds using four cameras. In total, four repeated trials were incorporated into the accuracy estimation, varying in environmental complexity (such as Posidonia seagrass beds, large rock formations or sand), depth and lighting conditions. The trajectories of both calibration wand end points were reconstructed throughout the four trials, which enabled the calculation of a per-frame tracking error. Since the ground-truth distance (\(\hat{d}\)) between the two 3D positions is known from the wand length, this tracking error can be calculated for each frame as the difference between the reconstructed distance of the two 3D positions (d) and the actual distance (\(\hat{d}\)). Additionally, since the cameras were arranged in a fixed, multi-view set-up, the same calculation can be performed on the known camera-to-camera distances (\(\hat{d}\)) within the array and their reconstructed 3D positions (with respective distances d) to assess the errors of the SfM reconstructions. A third measure of accuracy can be calculated as the reprojection error of triangulated trajectory points. Here, the 3D points are projected back onto the image planes of their respective viewpoints, resulting in pixel coordinates for each 3D point and observing camera. The distance of these pixel coordinates to the tracked pixel coordinates m (consisting of u and v) is the reprojection error. This is the error used by SfM for the numeric optimization of the multi-view system, the camera parameters and the scene's 3D point cloud, and it can similarly be used to estimate the precision of the acquired trajectories. We calculated the median errors and the standard deviations of the errors, i.e. the root-mean-square errors (RMSEs, Eq. (3)), for all datasets and for each of the three accuracy metrics where applicable. In the case of the ’accuracy’ dataset, we calculated the mean and standard deviation of the accuracy metrics across the four trials.

$$ RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(\hat{d} - d_{i}\right)^{2}} $$
(3)
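In practice, this per-frame error and its RMSE reduce to a few lines of array arithmetic. The sketch below uses synthetic wand-end trajectories (two N × 3 arrays standing in for triangulated positions), so the errors are exactly zero by construction.

```python
# Minimal sketch: per-frame tracking error and RMSE (Eq. 3) for the wand.
import numpy as np

WAND_LENGTH = 0.5                                  # ground-truth distance (m)

rng = np.random.default_rng(0)
end_a = rng.random((100, 3))                       # stand-in wand-end positions
end_b = end_a + np.array([WAND_LENGTH, 0.0, 0.0])  # exactly 0.5 m apart here

d = np.linalg.norm(end_a - end_b, axis=1)          # reconstructed distances d
errors = WAND_LENGTH - d                           # per-frame tracking errors
rmse = np.sqrt(np.mean(errors ** 2))               # Eq. (3)
print(f"RMSE: {rmse:.4f} m, median error: {np.median(errors):.4f} m")
```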

Results

Here we combined Mask R-CNN-aided animal detection and tracking with SfM scene reconstruction and the triangulation of 3D animal trajectories to obtain high-resolution data directly from videos taken while snorkeling or diving in the field. Using this method, we were able to track freely moving aquatic animals in their natural habitats without the installation of infrastructure.

To ground-truth our method, we performed an accuracy estimation for the four trials of the ’accuracy’ dataset. Using our approach, we were able to retrieve both the 3D positions of the tracked calibration wand and the 3D trajectories of the cameras throughout the trials (Fig. 2). The mean trajectory coverage was 80.64 ±16.73% when only multiple-view triangulation was used, or 97.29 ±2.20% when projections from single views were also used to estimate trajectory positions at an interpolated depth component. This resulted in a total of 19482 frames in which both wand ends were detected (or 26562 with the additional single-view projections). The known camera-to-camera distances within the camera array (0.6 m) and the known length of the calibration wand (0.5 m) allowed the calculation of the respective per-frame reconstruction and tracking errors. The resulting RMSE for the camera-to-camera distances was 1.34 ±0.79 cm (median error -0.14 ±0.06 cm). The errors for the calibration wand length differed when calculated for only the multi-view triangulated trajectories (RMSE 1.09 ±0.47 cm, median error 0.14 ±0.33 cm) or for trajectories including single-view projections (RMSE 2.12 ±1.37 cm, median error 0.28 ±0.32 cm). Further, we projected the triangulated 3D positions back onto the original videos and computed the reprojection error as an RMSE of 8.56 ±5.21 px (median error 3.53 ±1.96 px). This was done only for the multi-view triangulations, since the reprojection of a point that was itself projected from a single view is, by definition, the same point (with a potentially misleading error of 0 px).

Fig. 2

Accuracy validation. Top-down view of one of the ’accuracy’ dataset trials with the COLMAP dense reconstruction in the background (left). A calibration wand with a length of 0.5 m was moved through the environment to create two trajectories with known per-frame distances (visualized as lines at a frequency of 3 Hz; the full temporal resolution of the trajectories is 30 Hz). This allowed the calculation of relative tracking errors as the difference of the triangulated calibration wand end-to-end distance from its known length of 0.5 m, resulting in the shown error distribution (normalized histogram with probability density function, right). The per-frame tracking error is visualized as line color

Trajectories were successfully obtained from large groups (’school’), small groups (’mixed’) and single individuals (’single’). Due to the specific design of Mask R-CNN for instance segmentation, the network architecture was able to distinguish the given object classes from the background and to resolve partial occlusions. However, differences in data acquisition across these datasets resulted in varying trajectory coverage: 97.79%, 69.60% and 78.34% for the ’single’, ’mixed’ and ’school’ datasets, respectively. When single-view projections were also included in the animal trajectories, the trajectory coverage increased to 100.00% (’single’) and 94.02% (’school’). Additionally, the camera positions and the corresponding environments through which the animals were moving were reconstructed. In the case of the ’single’ and ’mixed’ datasets, the Mask R-CNN detection results were used to estimate fish body postures in 3D space by inferring spine points from the segmented pixel masks (Fig. 3). We computed the RMSEs of the camera-to-camera distances (1.28 cm ’single’, 1.28 cm ’mixed’ and 0.15 cm ’school’) and the reprojection errors (20.97 px ’single’, 7.77 px ’mixed’ and 6.79 px ’school’) to assess the overall quality of the SfM reconstructions, analogously to the calculation of reconstruction errors for the ’accuracy’ dataset. The results of the accuracy estimations and the respective median errors are listed in Table 2.

Fig. 3

3D environments and animal trajectories. a Top-down view of the ’single’ dataset result. Red lines and dots show estimated spine poses and head positions of the tracked European conger (C. conger, visualized with one pose per second). The point cloud resulting from the COLMAP reconstruction is shown in the background. b Trajectories of M. surmuletus (orange) and D. vulgaris (purple/blue), and the dense point cloud resulting from the ’mixed’ dataset. Dots highlight three positions per second; lines visualize the trajectories at full temporal resolution (30 Hz) over a duration of seven minutes. c Reconstruction results and trajectories of the ’school’ dataset, visualizing the trajectories of a small school of L. callipterus in Lake Tanganyika. See Additional files 1, 2, 3 for high-resolution images

Discussion

Here we demonstrate a novel approach for collecting highly resolved 3D information on animal motion, including interactions with the physical environment, in aquatic ecosystems. Although it is based on relatively advanced computational techniques, the open-source workflow we present requires little domain expertise and can be implemented with low-cost, consumer-grade cameras. The incorporation of these methods into an accessible framework will allow quantitative analyses of animal behaviour and ecology across systems, scales, and user groups, and can even be modified for use in terrestrial systems. Our approach allows data collection while swimming, snorkelling, or with the aid of ROVs, making it appropriate for general usage with minimal investment in infrastructure, equipment, or training. Although the analyses are computationally demanding, they can be run on an average GPU or on free cloud-based computing services. A lack of high-end hardware therefore does not preclude any of the steps required for this method.

Many alternative techniques for tracking small aquatic animals do exist; however, they often carry the considerable drawbacks of tagging and handling the animals, or of high infrastructure costs. This is a major barrier to implementation when animals are protected, difficult to catch, or too small to carry tags. In many marine protected areas all three of these factors apply, meaning that many existing approaches are inappropriate. Some of these drawbacks will be alleviated, for instance by improvements in telemetry-based approaches [48] that reduce tag size and increase range. Nevertheless, these techniques cannot simultaneously measure or reconstruct local topography and environmental factors. Although we do not provide analyses of environmental structure here, the topographical information collected with our approach can be used directly to answer questions on e.g. habitat segmentation and environmental complexity [49, 50].

In highly complex social environments, encounters with numerous con- and heterospecifics can strongly affect behaviour and motion [51]. Approaches that rely on tagging will unavoidably miss or under-sample these interactions, because not all individuals can ever be tagged in wild contexts. In contrast, our approach requires neither handling nor tagging of animals, nor does specialized equipment need to be deployed in the desired tracking area. Moreover, because the object detection and segmentation approach can take any image input, it is not tied to one particular animal form or visual scene. Our approach can therefore be used, within certain limits, even in demanding situations such as high turbidity or low light. While it has a lower spatial range than telemetry, underwater filming is an unobtrusive alternative that offers higher spatial resolution when small animals move over small areas, or when animals are highly site-specific, for example damselfish or cichlids living in close association with coral or rocky reefs [52, 53].

While our approach offers many benefits in terms of applicability and data acquisition, it also has some limitations. From the accuracy tests it became apparent that the tracking accuracy dropped noticeably in cases where the background was composed of moving objects, such as macrophytes or debris. The SfM approach requires the reconstructed components to be feature-rich and static, because environmental key points are assumed to have the same location over time. Moving particles and objects will result in higher reconstruction errors, rendering our approach problematic when, for example, the filmed animals occupy most of the captured images, as with very large fish schools. Complex environments, occlusions of the animals and highly variable lighting conditions are detrimental to the detectability of animals with Mask R-CNN. Observations at greater depths may face similar problems due to the high absorption of red light, although in this case detectability could be improved through image-restoration approaches such as Sea-Thru [54]. Similarly, water turbidity can greatly reduce detectability in aquatic systems by absorbing and scattering light. Therefore, although this forfeits the benefit of measuring animal behaviour non-invasively, it can be advantageous to add clearly visible tags to the animals in cases of high turbidity, ensuring continuous tracking of all individuals.

Another aspect that needs consideration is that data acquisition is confined to the area captured by the multi-camera set-up. Animal trajectories cannot be triangulated if individuals leave this area and are therefore no longer visible from at least two camera viewpoints. This circumstance is apparent in the ’school’ example, in which one individual left and re-entered the scene, leading to a discontinuity in its trajectory. To compensate for this potential limitation in trajectory coverage, trajectory points can also be estimated from single-view detections by projecting them from that viewpoint to an interpolated depth. However, this is only possible when filming from above and for animals that do not drastically change their distance to the camera (otherwise the estimated depth component would likely be erroneous). As a consequence, we report the lowest trajectory coverage for the ’mixed’ dataset (69.60%), in which we filmed the animals with a semi-stationary tripod and isometric camera angles. Considering the temporal resolution of 30 Hz, this still resulted in a relatively high average detection rate of approximately 21 detections per second. Further, the ’accuracy’ dataset demonstrated that although single-view projections can increase trajectory coverage substantially (from 80.64 ±16.73% to 97.29 ±2.20%), they come at a moderate accuracy cost (the calibration wand length RMSE increased from 1.09 ±0.47 to 2.12 ±1.37 cm).
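A sketch of this single-view fallback is given below, assuming the x_cam = R · x_world + t camera convention and a linearly interpolated camera-frame depth; all numeric values and names are placeholders rather than the exact pipeline implementation.

```python
# Minimal sketch: back-project a tracked pixel along its viewing ray and
# place it at a depth interpolated from neighbouring multi-view results.
import numpy as np

def project_single_view(uv, K, R, t, depth):
    """World coordinates of pixel (u, v) at a given camera-frame depth."""
    ray_cam = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])  # ray with z = 1
    return R.T @ (depth * ray_cam - t)       # camera frame -> world frame

K = np.array([[1200.0, 0.0, 960.0], [0.0, 1200.0, 540.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)                # placeholder camera pose from SfM
# Depth at the gap frame (1.5 s), interpolated between the camera-frame
# depths of the multi-view triangulations before (1.0 s) and after (2.0 s).
depth = np.interp(1.5, [1.0, 2.0], [2.1, 2.3])
print(project_single_view((980.0, 520.0), K, R, t, depth))
```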

The estimation of 3D animal poses relies strongly on accurate detections and can therefore be compromised by poorly estimated animal shapes during Mask R-CNN segmentation. In such cases, a less detailed approximation of the animals’ positions, such as the mask centroid, is preferable and can still be employed reliably, as showcased with the ’school’ dataset. The errors in estimating animal locations and poses can be partially explained by marginal detection errors of Mask R-CNN, but also by inaccuracies derived from trajectory triangulation using the SfM camera positions.

Aware of these error sources, users of our proposed method can incorporate accuracy metrics such as reprojection errors or relative camera-reconstruction RMSEs into their own analytical pathways. This enables an assessment of the overall reconstruction quality and of whether it meets the fine-scale resolution required for the specific scientific demands. We were able to demonstrate with the ’accuracy’ dataset that the combination of SfM and object detection yields highly accurate trajectories of moving objects over large spatial scales (RMSE tracking error of 1.34 ±0.79 cm, median error -0.14 ±0.06 cm, reconstructed areas up to 500 m²) without prior manipulation of the underwater environment. Since these accuracy calculations are based on per-frame deviations from known distances, such as the length of a calibration wand or the camera-to-camera distances in a stereo-camera set-up, they are not suited to assessing large-scale SfM accuracy. However, rigorously ground-truthing SfM is of general interest in the field of computer vision, and various benchmarks showcase the high precision of 3D reconstructions that can be achieved using current SfM pipelines [55, 56].

An additional requirement of our approach is the need to annotate images and train object-detection networks. Further, manual correction of false trajectory assignments and overall quality control are required, but these can be reduced to a minimum with adequately sized training sets and the resulting precise Mask R-CNN predictions. Reliable, automatic identification of unmarked individuals in large animal groups has recently become possible under laboratory conditions [57], and the future development and increasing robustness of similar methods might also make them viable for field observations. At present, however, these tasks represent an additional, mainly initial, time investment that is likely to be repaid by the time subsequently saved through high-throughput behavioural analyses of the acquired, highly resolved animal trajectories. For example, this allows the classification of behavioural states by quantifying the behavioural repertoire of the animals using unsupervised machine learning techniques [28, 58]. The incorporation of 3D trajectory data into motion analyses has already improved the understanding of the phenotype and development of animal behaviours [59]. In addition, 3D pose estimation can now be achieved for wild animals, enabling exact reconstruction of the entire animal [60]. There has been a shift in how animal movement is analyzed in light of computational ethological approaches [61–63], with patterns of motion able to be objectively disentangled, revealing the underlying behavioural syntax to the observer. Automated approaches based on video, or even audio, recordings may also overcome the sensory limitations of other systems, allowing a better understanding of the sensory umwelt of study species [64] and facilitating novel experimental designs [61, 65] that can tackle questions of the proximate and ultimate causation of behaviour [60, 62, 63]. These methods are gaining interest and contrast sharply with the traditional approach of trained specialists creating behavioural ethograms, but the two can usefully be combined and compared to gain further insight into the structure of animal behaviour, potentially generating a more objective and standardized approach to the field of behavioural studies [63].

To help incorporate these novel techniques into more natural scenarios, we present a complete tracking pipeline that guides the user through each step after the initial field observation. From video synchronization, object annotation and detection to the final triangulation of animal trajectories, we provide a set of open-source utilities and scripts. Although we rely heavily on other open-source projects (COLMAP for SfM and Mask R-CNN for object segmentation), these specific approaches can be replaced with other implementations by solely adapting the respective input and output data formats to specific needs. We found COLMAP and Mask R-CNN easy to employ, as they are well documented, performant and purpose-oriented. However, many alternatives exist for both SfM and object detection, and the general approach of our pipeline is not limited to any particular implementation, thus future-proofing the approach as new and better methods are developed.

Conclusions

Computational approaches for analyzing behaviour, including automated tracking of animal groups, deep learning, and supervised and unsupervised classification of behaviour, are areas of research that have been extensively developed under laboratory conditions over the past decade. These techniques, in combination with sound evolutionary and ecological theory, will characterize the next generation of breakthroughs in behavioural and movement science, yet they are still difficult to apply in natural contexts and remain unobtainable for many researchers due to implementation and infrastructure costs. Here we present a framework that enables the use of these cutting-edge approaches in aquatic ecosystems, at low cost and for users of different backgrounds. Our proposed tracking method is flexible in both its conditions of use and the study species being examined, vastly expanding our potential for examining non-model systems and species. In combination with the genomic revolution, which allows sequencing in a matter of days, state-of-the-art behavioural sequencing under field conditions will transform the fields of movement ecology and evolutionary behavioural ecology. The approach we advocate here can further integrate the study of wild animal behaviour with modern techniques, facilitating an integrative understanding of movement in complex natural systems.