Advances in biologging technologies and computing power have enabled biologists to pry into the daily existence of many difficult-to-observe animals [1, 2]. A powerful new approach is to create ethograms from accelerometers using machine learning [3]. Accelerometers measure the inertial acceleration of an animal while it is moving, most commonly on three axes [4]. Unique combinations of these three axes over a period of time identify specific movements that correspond to a single behaviour or series of behaviours. Binary classes of behaviour can be identified with a high degree of accuracy using machine learning, e.g. prey captures in penguins using support vector machines (SVMs) [5]. A variety of machine learning algorithms have been used to distinguish between multiple classes of behaviours, some successfully [e.g. 6–10] and some with less success [e.g. 11]. There are a number of reasons why machine learning methods may not be able to classify data from accelerometry accurately, including the number of categories to be classified [12], the duration of the sample of behaviour to be classified [13], the number of sample behaviours to classify from [14] and the machine learning method that is used [7]. Using too many categories of behaviour may affect the ability of the machine learning method to accurately classify all behaviour. For example, in a study attempting to classify seven categories of behaviour, the attack/peck category created for crab plovers could not be predicted using decision trees [11]. Using hidden semi-Markov models, two categories of behaviour could be classified with much higher accuracy than three, four or five categories [12]. Accuracies of machine learning models are likely to improve when using fewer categories because the algorithm has fewer classes to distinguish between. This is especially true if the classes that are being combined are often misclassified as each other. For example, if we have three classes A, B and C, and classes B and C are often misclassified as each other, then combining them into one class will increase classification accuracy, at the cost of less detail overall. Alternatively, extending the sample of time from the accelerometry data used to classify behaviour can improve the overall accuracy of machine learning methods by providing more samples overall [13].
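The three-class example can be made concrete with a small worked calculation. The sketch below is illustrative only: the confusion-matrix counts are invented, and the helper names (`accuracy`, `merge_classes`) are hypothetical rather than part of any analysis in this study. It shows how collapsing two mutually confused classes moves their off-diagonal errors onto the diagonal.

```python
# Hypothetical confusion matrix (rows = true class, columns = predicted class).
# The counts are invented purely for illustration.
CONFUSION = [
    [90, 5, 5],    # true A
    [4, 60, 36],   # true B, often predicted as C
    [6, 34, 60],   # true C, often predicted as B
]


def accuracy(matrix):
    """Proportion of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def merge_classes(matrix, groups):
    """Collapse a confusion matrix.

    `groups` lists, for each new class, the original class indices it contains.
    """
    return [
        [sum(matrix[r][c] for r in rows for c in cols) for cols in groups]
        for rows in groups
    ]


# Keep A on its own and combine B and C into a single class.
merged = merge_classes(CONFUSION, [[0], [1, 2]])
three_class_accuracy = accuracy(CONFUSION)
two_class_accuracy = accuracy(merged)

print(f"three classes: {three_class_accuracy:.3f}")
print(f"B and C merged: {two_class_accuracy:.3f}")
```

With these invented counts, accuracy rises once B and C are merged, but the resulting ethogram can no longer tell B from C.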

The machine learning method selected to classify the data will also influence the overall accuracy [7, 12]. There have been several attempts to evaluate the accuracies of different machine learning methods [7, 13, 15]. However, because the dynamic movements of different animal species are so distinct, it is unlikely that there will ever be a universal template for creating ethograms from accelerometry [16, 17]. Instead, the machine learning method described here may afford a solution to the problem of method selection. Super learning takes a set of candidate learners (other machine learning methods), applies them to a data set and chooses an optimal learner or combination of learners based on the resultant cross-validated risk [18]. The super learner model (SL) seeks the optimal combination of candidate learners such that it will perform as well as or better than any of the learner inputs [19]. Super learning has previously been applied to large medical data sets in order to make survival predictions with considerable success [20], but has until now not been evaluated for its ability to classify behaviour from accelerometry data.

The ability to reliably build highly generalizable models for the classification of animal behaviour will be a significant advance for the study of those species that are difficult or impossible to observe in the wild or sustain in captivity [6, 16]. Otariid pinnipeds, fur seals and sea lions, play an important role in the trophic interactions of many marine ecosystems [21], yet despite the importance of this group to understanding marine ecosystems, there is still much to learn about the behaviour of these and other marine predators [22]. Marine animals are very difficult to observe in the wild as they are active in remote locations and deep underwater where direct observation is often not possible, but being large and semi-aquatic, otariids are ideal candidates for remote observation using accelerometry [2, 23]. To reliably classify animal behaviours from accelerometry, it is necessary to evaluate the performance of different models and their parameters [7]. The aims of this study are twofold: (1) assess whether super learning can improve the accuracy of classifying accelerometry data in general and (2) identify the optimal time window and number of behaviour categories required to create reliable ethograms for a representative group of animals: fur seals and sea lions.



We conducted captive experiments at three Australian marine facilities: Dolphin Marine Magic, Coffs Harbour (RF1: 30°17′S, 153°8′E); Underwater World, Sunshine Coast (RF2: 25°40′S, 153°7′E); and Taronga Zoo, Sydney (RF3: 33°50′S, 151°14′E) from August to November 2014 and again at RF2 in August 2015. We used two Australian fur seals (Arctocephalus pusillus doriferus), three New Zealand fur seals (Arctocephalus forsteri), one subantarctic fur seal (Arctocephalus tropicalis) and six Australian sea lions (Neophoca cinerea) (Table 1). All seals were on permanent display at their respective marine facilities and were fed and cared for under the guidelines of the individual facility. All Australian sea lions in the study were born as part of an ongoing captive breeding programme in Australian aquaria. All fur seals came into captivity as juveniles after they were found in poor health or injured and were deemed unsuitable for release back into the wild.

Table 1 Study species and characteristics of seal identification, marine facility, species, age, mass range, sex, number of trials and method of accelerometer attachment for fur seals and sea lions used in the study

Experimental protocol

We used a triaxial accelerometer (CEFAS G6a+: 40 mm × 28 mm × 16.3 mm, 18 g in air and 4.3 g in seawater, CEFAS Technology Ltd, Lowestoft, UK) to measure the movement of the seals. We used two attachment methods for accelerometers: either taped between the shoulder blades or secured in a custom-designed harness. Accelerometers were set to record at ±8 g and at 25 samples per second (25 Hz) on each axis. We recorded all trials continuously with one or two cameras (GoPro Hero 3—Black edition, USA; HDRSR11E: Sony, Japan), and trials had a maximum duration of 2.5 h. Videos were scored to an ethogram consisting of 26 unique behaviours developed previously [14]. We time-matched the videos and the accelerometry output to generate annotated acceleration data sets.

Behaviour segmenting

We grouped the 26 behaviours into broader behavioural categories. As the number of behavioural categories used to classify behaviour may affect the overall results, the analysis was run twice, first using four categories (foraging, grooming, resting and travelling) and then six categories (feeding, foraging, thrashing, grooming, resting and travelling) (Table 3; for a description of the individual behaviours in each of the categories please see [14, S1 File]). We also compared the ability of the model to discriminate behaviours over a range of discrete periods. We tested four epochs (number of accelerometer samples): 7 (0.28 s), 13 (0.52 s), 25 (1 s) and 75 (3 s) [24]. Behaviours could also be “contaminated”, where two behaviours occur in the same time window. In these cases, we used the dominant behaviour, which resulted in windows of uneven time duration.
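The epoch segmentation can be sketched as follows. This is a simplified illustration, not the procedure used in the study: it assumes fixed-length, non-overlapping windows that take the majority label, whereas the handling described above yields windows of uneven duration. The function name `segment_epochs` and the example label sequence are hypothetical.

```python
from collections import Counter

SAMPLE_RATE_HZ = 25  # recording rate stated in the experimental protocol


def segment_epochs(labels, epoch_len):
    """Split per-sample behaviour labels into consecutive windows.

    Each window is assigned its dominant (most frequent) label. Returns a list
    of (start_index, dominant_label, purity), where purity is the fraction of
    samples in the window that carry the dominant label. A trailing partial
    window is dropped (an assumption of this sketch).
    """
    epochs = []
    for start in range(0, len(labels) - epoch_len + 1, epoch_len):
        window = labels[start:start + epoch_len]
        dominant, count = Counter(window).most_common(1)[0]
        epochs.append((start, dominant, count / epoch_len))
    return epochs


# Hypothetical annotation: 10 samples of resting followed by 16 of travelling.
labels = ["resting"] * 10 + ["travelling"] * 16
epochs = segment_epochs(labels, epoch_len=13)

for start, behaviour, purity in epochs:
    seconds = start / SAMPLE_RATE_HZ
    print(f"t={seconds:.2f} s  {behaviour}  purity={purity:.2f}")
```

The first window here is "contaminated": it contains both behaviours and is labelled with the dominant one, resting.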

Summary statistics

We created 147 summary statistics as the inputs to the machine learning models. Most were summary statistics created from the x, y and z inputs (described below), and a few related to the animal or the behaviour, including where the behaviour occurred (surface, underwater or land), device attachment method (harness or tape), and the age, mass, sex and species of the individual [14]. The location of the behaviour was determined by observation; however, in the wild, it can be determined using a combination of depth and the wet/dry sensor on the accelerometer (M. Ladds, M. Salton, R. McIntosh, D. Hocking, D. Slip, R. Harcourt, unpublished observations). For each of the three axes (x, y, z), we calculated mean, median, minimum, maximum, range, standard deviation, skewness, kurtosis, absolute value, inverse covariance and autocorrelation trend (the coefficient derived from a linear regression) and the 10th and 90th percentiles. We also calculated q as the square root of the sum of squares of the three axes [7] and included pairwise correlations of the three axes (xy, yz, xz) [25]. The inclination and azimuth were calculated as per Nathan et al. [7]. We calculated dynamic body acceleration (DBA) by using a running mean of each axis over three seconds to create a value for static acceleration [26]. We then subtracted the static acceleration at each point from the raw acceleration value to create a value for partial dynamic body acceleration (PDBA). We calculated overall dynamic body acceleration (ODBA) [26, 27] using
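A few of the per-axis statistics can be illustrated with the Python standard library. This is a sketch under stated assumptions, not the code used in the study: it uses population (divide-by-n) moments for the standard deviation, skewness and kurtosis, which may differ from the estimators used by the authors, and the function names are hypothetical.

```python
import math
import statistics


def axis_summary(values):
    """Summary statistics for one acceleration axis within one epoch."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD (an assumption)
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "sd": sd,
        # Moment-based skewness and (non-excess) kurtosis; 0.0 for a flat signal.
        "skewness": m3 / sd ** 3 if sd > 0 else 0.0,
        "kurtosis": m4 / sd ** 4 if sd > 0 else 0.0,
    }


def magnitude(x, y, z):
    """q: square root of the sum of squares of the three axes, per sample."""
    return [math.sqrt(a * a + b * b + c * c) for a, b, c in zip(x, y, z)]


def pearson(a, b):
    """Pairwise Pearson correlation between two axes (e.g. x and y)."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


summary = axis_summary([1.0, 2.0, 3.0, 4.0, 5.0])
print(summary)
```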

$${\text{ODBA}} = \left| {X_{\text{dyn}} } \right| + \left| {Y_{\text{dyn}} } \right| + \left| {Z_{\text{dyn}} } \right|$$

We calculated vectorial dynamic body acceleration (VeDBA) [28] using

$${\text{VeDBA}} = \sqrt {X_{\text{dyn}}^{2} + Y_{\text{dyn}}^{2} + Z_{\text{dyn}}^{2} }$$

We calculated the area under the curve for both ODBA and VeDBA using the package “MESS” in R [29, 30]. The minimum, maximum and 10th and 90th percentiles were calculated for PDBA, ODBA and VeDBA.
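The dynamic body acceleration calculations can be sketched as below. This is a minimal illustration, not the R code used in the study: it assumes a centred running mean whose window shrinks at the ends of the record, and it uses the trapezoidal rule for the area under the curve. The function names and the toy signal are hypothetical.

```python
import math


def running_mean(values, window):
    """Centred running mean; the window is truncated at the edges (assumption)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def dynamic_acceleration(values, window):
    """Raw acceleration minus static (running-mean) acceleration: one PDBA axis."""
    static = running_mean(values, window)
    return [raw - s for raw, s in zip(values, static)]


def odba_vedba(x, y, z, window=75):
    """ODBA and VeDBA per sample; 75 samples is 3 s at 25 Hz."""
    xd, yd, zd = (dynamic_acceleration(axis, window) for axis in (x, y, z))
    odba = [abs(a) + abs(b) + abs(c) for a, b, c in zip(xd, yd, zd)]
    vedba = [math.sqrt(a * a + b * b + c * c) for a, b, c in zip(xd, yd, zd)]
    return odba, vedba


def trapezoid_area(values, dt):
    """Area under the curve by the trapezoidal rule, with sample spacing dt."""
    return sum((values[i] + values[i + 1]) / 2 * dt for i in range(len(values) - 1))


# Toy signal: a single spike on the x axis, with a short window for readability.
x = [0.0, 0.0, 3.0, 0.0, 0.0]
y = [0.0] * 5
z = [0.0] * 5
odba, vedba = odba_vedba(x, y, z, window=3)
print(odba, vedba, trapezoid_area(odba, dt=1 / 25))
```

When only one axis moves, ODBA and VeDBA coincide; in general VeDBA is never larger than ODBA.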

Classification models

There are many candidate models suitable for classifying behavioural data obtained from accelerometry [7], and choosing the most appropriate method for the data in question can be complicated and time-consuming. The super learner model (SL) combines candidate models (other machine learning models, henceforth referred to as base learners) by applying a selection of them to a set of data and then weighting all of these learners through another learner. The optimal combination is chosen based on cross-validated risk [18, 31]. The base learners chosen for this study were random forests (RF) and gradient boosting machine (GBM), both of which have previously been demonstrated to classify this type of data effectively [14], and a baseline model, logistic regression (LR), against which the performance of the other models could be compared. Logistic regression was included as a baseline model as it is well tested, easy to implement and unlikely to overfit. Each base learner was trained across a set of parameters, with the predictions of each model kept. These predictions, plus the raw data, then became the inputs to the SL. The SL then learned from the predictions of the base learners as well as the summary and feature statistics to predict the outcomes.
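The idea of weighting base learners by cross-validated risk can be illustrated with a deliberately simplified sketch. It is not the SL implemented in this study, which also passes the summary statistics to the second-level learner. Here the assumptions are: a binary outcome, held-out predicted probabilities that are already available for each base learner (the values below are invented), log loss as the risk, and a coarse grid search over convex weights in place of a fitted meta-learner.

```python
import itertools
import math


def log_loss(probs, labels):
    """Mean negative log-likelihood of binary labels given predicted P(label = 1)."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        likelihood = p if y == 1 else 1 - p
        total += math.log(max(eps, likelihood))
    return -total / len(labels)


def blend(pred_sets, weights):
    """Weighted average of the base learners' predictions, sample by sample."""
    n = len(pred_sets[0])
    return [sum(w * preds[i] for w, preds in zip(weights, pred_sets)) for i in range(n)]


def simplex_grid(n_learners, steps):
    """All weight vectors with entries k/steps that are non-negative and sum to 1."""
    for combo in itertools.product(range(steps + 1), repeat=n_learners):
        if sum(combo) == steps:
            yield tuple(k / steps for k in combo)


def fit_weights(cv_preds, labels, steps=10):
    """Pick the convex combination with the lowest cross-validated risk."""
    best = min(simplex_grid(len(cv_preds), steps),
               key=lambda w: log_loss(blend(cv_preds, w), labels))
    return best, log_loss(blend(cv_preds, best), labels)


# Invented held-out predictions from three hypothetical base learners.
LABELS = [1, 0, 1, 1, 0, 0, 1, 0]
CV_PREDS = [
    [0.9, 0.2, 0.6, 0.7, 0.4, 0.1, 0.8, 0.3],  # stand-in for RF
    [0.8, 0.1, 0.7, 0.4, 0.3, 0.2, 0.9, 0.6],  # stand-in for GBM
    [0.6, 0.4, 0.5, 0.6, 0.5, 0.4, 0.6, 0.5],  # stand-in for LR
]

weights, risk = fit_weights(CV_PREDS, LABELS)
print("weights:", weights, "cross-validated risk:", round(risk, 4))
```

Because the weight grid includes the corner cases that put all weight on a single learner, the selected combination can never have a higher cross-validated risk than the best individual base learner on the same held-out predictions.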

For each of the models, data were split into a train (evaluation) and test (validation) set using 70 and 30% of the data, respectively. In total, the models were trained on ~90,000 individual data points, or roughly 13 h of coded data. Note that the test data were not seen by the model during training. This ensured that the scores obtained from the models reflected the ability of the model to predict from data outside training. Results of the model were reported as cross-validation scores and out-of-sample scores, which include accuracy and kappa (Additional file 1). Accuracy was the proportion of samples that the model classified correctly (true positives across all categories), while kappa was employed as more than two observers were used to classify data, thereby providing a measure that accounts for the fact that some of their observations will agree or disagree by chance [32]. This value was used to assess agreement of observed and predicted values in the confusion tables [24]. Precision and sensitivity are reported in the confusion matrix (Table 4), where precision is defined as the proportion of predictions from a behaviour category that were actually that behaviour, and sensitivity is the proportion of behaviours from a category that were classified as that behaviour [16].
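The four reported scores can all be derived from a confusion matrix, as the following sketch shows. The counts are invented, rows are taken to be the observed (true) category and columns the predicted category, and kappa is computed as Cohen's kappa, which is an assumption about the exact variant reported here.

```python
def classification_metrics(matrix):
    """Accuracy, Cohen's kappa, and per-class precision and sensitivity.

    `matrix[i][j]` counts samples of true class i predicted as class j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]                       # observed per class
    col_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]  # predicted per class

    observed = sum(matrix[i][i] for i in range(n)) / total          # accuracy
    expected = sum(row_totals[i] * col_totals[i] for i in range(n)) / total ** 2
    kappa = (observed - expected) / (1 - expected)                  # agreement beyond chance

    precision = [matrix[i][i] / col_totals[i] if col_totals[i] else 0.0 for i in range(n)]
    sensitivity = [matrix[i][i] / row_totals[i] if row_totals[i] else 0.0 for i in range(n)]
    return {"accuracy": observed, "kappa": kappa,
            "precision": precision, "sensitivity": sensitivity}


# Invented two-class example.
metrics = classification_metrics([[40, 10],
                                  [20, 30]])
print(metrics)
```

In this example the accuracy is 0.70 but kappa is only 0.40, because half of the agreement would be expected by chance given the class totals.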

Parameter grid search

Within each model, there were a number of parameters with which models could be trained. Samples of each of these parameters were chosen, and each model was run through every combination using a grid search (Table 2; Additional file 2). We evaluated the best parameter grids of each model using H2O [33] for GBM and RF, and glmnet [34] for LR and the SL. All analyses were run using R [30].
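An exhaustive grid search of this kind can be sketched generically. The parameter names and values below are hypothetical placeholders and do not reproduce Table 2, and the scoring function is a toy stand-in for a cross-validated accuracy that would normally come from fitting the model.

```python
import itertools

# Hypothetical grid; these names and values are illustrative only.
GRID = {
    "ntrees": [50, 100, 200],
    "max_depth": [5, 10],
    "learn_rate": [0.01, 0.1],
}


def expand_grid(grid):
    """Yield one dict per combination of parameter values."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, score_fn):
    """Return the parameter combination with the highest score."""
    return max(expand_grid(grid), key=score_fn)


def toy_score(params):
    """Stand-in for a cross-validated score; peaks at an arbitrary combination."""
    return -(abs(params["ntrees"] - 100)
             + abs(params["max_depth"] - 10)
             + abs(params["learn_rate"] - 0.1))


combinations = list(expand_grid(GRID))
best = grid_search(GRID, toy_score)
print(len(combinations), "combinations; best:", best)
```

The number of model fits grows multiplicatively with each added parameter, which is why only samples of each parameter are searched.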

Table 2 Parameters for the four models tested


Triaxial acceleration data were collected from 12 seals over a range of trials lasting in duration from 10 min to 2.5 h (Table 1). From these we were able to mark 7525 bouts of behaviour, split into either four or six categories (Table 3).

Table 3 Number of unique behaviours observed from video analysis for each category of behaviour

Comparing model performance

All three test models (SL, RF and GBM) had significantly higher accuracies than the baseline model (LR) across the range of epochs and categories of behaviour tested (Fig. 1). The SL accuracy ranged from 71.6% (7 epochs) to 73.6% (13 epochs) for six categories of behaviour and from 83.4% (25 and 75 epochs) to 85.1% (13 epochs) for four categories of behaviour (Additional file 2). The RF was slightly less accurate, ranging from 82.3% (75 epochs) to 84.4% (13 epochs) for four categories and from 67.8% (75 epochs) to 72.7% (13 epochs) for six categories. GBM performed slightly less well than the SL and about the same as the RF, with accuracies ranging from 70.9% (75 epochs) to 73.4% (13 epochs) for six behaviour categories and from 82.0% (75 epochs) to 84.7% (13 epochs) for four categories of behaviour. The LR accuracies were significantly below all of these for all categories, ranging from 74.1% (7 epochs) to 77.0% (75 epochs) for four categories and from 63.2% (75 epochs) to 65.1% (13 epochs) for six categories.

Fig. 1

Classification accuracy from cross- and out-of-sample validation of four different machine learning algorithms. Coloured points (blue four-feature models; orange six-feature models) represent out-of-sample accuracy with error bars of ±1 SD. Red bars represent cross-validation accuracy for each associated model. SL super learner, RF random forest, GBM gradient boosting model, LR logistic regression

SL classified categories of behaviour with higher accuracy and lower variance than both RF and GBM across all epochs (except GBM 7 epochs, six categories). The variance was reduced by ~70% across all model combinations tested, and accuracy was improved by between −0.1 and 10.1% (Fig. 1; Additional file 1). The variances obtained from the logistic regression models were similar to the SL. Accuracy and precision of all models improved when using four as opposed to six categories of behaviour. Looking at the overall performance of the models from the highest cross-validation score, out-of-sample score and the kappa score, we concluded that using 13 epochs produced the best results across the four models (Additional file 1).

Identifying categories of behaviour

Across all models and epochs, grooming and resting were classified with the highest accuracy, with grooming generally outperforming resting (Fig. 2; Additional file 2). The confusion matrix from the best performing model (SL with four behaviours and 13 epochs) revealed that foraging was often misclassified as travelling and vice versa (Table 4). Overall, within the test models (SL, RF, GBM), all four behaviours were correctly classified more than 75% of the time (Fig. 2). Within the six behaviour categories, the main misclassification stemmed from feeding, which only the super learner classified correctly more than 50% of the time. The “thrashing” category that was also added to the model was classified with high accuracy (>75%). Resting and grooming maintained their high predictive accuracies across the test models (>80%). Foraging also maintained a reasonably high rate of classification (>70%), while travelling lost around 10% accuracy when compared with the four behaviour models.

Fig. 2

Classification accuracy of behaviour across epochs and models. Four (a) and six (b) categories of behaviour were tested across four (SL, RF, GBM and LR) models across four (7, 13, 25, 75) epochs

Table 4 Confusion matrices from three test models using four behaviours and 13 epochs


The aim of this study was to assess whether super learning would improve the predictive ability of base learners (RF, GBM and LR) to classify behaviour from free-living animals using accelerometry. While building machine learning models, a number of choices must be made about how to segment the data. We evaluated several combinations of time segmentation and number of behaviour categories for this type of accelerometry data. Using super learning increased the accuracy of the models, albeit only slightly, and reduced the prediction error when compared with RF, GBM and the baseline model, LR. Shorter time windows (≤13 samples) and fewer categories of behaviour (4 vs. 6) were better at predicting the behaviour state.

Number of behaviour categories: Less is more?

Four behavioural categories had a higher classification rate than six behaviours. At their most basic, accelerometers discriminate between two behavioural states (e.g. activity vs. resting or swimming vs. prey capture) and can do so accurately [5, 35]. Adding more categories for the model to discriminate increases complexity, but reduces the uniqueness of the model, thus decreasing its overall accuracy [12, 13]. There is also a greater chance of overlap with other behavioural categories. Increasing behaviour categories from four to six produced an overall average 11.5% (range 9.5–14.5%) decrease in accuracy. The optimal number of categories becomes a trade-off between useful ecological information and high accuracy. Reducing the number of categories broadens the scope of the remaining categories, as more similar behaviours are considered together and are thus easier for the model to discriminate. An important distinction to make is that considering fewer categories does not mean removing behaviours from the models, because if those behaviours are observed in the wild, the model will still try to classify them, resulting in an inaccurate representation of what the animal did while being monitored (for a discussion of this issue see [14]). As the loss in accuracy is so small, this leaves it up to the researcher to determine whether quality (fewer behaviours, more accuracy) or quantity (more behaviours, less accuracy) is important in the study. In this illustration of the method, which is broadly applicable to all free-living animals that can be equipped with accelerometers, we used fur seals and sea lions. For species such as these, four behavioural categories appear to be the minimum that provides meaningful information about their activities. In future studies that use this method, the number of categories must be tailored to the species concerned and the aims of the study.

Epoch size: Smaller is better?

We found that smaller epochs gave better overall predictions, and that the length of the epoch was significant in predicting different categories of behaviour. Increasing the window size reduces the sample size, which likely decreases the overall ability of the models to predict accurately. Having smaller epochs increases the sample size and reduces the chances of the model overfitting. Large epoch sizes are also more likely to capture more than one behaviour, increasing the difficulty for the model to distinguish between classes. In contrast to our results, a study of cow behaviour found that longer epochs tended to perform better than shorter epochs (5 and 10 vs. 1 min) [13]. However, a similar study with humans found that epochs of one to two seconds had the best precision values [36]. That study also found that epoch length significantly affected the overall accuracy of individual behaviours, which accords with our findings. We found different prediction accuracies when adding thrashing and feeding to the model. All models predicted thrashing with high accuracy (~75%), while only the SL predicted feeding with more than 50% accuracy. Thrashing is a very distinctive behaviour, with accelerometer readings exceeding 4 g; very few other behaviours have this feature. By contrast, we defined feeding as a seal taking fish out of the water column, and because animals were swimming while taking fish, it was difficult for the models to distinguish between these two behaviours. Any additional behaviours added to the base four-category model therefore need to be clearly distinct from any other behaviour. Future studies investigating seal feeding behaviour should seek to gather examples of seals capturing live prey.

Super models: Is it worth it?

The idea of a super machine learning model is enticing: a multitude of machine learning models can be trained and tested on a single set of data, and the super learner then optimally combines the individual models to give better overall predictions. Super learning has been successfully used in medical research [20] and spatial analyses [19], and here it improved the behaviour classification models from accelerometry, albeit marginally. With the exception of a single model combination (GBM; 7 epochs, 6 features), the super learner performed better than any other model combination. This was expected, as super learning will use the optimal model it has trained on if it is unable to compute a more optimal solution [19]. We found an average increase of 3.4% (range −0.1 to 10.1%) in the classification accuracy of the models using super learning. While any improvement in model performance is welcome, single state-of-the-art algorithms like GBM are easy to implement in software environments like R. However, this research has only investigated a small aspect of the potential of super learner models. Super learners are unrestricted by the number and type of models that constitute the base learners, so can be optimized for the type of data that is input. This is particularly useful if researchers are interested in a particular behaviour that is usually difficult to distinguish with a single model (e.g. attack/peck in plovers [11]) or if very high accuracy is imperative for the research objectives. We suggest that individual researchers take this into consideration when deciding whether the additional human and computational time required to implement super learning will be beneficial for their behavioural data study.


This study evaluated a number of machine learning methods to classify accelerometry data and compared them to a new method, super learning. We found that super learning improved the accuracy and reduced the variance in the predictive accuracy of the model. We showed that the epochs (number of samples) and number of behavioural categories influenced the overall accuracy of the model. This study demonstrates the importance of evaluating all options when using machine learning to classify animal behaviour. While this is by no means an exhaustive demonstration of the possible choices to be made when implementing machine learning methods, the options highlighted here (number of behaviour categories, epoch size, model selection and parameter grid search) are some of the most important and easiest to test when conducting this type of statistical analysis. Future studies classifying animal behaviour from accelerometry using machine learning should, where possible, test their models across a selection of these options in order to obtain the highest accuracies.