Introduction

Achilles tendon ruptures (ATR) are career-threatening injuries in elite soccer players. Although the return to play (RTP) rate is relatively high (96% according to Grassi et al. [11]), 18% of players do not return to the same level of competition within two seasons following injury [34]. Furthermore, previous research has shown that soccer players suffering these injuries have their careers shortened, on average, by two seasons compared to matched controls [30].

Several studies reporting outcomes of ATRs in elite athletes are based on publicly available information [11, 13, 15, 24, 26, 30, 33, 34]. In soccer, one notable source is transfermarkt.com [10, 11, 16, 21, 34, 36], which has been considered accurate, regarding injury denomination and location, in 89% of cases [8, 16]. Although primarily aimed at aggregating player market values and transfer fees, it includes other valuable data for sports analytics, such as match results, player performance indicators (namely goals, assists, and fouls), and injury history. This database is publicly available and maintained by transfermarkt.com and its user community [32].

Artificial intelligence is a field that studies artificial agents that can mimic or surpass human-level intelligent tasks and has become increasingly popular in the past decade [7]. Machine Learning (ML) is a subset of artificial intelligence related to “advanced statistical techniques that use computer algorithms to model complex relationships between variables”, with these computer algorithms learning automatically from experience, i.e. data, without direct human intervention [20]. These algorithms rely on data analysis models to uncover hidden patterns and other meaningful insights from large datasets [28]. Among these algorithms, one can find both unsupervised and supervised learning methods [1]. Unsupervised learning is used when “labels” are unavailable [1], i.e. individual instances in the dataset are not categorized. These algorithms can organize individual instances according to naturally emerging patterns in the dataset, detect anomalous patterns and perform dimensionality reduction [1, 7]. Supervised learning is used when data are “labeled”, i.e. the algorithm is fed training data where individual instances—observations—and corresponding output values, obtained with human intervention, are known [1]. Regression and classification problems are the two main categories into which supervised learning can be divided [1, 7].

Despite recent advances in the characterization of consequences of ATRs for elite soccer players [10, 11, 30, 34], both an evaluation of how match participation evolves after injury and a set of prognostic tools to gauge the likelihood of return to the same level of play are still missing in the literature. In addition, previous studies of elite soccer players treated for ATRs have also been limited by their reduced number of cases under consideration [10, 30, 34], by being restricted to a single league [10, 30], or by missing performance measures besides the return to play at the same competitive level [11].

This study has two objectives. The first is to inform athletes and staff, through an exploratory data analysis, how match participation evolves after ATRs. The second is to evaluate the performance of an ML model based on pre-injury features in predicting whether a player will return to a similar level of match participation, together with a study of the most relevant features for this task.

Materials and methods

Player screening and selection

The website transfermarkt.com (Transfermarkt, Hamburg, Germany) was mined, between January and March of 2021, for relevant entries regarding soccer players who suffered an ATR while playing in first or second leagues.

A customized web scraper was developed using Scrapy [22]. Player screening and selection were carried out using the following scheme: firstly, a list of all first and second leagues across the world was manually compiled; secondly, team rosters for each team in each league, since season 2007/2008, were extracted to a list; finally, the injury data of each player in the list were retrieved. The resulting injury data were filtered for entries containing the string “Achilles tendon rupture” or “Achilles” combined with more than 90 days of absence. Another group of players with absence times of more than 90 days was built from the following strings: “calf”, “leg”, and “ankle tendon”.
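The string-based screening step can be illustrated with a minimal sketch. The record layout (`injury`, `days_out`) and the names `classify_entry`, `MIN_DAYS`, and `SECONDARY_TERMS` are hypothetical and not taken from the original scraper. The sketch also assumes one possible reading of the filter, in which the full string “Achilles tendon rupture” qualifies regardless of absence duration, whereas “Achilles” alone requires more than 90 days of absence.

```python
# Hypothetical record layout: each scraped injury entry is a dict with
# "injury" (free-text label) and "days_out" (days of absence).
MIN_DAYS = 90  # threshold from the text: more than 90 days of absence
SECONDARY_TERMS = ("calf", "leg", "ankle tendon")


def classify_entry(entry):
    """Return "primary", "secondary", or None for one injury entry."""
    label = entry["injury"].lower()
    long_absence = entry["days_out"] > MIN_DAYS
    # Assumed reading: the full string matches regardless of duration.
    if "achilles tendon rupture" in label:
        return "primary"
    if "achilles" in label and long_absence:
        return "primary"
    # Second screening group: other lower-leg labels with a long absence.
    if long_absence and any(term in label for term in SECONDARY_TERMS):
        return "secondary"
    return None


# Made-up entries for illustration only.
entries = [
    {"injury": "Achilles tendon rupture", "days_out": 210},
    {"injury": "Achilles tendon problems", "days_out": 30},
    {"injury": "Achilles tendon irritation", "days_out": 120},
    {"injury": "Calf injury", "days_out": 95},
    {"injury": "Hamstring strain", "days_out": 150},
]
screened = [classify_entry(e) for e in entries]
print(screened)
```

Entries flagged by such a filter are only candidates; inclusion still depends on the manual review described next.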

Each entry was then evaluated independently by two researchers. Only players with club reports, press releases, or interviews mentioning a complete ATR were eligible for inclusion. A minimum follow-up of 24 months was also required. Due to the COVID-19 pandemic and ensuing match calendar rearranging, only injuries occurring before 31 March 2018 were included. Players who suffered partial or focal tears of the Achilles, and players who suffered an ATR while playing for teams not in first or second leagues, or were unaffiliated with any team at the moment of injury, were excluded. Disagreements were settled by discussion with a third researcher on a case-by-case basis.

Data extraction and dataset handling

The transfermarkt.com website was also scraped for the following items: date of birth, height, preferred foot, playing position, club transfers (including projected market values and transfer fees), whether the player had played for the national team (at any time during the player’s career), date of clearance for unrestricted practice, and match participation data (as minutes on the playing field; for the season of injury, the preceding season, and the two seasons following injury). Specific match participation data included: minutes played, whether the player was in the starting team, whether the player did not play but sat on the bench, and the reason for not playing (medical injuries, coach choice, or other). Data were anonymized, pooled into a database, inspected, and formatted for consistency. In cases where players sustained bilateral ruptures, the first rupture was considered the index event.

Dealing with missing data

Missing data regarding minutes played per match were imputed using spline interpolation. In addition, missing values regarding categorical features related to match participation (reason for player absence from the playing field and whether the player was in the starting eleven) were imputed using backfilling. Of note, less than 0.01% of matches had missing information.
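As an illustration of the backfilling step for categorical features, the sketch below replaces each missing value with the next observed one. The function name `backfill` and the example values are hypothetical. The spline interpolation used for minutes played is not reproduced here.

```python
def backfill(values):
    """Replace each None with the next observed value in the sequence.

    Trailing None entries (with no later observation) are left as None.
    """
    filled = list(values)
    next_seen = None
    # Walk backwards so that each gap inherits the value that follows it.
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = next_seen
        else:
            next_seen = filled[i]
    return filled


# Made-up sequence of a player's status over five consecutive matches.
status = ["starter", None, None, "bench", None]
print(backfill(status))
```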

Feature engineering

The following features were computed from the available data: age at rupture, relative market value (obtained by dividing the player’s market value by the squad total market value), whether a re-rupture or a contra-lateral rupture happened, whether there were other preceding or following Achilles Tendon (AT) problems, date of the first official match participation following rupture, whether the player retired, changed clubs or was left without a club within the 2 years following injury, and minutes and matches played in the 24 (Year − 2) and 12 months (Year − 1) preceding injury, and in the 12 (Year 0) and 24 months (Year 1) after injury. In addition, to account for discrepancies in available playtime, players’ match participation data were averaged by the number of matches played by the team in 30-, 90-, 120-, 180- and 360-day intervals.

Additional feature engineering was then performed, leading to the creation of the following features: the player’s market value multiplied by the average minutes played per match in Year − 1, the market value of the team multiplied by the player’s average minutes played per match in Year − 1, the difference in minutes played per match in Year − 1 and Year − 2, how many days had elapsed since the player joined the team when the injury happened and the number of months elapsed since the beginning of the season when the injury occurred.
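The derived features described above can be sketched as follows. All dictionary keys and the function name `engineer_features` are hypothetical. The sketch assumes that the team market value equals the squad total market value, that the difference in minutes per match is taken as Year − 1 minus Year − 2, and that months into the season are counted as whole calendar months; none of these details is stated in the text.

```python
from datetime import date


def engineer_features(p):
    """Compute derived pre-injury features from one player's record `p`."""
    return {
        # Player market value divided by the squad total market value.
        "relative_market_value": p["player_value"] / p["squad_value"],
        # Market values weighted by average minutes per match in Year -1.
        "player_value_x_mpm": p["player_value"] * p["mpm_year_minus_1"],
        "team_value_x_mpm": p["squad_value"] * p["mpm_year_minus_1"],
        # Assumed sign convention: Year -1 minus Year -2.
        "mpm_trend": p["mpm_year_minus_1"] - p["mpm_year_minus_2"],
        # Days between joining the team and the injury.
        "days_with_team": (p["injury_date"] - p["joined_date"]).days,
        # Whole calendar months between season start and injury.
        "months_into_season": (
            (p["injury_date"].year - p["season_start"].year) * 12
            + p["injury_date"].month
            - p["season_start"].month
        ),
    }


# Made-up player record (market values in millions).
player = {
    "player_value": 4.0,
    "squad_value": 80.0,
    "mpm_year_minus_1": 60.0,
    "mpm_year_minus_2": 45.0,
    "joined_date": date(2015, 7, 1),
    "injury_date": date(2016, 10, 15),
    "season_start": date(2016, 8, 1),
}
features = engineer_features(player)
print(features)
```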

Machine learning model development and calibration

Unsupervised and supervised machine learning models were trained and evaluated using the Python scikit-learn library on the Google Colab platform [2, 25].

The difference between average minutes played per match during Year 1 and Year − 1 (ΔMPM) was used to survey patterns in match participation after injury. Clustering analysis was performed using k-means clustering [1]. The optimal number of clusters was determined using the silhouette score [27], which varies between − 1 and + 1, and evaluates how similar data points are to their clusters compared to other clusters. A value of 0 represents overlapping clusters, and negative values signify that data points have been assigned to the wrong cluster. The silhouette score is frequently used to assess clustering quality, in the absence of a standard method in the research community [27]. Cluster stability was evaluated by repeatedly randomly dividing the main dataset into training and test datasets (number of repeats: 100; train/test split: 50/50) and measuring the similarity of the resulting clustering with the Adjusted Rand Index and Fowlkes–Mallows scores, using the main dataset cluster labels as ground truth.
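To make the silhouette score concrete, the sketch below computes it for one-dimensional data such as the ΔMPM, using the absolute difference as the distance. It is a plain-Python illustration written for this purpose, not the library routine used in the study; the function name `silhouette_score_1d` and the example values are hypothetical, and at least two clusters are assumed.

```python
def silhouette_score_1d(values, labels):
    """Mean silhouette coefficient for one-dimensional data.

    Assumes at least two distinct cluster labels.
    """
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)

    scores = []
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself does not affect the sum).
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(abs(v - u) for u in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Made-up changes in minutes per match for four players.
values = [-40.0, -38.0, 0.0, 2.0]
good = silhouette_score_1d(values, [0, 0, 1, 1])  # well-separated clusters
bad = silhouette_score_1d(values, [0, 1, 0, 1])   # points in wrong clusters
print(good, bad)
```

In this toy example the well-separated labeling scores close to + 1, while the deliberately wrong labeling gives a negative score, in line with the interpretation given above.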

The post-injury match participation level was predicted using the XGBoost classification algorithm [6, 12]. Continuous variables were scaled with standardization. Feature selection was performed using forward selection, in which the model starts with no features, and features are added sequentially and kept if results improve. Model outputs were subjected to cross-validation using a stratified ten-fold strategy [19]. In stratified ten-fold cross-validation, 90% of the dataset is used to train the model and 10% is used to evaluate it. The procedure is repeated ten times, each with a different train/test split, until the entire dataset has been used as the test set. The model was evaluated using the area under the receiver operating characteristic curve (AUROC) and Brier score loss. A representation of the machine learning processing pipeline can be found in Fig. 1.

Fig. 1

Machine learning processing pipeline. AUROC area under the receiver operating characteristic curve
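The cross-validation scheme and the two evaluation metrics described in the methods can be sketched in plain Python. This is an illustration only and not the study's pipeline: the function names are hypothetical, the fold assignment is deterministic (no shuffling), the coding of Group 1 as 1 is arbitrary, and the toy predictions are made up. Only the class sizes (103 and 106 players) come from the paper.

```python
def stratified_folds(labels, k=10):
    """Assign sample indices to k folds while balancing class counts."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    # Deal each class round-robin so every fold gets a similar share.
    for members in by_class.values():
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    return folds


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auroc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Class sizes from the paper: 103 players in Group 1, 106 in Group 2.
labels = [1] * 103 + [0] * 106
folds = stratified_folds(labels, k=10)
fold_sizes = [len(fold) for fold in folds]

# Toy predictions (made up) to show the two metrics.
y_true = [1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]
print(fold_sizes, auroc(y_true, y_prob), brier_score(y_true, y_prob))
```

Each fold serves once as the test set while the remaining nine folds form the training set. A Brier score of 0 corresponds to perfect probabilistic predictions, and an AUROC of 0.5 to chance-level discrimination.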

Statistical analysis

Statistical analysis was performed using the Python libraries Statsmodels and SciPy. Unless otherwise specified, values are presented as mean ± standard deviation. Groups were compared using Student’s t-test, Kruskal–Wallis, or one-way ANOVA (depending on the number of groups and whether data followed a normal distribution). The assumption of normality was tested using the Shapiro–Wilk test. Pearson’s correlation coefficient was used to explore potential correlations between variables. Statistical significance was set at p < 0.05. Sample size calculation was not performed for this study.

Results

The scraping process retrieved 748 entries. After applying exclusion criteria, 209 players were selected for analysis. Detailed information regarding the screening and selection process, with exclusion criteria, can be found in Fig. 2.

Fig. 2

Player screening and selection flowchart, with exclusion criteria

Player demographics and baseline characteristics

Data related to player demographics and baseline characteristics can be found in Table 1. The mean age at rupture was 28.2 ± 4.0 years (range 20–40).

Table 1 Player demographics and baseline characteristics

Return to competition and career changes

Players were cleared for unrestricted practice after a mean of 223 ± 129 days (range 92–1553). The first post-injury match was played after a mean of 287 ± 136 days (range 106–825).

Fourteen players (6.7%) did not play any match after the AT injury and subsequently retired, with five more players retiring within 2 years after injury, for a total of 19 (9.1%). Three other players (1.4%) had their contracts expire and were left without a club sometime in the 2 years after injury. Thirteen players (6.2%) changed clubs within the 2 years following injury, with nine changing to teams playing below second league (4.3%).

Re-ruptures and other Achilles tendon issues

Ten players sustained re-ruptures (4.8%). These re-ruptures occurred a mean of 621 ± 532 days after the index injury (range 153–1634). Six players (2.9%) sustained contra-lateral ruptures at some point in their careers. Eight players (3.8%) had a record of previous AT problems, and 16 players (7.7%) had another time-loss injury (other than re-rupture or contra-lateral ATR) related to AT problems after the index injury.

Exploratory analysis of match participation data

Data from 32,853 matches were analysed. The average minutes played per match was 48 ± 25 in Year − 2, 46 ± 24 in Year − 1, 11 ± 13 in Year 0, and 32 ± 25 in Year 1. Players were in the squad in 64.1 ± 26.2% of games in Year − 2, 62.6 ± 25.0% in Year − 1, 17.9 ± 18.1% in Year 0, and 47.0 ± 29.4% in Year 1. Players were in the starting eleven in 53.5 ± 28.0% of games in Year − 2, 51.6 ± 27.0% in Year − 1, 12.0 ± 15.1% in Year 0, and 35.3 ± 28.7% in Year 1. These differences were statistically significant (p < 0.001) for all comparisons except between Year − 2 and Year − 1. A plot of average minutes played per match throughout the study time frame, computed in 30-day intervals for each playing position, can be seen in Fig. 3.

Fig. 3

Plot of average minutes played per match (y-axis) for all players included throughout the study time frame and computed in 30-day intervals (x-axis) per playing position. Shaded areas correspond to standard deviation

Pearson’s correlation coefficient showed a small inverse correlation between days until clearance for unrestricted practice and the ΔMPM (r = − 0.2; 95% confidence interval − 0.33 to − 0.07; p < 0.01). A very small positive correlation was found between days elapsed from injury until first match played and the ΔMPM (r = 0.13; 95% confidence interval − 0.01 to 0.26; n.s.). After removal of outliers (those with values above 500 days; n = 18), this correlation was 0.2 (95% confidence interval 0.06–0.33; p < 0.01). Finally, a small positive correlation was also found between the number of days from clearance for unrestricted practice to first match played and the ΔMPM (r = 0.24; 95% confidence interval 0.11–0.36; p < 0.001).
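The confidence intervals for these correlation coefficients can be approximated with the Fisher z-transformation. The sketch below is an assumption about how such intervals are commonly obtained, since the method is not stated in the text; the function name `pearson_ci` is hypothetical, and a sample size of n = 209 is assumed for each correlation.

```python
import math


def pearson_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for Pearson's r (Fisher z)."""
    z = math.atanh(r)              # Fisher transformation of r
    se = 1.0 / math.sqrt(n - 3)    # standard error on the z scale
    low = math.tanh(z - z_crit * se)
    high = math.tanh(z + z_crit * se)
    return low, high


# Reported coefficients; n = 209 is an assumption (full cohort).
for r in (-0.2, 0.13, 0.24):
    low, high = pearson_ci(r, 209)
    print(r, round(low, 2), round(high, 2))
```

Under these assumptions, the resulting intervals round to the values reported above for the three full-cohort correlations.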

Clustering analysis

The optimal number of clusters was four. The silhouette score was 0.55. The Adjusted Rand Index and Fowlkes–Mallows scores were 0.84 and 0.88, respectively. A plot of average minutes played per match for each cluster, computed in 30-day intervals, can be found in Fig. 4. The main characteristics of clusters and respective statistical comparisons can be found in Table 2.

Fig. 4

Plot of average minutes played per match (y-axis) throughout the study time frame and computed in 30-day intervals (x-axis) for each cluster. Shaded areas correspond to standard deviation

Table 2 Main characteristics of clusters and statistical comparisons

Prediction of post-injury match participation

Players were divided into two groups based on whether they suffered a decrease in match participation when comparing average minutes per match in Year 1 and Year − 1. Players were assigned to Group 1 if they showed a decrease larger than 15 min played per match, and this difference was more than 20% of the value in Year − 1 (decreased match participation, n = 103). Otherwise, they were assigned to Group 2 (maintenance or improvement of match participation, n = 106). These designations were used as classification labels to train an ML classification algorithm. A list of included features and relative feature importance can be found in Table 3. After cross-validation, the average model AUROC was 0.81 ± 0.10, and the Brier score loss was 0.12.
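The labeling rule can be written as a short function. The name `participation_group`, its parameter names, and the example values are hypothetical; the thresholds (15 min per match and 20% of the Year − 1 value) are those stated above.

```python
def participation_group(mpm_before, mpm_after, min_drop=15.0, rel_drop=0.20):
    """Return 1 (decreased participation) or 2 (maintained or improved).

    mpm_before: average minutes per match in Year -1
    mpm_after:  average minutes per match in Year 1
    """
    drop = mpm_before - mpm_after
    # Both conditions must hold: absolute drop and relative drop.
    if drop > min_drop and drop > rel_drop * mpm_before:
        return 1
    return 2


# Made-up (Year -1, Year 1) pairs of average minutes per match.
examples = [(60, 40), (90, 74), (30, 10), (50, 40), (40, 55)]
groups = [participation_group(before, after) for before, after in examples]
print(groups)
```

The second example shows why both conditions matter: a drop of 16 min exceeds the absolute threshold, but it is less than 20% of 90 min, so the player stays in Group 2.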

Table 3 Features included in the predictive model and their importance

Discussion

The most important findings of this study were: most players gradually increased match participation during the first year after injury, with goalkeepers still improving after 2 years; and the ML classifier displayed good performance predicting whether a player would return to a similar, or even improved, level of match participation, with the most important features being related to pre-injury performance.

Plateauing of post-injury match participation occurred approximately 1 year after injury for forwards, midfielders, and defenders. Goalkeepers kept increasing playing time throughout the 2 years following injury, albeit at a slower rate. Of note, previous research has shown that outcomes after ATRs improve for at least 1 year after injury [3, 4], possibly due to a need to adapt to biomechanical changes in the lower limb resulting from tendon elongation [9]. Another critical aspect to consider is that psychological factors may be involved [29, 35], in which players need to regain confidence in their abilities and overcome the fear of re-injury.

Differences in match participation between Year − 1 and Year 1 were the subject of clustering analysis. A silhouette score of 0.55 was found for the optimum number of clusters, which denotes moderate cluster separability. In addition, good clustering stability was found through the Adjusted Rand Index and Fowlkes–Mallows scores, meaning that these clusters were relatively consistent, even when only subsections of the dataset were randomly evaluated.

Younger age has been previously recognized as a favourable prognostic factor after ATRs in soccer players [10]. However, this point is controversial since other studies have not found statistically significant differences regarding age in players with favourable versus unfavourable outcomes in soccer [34], American football [24], basketball [15], and baseball [26]. In this study, the average age was lower in clusters C and D (maintenance or improvement of match participation) than clusters A and B (decreased match participation).

The number of days the player has been with the team at the time of injury is a previously unrecognized prognostic factor in ATRs. In this study, players in Cluster A had been with the team for a significantly longer time (1060.1 ± 1287.6 days) than the remaining cohort (p < 0.01). The longer time with the team (or since the last market transfer) may signal a different career context for these players. For example, their contracts may be near expiration, and prospects of joining another team may be dim. Notably, players in this cluster also retired within 2 years in a significantly higher proportion than the remaining cohort (26.5% versus 5.7%; p < 0.01).

Players in Cluster D took a significantly longer time before playing their first official match compared with the remaining cohort (315 ± 119 days versus 264 ± 159 days; p < 0.05), despite similar time intervals from injury to unrestricted practice (207 ± 55 days versus 222 ± 140 days; p = 0.72). Therefore, it can be speculated that by allowing these players more time to recover, they made their comeback at a higher performance level (closer to the full recovery potential), which would be perceived as a superior recovery from injury, encouraging increased match participation. In addition, players in Cluster A (those with the most significant decrease in average minutes played per match in Year 1 compared to Year − 1) showed the shortest time until first match played. However, statistical correlations between days until unrestricted practice or first match played and the ΔMPM were small (albeit statistically significant). Further research is required to determine how a delayed return to competition may relate to improved outcomes after ATRs.

An ML classifier was trained through careful feature engineering and selection, reaching an AUROC of 0.81 ± 0.10 after cross-validation, which indicates good discriminative performance [23]. The model’s performance was also evaluated regarding output probabilities using the Brier score loss, as it was deemed helpful for players and staff to gauge these against their individual beliefs and experiences. It should be noted that only pre-injury features were used to train the model, and no data regarding treatments was available. Of note, since features related to pre-injury match participation showed the highest feature importance, it can be inferred that the future level of match participation is related to the sporting context at the time of injury, directly or indirectly (e.g. a tendency for early RTP in high-performing players which may reflect negatively in match participation afterward).

The use of ML algorithms to predict sports injuries is a current trend in research [14, 17, 31], but practitioners should remain cautious regarding their use despite recent advances. There are ethical implications to consider [5], such as inadvertently hindering a player’s career through a wrongfully attributed worse prognosis. Model results may also be overly optimistic, either due to overfitting (when the model is fitted too closely to a particular set of data and becomes unable to make good predictions in a generalized environment) or accidental data leakage (when information contained in the test set is wrongfully fed to the model during training). Nevertheless, the increasing accessibility and ease of use of ML tools and development frameworks offer an excellent opportunity to improve the care of musculoskeletal injuries, though researchers and clinicians should stay vigilant about their shortcomings.

The main limitation of this study is the inability to confirm the diagnosis. However, all included cases were manually double-checked using other sources by two researchers independently to avoid the inclusion of misclassified injuries. Other limitations are the unknown measurement accuracy of match participation data found on transfermarkt.com, the unavailability of treatment data, and the lack of a stricto sensu measure of player performance.

This study can guide the objectives and expectations of athletes and staff regarding how match participation evolves after an ATR, noting that it takes approximately 1 year to reach its peak (except for goalkeepers, who may keep improving for at least 2 years). In addition, the cluster of players with improved match performance showed a statistically significant increase in the number of days until first match played compared to the remaining cohort. Also, a small but statistically significant positive correlation was found between time until first match played and the ΔMPM. Finally, recent research has shown improved outcomes in patients undergoing slowed-down rehabilitation programs [18]. Thus, it may make sense to prioritize recovery of lower limb strength and sport-specific skills over an early return to competition.

Conclusion

Exploratory data analysis revealed that forwards, midfielders and defenders increased match participation during the first year after injury, with goalkeepers still improving at 2 years. Good performance was attained using an ML classifier to predict the level of match participation following an ATR, with features related to pre-injury match participation displaying the highest importance.