1 Introduction

Depth camera technology and movement analysis tools have become increasingly available. Tools such as Microsoft’s Kinect sensor and SDK, which track body movements and allow their reproduction in videogames and other applications, have been explored more and more, including as aids for learning and practicing activities in which movement is essential and may even be seen as a mark of quality, personality and individuality [1]. During learning and training, it is important to have a measure of performance quality that is as precise as possible to serve as feedback, particularly if it can detect and show where mistakes happened and even suggest how to correct them.

Systems that use depth sensors to aid in sports, dancing and martial arts have been gaining prominence. Examples include the work of Chye, Connsynn and Nakajima [2], which uses Kinect to complement the training of martial arts beginners, and that of Hachaj, Ogiela and Piekarczyk [3], which uses a gesture description language to practice combat and Shorin-Ryu Karate techniques with a reduced risk of trauma. Another application of this technology in this context is distance training for dance and martial arts practitioners using virtual avatars [4].

This paper presents a Systematic Review of the use of depth cameras for the analysis of choreographed human movements in recent years, discussing aspects such as the techniques used for analysis, equipment and setup, applications and interaction.

Kinect is a sensor developed by Microsoft, initially for its Xbox 360 videogame console and later for computers. It is composed of two cameras, an RGB camera and a depth camera that uses an infrared projector and can measure distances from 0.8 to 3.5 m [5]. The widespread use of this sensor today is due mainly to two reasons, its relatively low cost and its high availability, and this review focuses on its use.

The paper is organized as follows: Sect. 1 is this introduction and Sect. 2 describes in detail the methodology adopted for the systematic review. One problem that is often considered in the development of most of these systems is the temporal alignment of movements, which facilitates the comparison of those performed by the user with those of another person or of some known dataset. Two techniques used for this task are prominent in the literature, Dynamic Time Warping and Hidden Markov Models. Both share similarities [6] and will be discussed more frequently, so they are briefly introduced in Sect. 3, along with a couple of other techniques. Sect. 4 presents and discusses the results of this work and Sect. 5 brings it to a conclusion.

2 Methodology

A Systematic Review (SR) is a standardized form of literature research, often performed to collect and classify the work done in a specific area or regarding a specific question and to show the state of the art in that area, providing a synthesis of the research regarding that question and its main results up to that point in time [7]. An SR follows strict criteria so that its results are trustworthy, reproducible and verifiable. Before the review is performed, several of its aspects must be decided and recorded, such as the research questions it must answer, control papers that it should find, databases to be searched, search strings, inclusion and exclusion criteria, what information will be extracted from each work and how it will be summarized. Below we summarize the most important of these aspects.

2.1 Research Questions

Every SR has, as a starting point, research questions that delimit the problem and act as an initial filter for the works found and that must be answered by the end of the process. The questions used in this work are:

  1. What methods are used to analyze and compare choreographed human movements (mostly martial arts and dance, but not restricted to them) captured with depth cameras, particularly Microsoft’s Kinect?

  2. What are the techniques, if any, for temporal alignment of the movements and to analyze their rhythm?

  3. What are the main applications of these systems?

  4. Is the quality of interaction in these systems, if it exists, analyzed? How?

2.2 Sources and Search Strings

The papers for this review were searched in the databases of the Institute of Electrical and Electronics Engineers (IEEE), the Association for Computing Machinery (ACM) and Springer, all of which bring together much of the most important work in this area and have a friendly user interface that facilitates the search process. In each of these databases, six customized searches were performed. Table 1 summarizes the strings used for these searches (which were adapted as needed for each particular engine) and the total number of papers found in each search, followed by the number of papers that were selected or rejected after the application of inclusion and exclusion criteria, the number of duplicated papers and the final number of papers that were used to extract information for this review.

Table 1. Search strings and number of papers found

This table shows that 1402 scientific papers were returned using these keywords and search strings, but only 20 were finally selected for the SR. It is interesting to note that, because we opted for independent searches instead of a single search with a long and complex string, many papers, almost half of them (648), were duplicates.
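The effect of running independent searches can be illustrated with a minimal sketch of how result lists might be merged and duplicates counted. This is only an illustration under stated assumptions: the function names and the example titles are hypothetical, matching is done on a normalized title only, and it is not the duplicate filter of the support tool used in this review.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace,
    so that trivially different copies of the same title compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_search_results(result_lists):
    """Merge several lists of titles, keeping the first copy of each paper.

    Returns the list of unique titles and the number of duplicates dropped.
    """
    seen = set()
    unique = []
    duplicates = 0
    for results in result_lists:
        for title in results:
            key = normalize_title(title)
            if key in seen:
                duplicates += 1
            else:
                seen.add(key)
                unique.append(title)
    return unique, duplicates


# Hypothetical titles, made up for this example only.
search_a = ["Kinect-based Karate Training", "Dance Scoring with DTW"]
search_b = ["Kinect based karate training.", "HMM Dance Models"]
unique, duplicates = merge_search_results([search_a, search_b])
print(len(unique), duplicates)
```

In practice, titles alone may be too weak a key (for instance for extended versions of the same paper), so a real protocol would likely also compare authors, year or DOI.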

2.3 Inclusion and Exclusion Criteria

Many of the works found in the initial search were excluded for, ultimately, being outside the narrow scope we selected for this review. This process of inclusion or exclusion happened through the following criteria, predetermined in the research protocol:

  • Inclusion

    • Work that analyzes sequences of multiple gestures (movements) with metrics to qualify the movements and using depth cameras;

    • Work with metrics for temporal analysis of the movement sequences.

  • Exclusion

    • Work analyzing independent gestures;

    • Work analyzing semi-random (not predetermined or choreographed) movement patterns;

    • Work that does not use depth cameras.

2.4 Support System

A free software system called “State of the Art through Systematic Review” (StArt) was used in this work to store and organize the papers found in this review. It is a rather interactive tool with features such as duplicate filtering and .bib support, developed by LAPES (Research Laboratory in Software Engineering) at the Federal University of São Carlos, in Brazil, and we would like to extend our thanks to its creators.

3 Brief Description of Techniques

In this section we explain very briefly the main algorithms used to analyze movements in the works included in this review: Hidden Markov Models (HMM), Dynamic Time Warping (DTW), Spherical Self-Organizing Maps (SSOM) and Gesture Description Languages (GDL).

  • HMM: stochastic model of temporal data series that represents the probability of the data occurring. The idea is that the states of the underlying process are unknown (hidden) but its outputs can be observed. It is derived from Markov chains and widely used in pattern recognition (including movement analysis), artificial intelligence and molecular biology [8, 9].

  • DTW: like HMMs, this algorithm also works on temporal series, but it solves the problem of finding a common path between two series of different lengths that are otherwise similar, without requiring initial or final points to be the same, creating a warping between the two paths and generally using Euclidean distance [10].

  • SSOM: clustering technique that creates a spherical mapping to indicate three-dimensional positions, searching for the neighbour that best fits the movement and creating a link to it [11].

  • GDL: used for both dynamic movements and static gestures, a script describes a movement or pose and, if recognized correctly by any other means, it is added to a heap, which may contain a chain of scripts or a single one [3, 12].
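As an illustration of the DTW item in the list above, the following is a minimal sketch of the textbook dynamic-programming formulation with Euclidean distance between frames. It is not the implementation of any of the reviewed papers: the function names and the toy sequences are assumptions made for this example, and this basic variant aligns the first and last frames of both series to each other (relaxing that constraint requires a modified boundary condition).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Cumulative cost of the best warping path between two sequences.

    Each sequence is a list of feature vectors (for example joint positions
    per frame). The sequences may have different lengths.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # frame of seq_a repeated against seq_b
                cost[i][j - 1],      # frame of seq_b repeated against seq_a
                cost[i - 1][j - 1],  # both sequences advance one frame
            )
    return cost[n][m]


# Toy one-dimensional "movement": the student performs the same trajectory
# as the master, but more slowly (some frames are repeated).
master = [(0.0,), (1.0,), (2.0,), (1.0,), (0.0,)]
student = [(0.0,), (0.0,), (1.0,), (1.0,), (2.0,), (2.0,), (1.0,), (0.0,)]
print(dtw_distance(master, student))  # prints 0.0
```

Because the slower performance differs from the reference only in timing, the warping path absorbs the difference and the cumulative cost is zero, which is why DTW is attractive for comparing a student with a master.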

4 Results and Discussion

In this section, we will begin by characterizing the set of papers analyzed in our review. Figure 1 illustrates the year of publication of the papers found in this review and shows that this body of work is quite recent, with most of it (60 %) from 2013. Because we focus on Kinect and how it is making this sort of research and application more easily available, this was expected, since it was released for the Xbox only in November 2010 and for Windows only in February 2012 (although even before the Kinect for Windows release there were several alternatives explored to work with the sensor on personal computers).

Fig. 1. Years of publication

From a geographic point of view, Fig. 2 illustrates which countries are publishing research in this particular area, showing that none of the countries is too far ahead, with each being responsible for 5 to 15 % of published papers.

Fig. 2. Countries

If grouped by continent, however, as shown in Fig. 3, Asia pulls significantly ahead (and the interest in both martial arts and computer vision in that continent is no surprise), followed respectively by Europe and America.

Fig. 3. Continents

Now we present the main results, according to the research questions listed previously. All studies made use of Kinect as a depth camera. Out of all of them, only two used more than one device in the experiment. Hachaj and Ogiela [13] used three sensors around a karate practitioner to aid in the learning process for martial arts techniques. It is interesting to note that the authors tested two distinct spatial configurations for the sensors: in the first, less efficient one, the sensors were separated by an angle of \(\pi/2\), and in the second, more efficient one, by an angle of \(\pi/4\). Another work by the same group [3], which verifies the execution of karate moves, compared the use of three sensors with that of a single sensor, reporting an error of 13 % in movement capture with three cameras and 39 % with only one. For static poses or gestures, however, the difference between the two setups was not significant.

Different tools were used for image capture and skeleton fitting with the Kinect. Chye and Nakajima [2] use the OpenNI/NITE framework and draw a silhouette of the captured body to develop a game to aid karate practitioners in training. Other systems [14-16] also use this framework to analyze movements in order to give dancers a post-exercise evaluation [14], compare dance movements to the Bashir method [15] and score karate moves [16]. Microsoft’s Kinect SDK was used in all other works (sometimes via a Unity wrapper), apparently being the most widely used alternative in this context. Table 2 briefly summarizes these papers.

Table 2. Work with Kinect SDK

In many cases the goal of the analysis was to compare movements between two performers (such as a novice and a master, or to measure the synchronicity of movement in a joint performance), or between one person and a pre-recorded video, using several distinct metrics. Other strategies involved recognizing specific basic postures or gestures, for instance six basic ballet poses (as in the work of Sun et al. [11], which used SSOM without much success in recognizing poses or movements beyond those), or tracking user movements and mapping them to an avatar in a virtual world with elements such as virtual obstacles. Rhythm was often discarded in these metrics (possibly due to the temporal alignment strategies used), even in choreographed performances in which rhythm should indeed have some importance.

Merely using Euclidean distance between feature vectors of positions often did not yield very conclusive results, but including velocity as a feature while still using Euclidean distance showed better performance. Kaewplee et al. [17] use only Euclidean distance without temporal alignment (but using subsequent movements to aid in the calculation of articulation angles) to analyze 24 basic Muay Thai movements. Chye and Nakajima [2] also use Euclidean distance and, like the previous work, had some difficulty comparing movements because of even slight temporal variations. Saha et al. [18] attempt to minimize the problem by defining an ideal speed for each movement and only comparing movements that did not deviate much from that speed. The same group used this approach again to recognize Indian dance moves [18]. Translating movements into a common description, such as using the Gesture Description Language [13] to create movement scripts and then comparing them, showed good results, with 90 % accuracy in recognizing karate movements and comparing them to those executed by a black belt expert, using a setup with three sensors [3]. Lin et al. [19] developed an algorithm, using 103 videos from a database, that only showed significant synchronization errors when the dancer stepped outside Kinect’s range.
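The sensitivity of unaligned Euclidean comparison to small timing differences, and the two mitigations mentioned above (velocity features and a speed check), can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any reviewed paper: the function names, the finite-difference velocity, the relative tolerance and the one-dimensional toy data are all hypothetical.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_framewise_distance(seq_a, seq_b):
    """Average frame-by-frame distance with no temporal alignment.

    The longer sequence is truncated to the length of the shorter one.
    """
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        raise ValueError("both sequences must contain at least one frame")
    return sum(frame_distance(seq_a[i], seq_b[i]) for i in range(n)) / n


def add_velocity(frames, dt=1.0):
    """Append a finite-difference velocity to each frame's position features.

    The first frame has no predecessor, so its velocity is set to zero.
    """
    out = []
    for i, frame in enumerate(frames):
        prev = frames[i - 1] if i > 0 else frame
        velocity = [(c - p) / dt for c, p in zip(frame, prev)]
        out.append(list(frame) + velocity)
    return out


def mean_speed(frames, dt=1.0):
    """Average displacement per time step over the whole sequence."""
    if len(frames) < 2:
        raise ValueError("at least two frames are needed to estimate speed")
    steps = [frame_distance(frames[i], frames[i - 1]) / dt
             for i in range(1, len(frames))]
    return sum(steps) / len(steps)


def within_speed_tolerance(frames, ideal_speed, tolerance=0.2, dt=1.0):
    """True if the mean speed deviates from the ideal by at most `tolerance`
    (expressed as a fraction of the ideal speed)."""
    return abs(mean_speed(frames, dt) - ideal_speed) <= tolerance * ideal_speed


# Same toy movement, but the second performance starts one frame late.
reference = [[0.0], [1.0], [2.0], [3.0]]
delayed = [[0.0], [0.0], [1.0], [2.0]]

print(mean_framewise_distance(reference, delayed))  # prints 0.75
print(add_velocity(reference)[1])                   # position followed by velocity
print(within_speed_tolerance(reference, ideal_speed=1.0))
print(within_speed_tolerance(delayed, ideal_speed=1.0))
```

A one-frame delay is enough to produce a non-zero distance for an otherwise identical movement, which matches the difficulty reported above. The speed check only decides whether a comparison should be attempted at all; it does not align the sequences.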

More sophisticated algorithms for tracking and comparing temporal series were also explored, such as DTW, SSOM and HMMs. Using SSOM with articulation angles and captured body part lengths, Dancs et al. [20] mostly ignored rhythm during training and obtained success rates of almost 90 % in leave-one-out and nearest neighbour validation and cross validation. Gupta and Goel [21] use DTW with Euclidean distance and Earth Mover’s Distance of finger positions to compare the performance of a subject and a master in Kathak. Zhu and Pun [14] used DTW to score dance practitioners in comparison to the Taiji dataset and reached success rates above 80 %. Bianco and Tisato [16] also use DTW and report 96 % precision in recognizing and scoring the execution of karate movements, a value similar to that obtained by Pisharady and Saerbeck [22] using DTW to identify dynamic hand movements. Alexiadis and Daras [23] performed an experiment with and without the use of DTW, which showed a difference of over 20 % in favor of its use when comparing movements to the Huawei 3DLife/EMC Grand Challenge dataset. Keerthy [24] uses DTW in his Master’s dissertation to create a Kung Fu training assistant that compares student and master movements. HMM was another technique widely explored. Anbarsanti and Prihatmanto [25] obtained promising preliminary results in modeling the Likok Pulo dance using HMMs and classifying six individual basic dance movements and one undefined movement with almost 95 % accuracy. Masurelle et al. [15] also used HMMs to classify dance movements from a salsa database called 3DLife, comparing the results with the Bashir technique and obtaining 74 % positive matches. Figure 4 summarizes the frequency of use of these approaches to compare and classify human movements.

Fig. 4. Comparison approaches
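The HMM-based classification of movements discussed above can be illustrated with a minimal sketch of the forward algorithm for discrete HMMs: one model is kept per movement class and a sequence of observed symbols is assigned to the model that gives it the highest likelihood. Everything here is an assumption made for illustration, not taken from the reviewed papers: the function names, the two toy models, their hand-picked probabilities and the idea that each frame has already been quantized into a pose code (0 or 1). No scaling or log-space arithmetic is used, so the sketch is only suitable for short sequences.

```python
def forward_likelihood(obs, start, trans, emit):
    """Likelihood of an observation sequence under a discrete HMM.

    obs:   list of observed symbol indices
    start: start[s] = probability of starting in state s
    trans: trans[p][s] = probability of moving from state p to state s
    emit:  emit[s][k] = probability of state s emitting symbol k
    """
    if not obs:
        raise ValueError("the observation sequence must not be empty")
    n_states = len(start)
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


def classify(obs, models):
    """Return the name of the model with the highest likelihood for obs."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


# Two hypothetical left-to-right models over pose codes 0 and 1.
# "raise" tends to emit pose 0 and then pose 1; "lower" does the opposite.
left_to_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "raise": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.1, 0.9]]),
    "lower": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.9, 0.1]]),
}

print(classify([0, 0, 1, 1], models))  # prints raise
print(classify([1, 1, 1, 0], models))  # prints lower
```

In a real system the model parameters would be estimated from training recordings rather than written by hand, and the observations would typically be continuous joint features instead of two pose codes.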

Only two papers described some form of analysis of the quality of interaction. Holsti et al. [26], in an application to aid in trampoline jumping, used questionnaires to evaluate their system’s usability, with 90 % of users giving positive feedback despite complaining about the delay when showing the movement. Wada et al. [27] also analyzed the usability of their system, which analyzes kata positions, obtaining 89 % positive feedback.

5 Conclusion

The use of computer vision in everyday applications is becoming more and more frequent, particularly with the popularization of smartphone cameras and of depth cameras such as Microsoft’s Kinect, which, when applied to analyzing choreographed human movements, was one focus of this systematic review. In this context, most papers we found take advantage of the Kinect SDK instead of alternatives. Euclidean distance between feature vectors containing joint positions or angles was often used but showed poor results, often due to differences in the temporal alignment of the movements being compared. These results could be improved by limiting the range of performance speeds to be analyzed or by adding speed to the feature vectors. Comparing standardized descriptions of gestures, movements and performances instead of the raw data from the sensors was another approach found. Out of the set of more sophisticated techniques to classify temporal series, DTW was the most commonly used in this context and showed good results, followed by SSOM and HMMs. The quality of interaction with these systems was seldom analyzed in the papers included in this review.