Predicting Human Actions Taking into Account Object Affordances

Anticipating intentional human actions is essential for many applications involving service robots and social robots. Assistive robots must now reason beyond the present and predict future actions. This is difficult because the problem is non-Markovian and depends on rich contextual information. The task requires capturing the subtle details inherent in human movements that may imply a future action. This paper presents a probabilistic method for action prediction in human-object interactions. The key idea of our approach is the description of the so-called object affordance, a concept which allows us to deliver a trajectory visualizing a possible future action. Extensive experiments were conducted to show the effectiveness of our method in action prediction. For evaluation we applied a new RGB-D activity video dataset recorded by Senz3D depth sensors. The dataset contains several human activities composed of different actions.


Introduction
In everyday life a human performs various actions. Being able to detect and anticipate which action is going to be performed in a complex environment is important for assistive robots, social robots and healthcare assistants. Such an ability requires reasoning tools and methods.
With such a capability [20], a robot is able to plan ahead with reactive responses and to avoid potential accidents. When a partial observation is available, we should be able to predict what is going to happen next (e.g., a person is about to open the door, as shown in Fig. 1).
Predictive models are also useful for detecting abnormal actions in surveillance videos and alerting emergency responders [38]. A reliable prediction must be made at an early stage of an action, e.g., when only 60% of the whole action has been observed.
The preliminary version of the paper was presented during the "International Workshop on Robot Motion Control (RoMoCo)", 2017, Poland.
Vibekananda Dutta, vibek@meil.pw.edu.pl; Teresa Zielinska, teresaz@meil.pw.edu.pl. Institute of Aeronautics and Applied Mechanics, Warsaw University of Technology, ul. Nowowiejska 24, 00-665 Warsaw, Poland.

Recent research focuses on the action recognition problem [16,24,32]. Although a few recent works have addressed ongoing activity recognition with only partial information available [31,36], they do not answer how to perform activity prediction. Reliable action prediction relies on selecting and processing the crucial information, e.g., scene context, object properties (affordance, object texture) and relative human-object posture. Action prediction has two features:
- anticipating human actions requires identifying the subtle details inherent in human movements that would lead to a future action,
- the action prediction problem must be solved with a focus on temporal human interactions with the environment (e.g., interaction with objects or with other people).
In this work, we discuss the problem of action prediction in natural scenarios using a collection of examples of human actions in the real world sampled by video records (WUT-ZTMiR dataset, CAD-60 dataset). We investigate how user behavior evolves dynamically over a short time. Our goal is to infer the action that a person is going to execute in the near future. This paper is an extension of [10]. Compared to the previous material, which was a short description, the method presented in this paper is an enhanced version with all relevant details. Moreover, an improved method for temporal segmentation and feature extraction is introduced. The applied probability functions are justified here and the so-called limiting condition (Section 7) is summarized. Additionally, besides the previously described online experiments, testing using offline data is discussed. We evaluated action prediction in both real-time and offline settings across the two datasets, covering a wide range of actions. The contribution of this paper is four-fold:
- an improved method for action prediction is proposed,
- the concept of object affordances and scene context for human action prediction is formally described,
- a rapid training and testing method for action prediction is summarized,
- the proposed method is tested together with an evaluation of its efficiency.
The remaining part of the paper discusses this contribution in detail. Section 2 reviews the related works. Section 3 describes the physical setup of the experiment. Video pre-processing is discussed in Section 4. The probability functions are summarized in Section 5. The description of motion trajectories is presented in Section 6. Sections 7 and 8 present the implementation method and the experimental results. The paper ends with conclusions.

Related Works
Action Recognition
Human action recognition has become an extremely important research topic. Earlier research mostly addressed recognizing simple human actions, such as running, walking and standing, in constrained settings [32]. However, recent research has gradually moved towards understanding complex actions in real-time records and in still images collected under various conditions [21,22,44]. These data typically involve occlusions, noisy backgrounds, changing viewpoints, etc., and require significant effort for action recognition. Most action recognition approaches based on still images treat the problem as a pure image classification problem using, e.g., the mutual context model [43].
The mutual context model considers the bounding boxes of objects and human body parts, which are difficult to obtain, especially for a large number of images. Other works consider human "skeleton" features collected using the Kinect sensor together with the object position.
Recent contributions rely on scene modeling [17] and human pose description [34]. The work in [6] represents the transition of a "skeleton" pose through a Riemannian manifold. Riemannian manifolds have proven useful for dealing with features and models that do not lie in Euclidean spaces. Those manifolds are used to analyse human action similarity graphs that are mapped to a new space. A similar approach was adopted by Slama et al. [35], who classified activities using a Linear Support Vector Machine (LSVM), taking into account the trajectory of the human position on learned Grassmann manifolds, which are special cases of Riemannian manifolds. The work described in [28] utilizes a multivariate Gaussian distribution to model the intermediate poses; the temporal deviations of activities were considered. Papadopoulos et al. [26] proposed a real-time "skeleton" tracking-based method for human action recognition which uses as input a sequence of depth maps captured by a single Kinect sensor. The approach applies a motion energy-based concept: the spherical angles between the selected joints are evaluated together with their respective angular velocities in order to handle the execution differences among individuals performing the same actions.

Action Prediction
Recent research has attempted to extend the concept of human action recognition to future actions. Some recent contributions on predicting actions aim at recognizing unfinished actions. The method described in [15] uses the so-called max-margins for discriminating the action classes. Lan et al. [20] proposed a hierarchical representation of possible future actions. Li and Fu [23] explored the prediction problem for long-duration actions. However, their work concentrated on identifying motion fragments by finding the associated velocity peaks, so it is not applicable to an unconstrained set of movements. The work presented in [39] describes how to consider object affordances for predicting, in a static scenario, what action will happen. Activity forecasting, which aims at reasoning about a human's preferred motion path for a given goal, has been explored in [14].
Other works capture human actions by representing the possible motion trajectories, taking into account the detected points of interest [30,40]; the so-called key-frames were used for this purpose in [29]. Most previous action recognition methods were designed for recognizing complete actions, assuming that the action in each test will be fully executed. This makes these methods unsuitable for predicting actions from partial observations.
We concentrate on a probabilistic approach to modeling action prediction (Fig. 2), taking into account object affordances (the affordance is explained later in the text). We represent the types of human actions using a dynamic representation of the human-object relation. Our method describes an action as a sequence of human postures and relates it to the information about object affordances, which is a new approach compared to [20,29,40]. The proposed method also allows long-term activities to be predicted and allows us to visualize how a human is going to perform an action, using trajectory prediction. An appropriate training method together with the affordance function reduces the computational time compared to other methods [14,18]. The experiments indicated that the proposed method is promising and can be advantageous compared to the state-of-the-art results.

Action Prediction Methods
A general overview of action prediction is given in Fig. 2. As can be seen, the methods consist of 3 phases, namely: (a) preprocessing, (b) feature extraction, and (c) model formulation. Preprocessing is the term for low-level operations on videos. Generally, the aim of preprocessing is to divide the input video into multiple temporal segments, where each segment represents an action or a sub-activity. Feature extraction means finding the most compact and informative set of parameters (features), selected in such a way that they provide sufficient information and ensure processing efficiency. The next step is model formulation, which means building a model that represents an action and is used to recognize an activity.
Feature extraction methods are in general divided into three categories: (a) low-level features, (b) mid-level features, and (c) high-level features.
Low-level methods (HoG) [5] focus on the static appearance and shape within an image frame, which can be described by the distribution of intensity gradients or edge directions. This group also includes the methods in which the motion trajectories are obtained by tracking densely sampled points using dense optical flow fields [41].
The mid-level category considers mainly the semantic meaning of a scene and is usually built using low-level features. In the actionlet [23] method belonging to this category, the first step is to temporally divide the activity into actions (reaching, drinking). These segments are called actionlets and represent the atomic actions. The poselet [2] feature extraction method describes a particular part of a human pose under a specific viewpoint. Poselets are not necessarily semantic. The onset [31] feature extraction approach captures activity information from the sequence of actions which are components of an activity. The onset concept summarizes pre-activity observation in addition to ongoing observation.
The high-level feature concept [12] extracts spatial and temporal features jointly from the input videos.
The created models can be of three types: (a) discriminative models, (b) generative models, and (c) deep networks. Discriminative models generally use a conditional probability distribution. The Support Vector Machine (SVM) is the most popular approach used for data classification [42]. Conditional random field (CRF) models are used for describing the predictions [18].
Generative models also concentrate on modeling a conditional probability distribution, but they require more detailed descriptions parameterized by time. The most common tool used for action recognition and action prediction is the Hidden Markov Model (HMM) [3,27].
Examples of deep network models are models using "memory" [25] in the form of a recurrent neural network. Such models are able to capture useful information from previous observations and to hold the long-range context. In [38] a regression network was used to anticipate human actions. Such networks were trained to predict the visual representation in the future.

Physical Setup
The setup for data recording consists of two fixed-viewpoint cameras placed on tripods with an adjustable height of 1.5-1.7 m. The Senz3D [4] RGB-D cameras that we applied consist of an RGB sensor and a depth sensor. We used down-sampled images (640 × 480 pixels), since real-time decoding and display of multiple streams of high-resolution video is a bottleneck. The recording rate was 60 frames per second. The camera system is able to register 3D locations such as the human pose and the object position. We used several objects on which the manipulations were performed. The camera range for human observation is 1-3 m, as shown in Fig. 3b. The experiments were recorded both in the daytime and in the evening, thus the lighting conditions varied from daylight to artificial light, and involved both plain and textured backgrounds.
Customized programming tools were developed for data extraction from the raw images. The software was developed using the C++ language and the Robot Operating System (ROS). We applied a single coordinate system with the origin placed at the midpoint between the two cameras (see Fig. 3a). The positions registered by each of the cameras were transformed into this coordinate system accordingly [13].

Basic Assumptions
A human activity is something that a person is doing or is going to do; an activity is a state of doing. We assume that an activity consists of a sequence of elementary actions. The aim of our work is to predict actions, which helps in anticipating the human performance in fragments of the whole activity and which, after more research, can lead to the prediction of a whole activity.
The first stage of our work consists of data collection; next the data are preprocessed and used for establishing the probabilistic models describing the possible motion trajectories. Finally, those models are used for action prediction. At the end, we performed a set of tests evaluating the correctness of the predictions.

Recording
We first recorded activities performed by 4 different persons: 3 males and 1 female. Let us denote an activity by A. For each activity we made M recordings. Let $F^A_m$ denote the m-th record of the A-th activity (in our case M = max(m) ≥ 50). Each record $F^A_m$ consists of f frames, where f can vary from case to case. The m-th record of the A-th activity consisting of f frames is denoted by $F^A_m(f)$. In our work, we considered temporal segmentation by partitioning the activity into a group of actions.
This means that $F^A_m(f)$ is divided into smaller parts by a human expert (see Fig. 4). Each part represents an action (actions are the parts of an activity). Therefore, the a-th part of $F^A_m(f)$ represents the a-th action, where max(a) ≥ 1. The segmentation is made bearing in mind that the final goal is to predict an action. It must be made in such a way that the groups of frames in each segment represent atomic movements of the human and/or of an object subjected to an action. It must be noted that a mistake in segmentation affects the subsequent prediction procedure, and all the predictions can then perform poorly. We carefully followed the segmentation approach described in [15].

Temporal Segmentation
The part representing the $a_i$-th action is again divided into segments. This does not require expertise, and a simple division method is sufficient. Such segmentation is needed for creating the probability functions. An action can start with the human hand at different distances from the object of interest and can be performed at different speeds; therefore, to gather the relevant data, it is necessary to select and analyse a fixed number of video segments covering the period until the end of the action.
Let us consider a complete action segment $a_k$ containing $f_k$ frames. We divide it into K uniform segments (in this work K is fixed to 10). In general, each segment contains $f_k / K$ frames. For different video lengths, $f_k$ will vary, and in certain cases an appropriate unification is required. For example, if an action video $a_k$ contains 233 frames, it is not possible to segment them uniformly into K segments. In such cases, we perform additional pre-processing: if $[f_k \bmod K] = 0$, the frames are uniformly segmented into K segments; otherwise, we select the first image frame $f_k^{(1)}$ of the action $a_k$ and replicate it the required number of times (see Fig. 5). With this modification, further automatic processing is possible.
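The unification step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_into_segments` is hypothetical, and the padding rule (repeating the first frame until the frame count is divisible by K) is an assumption, since the exact number of repetitions is not stated in the text.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def split_into_segments(frames: Sequence[Frame], k: int = 10) -> List[List[Frame]]:
    """Split the frames of one action into k segments of equal length.

    Assumption: when len(frames) is not divisible by k, copies of the first
    frame are prepended until it is.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not frames:
        raise ValueError("frames must not be empty")
    pad = (-len(frames)) % k              # copies of the first frame to add
    padded = [frames[0]] * pad + list(frames)
    size = len(padded) // k               # frames per segment
    return [padded[i * size:(i + 1) * size] for i in range(k)]


# The 233-frame example from the text: 7 copies are added, giving 10 x 24 frames.
segments = split_into_segments(list(range(233)), k=10)
```

Under this assumption the original frame order is preserved and only the beginning of the action, where the hand is still far from the object, is lengthened.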

Features Extraction
We can recognize a human action by looking at the person's current pose and interaction with the object or objects over time; this is captured by a set of so-called features. Features are quantities which are relevant for establishing the probabilistic models of actions. The choice of good features is strongly target-oriented. The features we use are rather simple.
In this work, we extracted three important features: (a) hand and torso position H, (b) object position O, (c) spatio-temporal features, which in our case are the distance d and the angle θ, as shown in Fig. 3a.

Hand Position
The feature H, describing the wrist joint position (for both hands) and the torso position, is obtained using the Skeltrack library API (https://people.igalia.com/jrocha/skeltrack/doc/latest/), which automatically delivers the positions in the camera frame (a simple transformation then converts them to the global reference frame). The information about the hand is very important. In particular, we want to capture information such as "hand is near the object" or "hand is near the mouth". To do this, we evaluated the distance of the hand to the object and to the camera. In general, the Skeltrack library provides a tool for the stick diagram visualization of the human body, which is described by the lengths of the links and by the joints (Fig. 6). More specifically, in this work, the tracking algorithm detects the position of the following set of points in 3D space, denoted by J = {Torso (T), Left Wrist (LW), Right Wrist (RW)}. The features are the (x, y, z) coordinates of each of the above points. The feature matrix H therefore collects the coordinates $(x_j, y_j, z_j)$ of each point $j \in J$, where x, y are the positions of the points as they are seen in the video frame. These positions, expressed first in pixel coordinates, are converted to metric coordinates in the global reference frame. The z coordinate comes from the cameras' distance sensors and is expressed in the same units as the first two coordinates (see Fig. 3).

Object Features
O represents a vector containing the x, y, z coordinates of the object center. For a full description we use an object identifier and position information (the identifier can be represented by a QR code on the object). In our work, we performed both object detection and object tracking (Fig. 7).
We consider two types of objects: (a) larger objects (e.g., door, table, box, whiteboard), (b) smaller objects (e.g., marker, bottle, cup). Larger objects are labelled with QR codes, which can be properly recognized from different points of view using a label-based object detection method [8].
For the smaller objects we use the "Lucas-Kanade" descriptor (KLD) field test, which in simple terms means searching for an object whose picture is stored in the database. Moreover, we evaluate the distance travelled by the objects when they are manipulated by the human.

Spatio Temporal Features
The spatio-temporal features, namely the distance d and the angle θ (Fig. 3a), describe the relation between the human hands and the object features.
We evaluate the distance d from a certain moment in time until the end of an action (in our case we consider the frame which marks 40% of the frames from the end of the action segment). For each action, we collect such data by repeating the recording M times. For each record we store the distance between the wrist position at the above-mentioned moment and the object of interest (the object to be manipulated). The collected data are used for evaluating the mean values $\mu_d$, $\mu_\theta$ and the variances $\sigma_d^2$, $\sigma_\theta^2$, which are applied later as the parameters of the probability functions. Those functions are used for inferring the destination of the performed motion.
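The collection of the distance statistics can be illustrated with a short sketch. It is written under several assumptions that are not taken from the paper: each record is a pair of equally long lists of (x, y, z) positions (wrist and object), the reference frame is the one that leaves 40% of the frames after it, and the sample variance is used. All function names are hypothetical. The angle θ would be handled analogously once its geometric definition (Fig. 3a) is fixed.

```python
import math
from statistics import mean, variance
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Record = Tuple[Sequence[Point], Sequence[Point]]   # (wrist track, object track)


def reference_index(n_frames: int, tail_fraction: float = 0.4) -> int:
    """Index of the frame that leaves `tail_fraction` of the frames after it."""
    return int(round((1.0 - tail_fraction) * (n_frames - 1)))


def distance_at_reference(wrist: Sequence[Point], obj: Sequence[Point]) -> float:
    """Euclidean wrist-object distance at the reference frame of one record."""
    i = reference_index(len(wrist))
    return math.dist(wrist[i], obj[i])


def distance_statistics(records: List[Record]) -> Tuple[float, float]:
    """Mean and sample variance of d over the M records of one action."""
    ds = [distance_at_reference(w, o) for w, o in records]
    return mean(ds), variance(ds)


# Two toy records: the wrist moves along x, the object stays fixed.
wrist_track = [(float(i), 0.0, 0.0) for i in range(11)]
records = [
    (wrist_track, [(10.0, 0.0, 0.0)] * 11),
    (wrist_track, [(12.0, 0.0, 0.0)] * 11),
]
mu_d, var_d = distance_statistics(records)
```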
For each object which can be manipulated (each action) in the human vicinity, the video recordings are made as described above and the values of $\mu_d$, $\mu_\theta$, $\sigma_d^2$, $\sigma_\theta^2$ are gathered for each of those objects. Next, for each object the probability function representing its chance of being manipulated is produced. The function consists of two terms: the so-called distance preference (DP) and the angular preference (AP). The distance preference probability $P(DP_i)$ (for action $a_i$) is represented by the normal (Gaussian) distribution, and the angular preference probability $P(AP_i)$ is described by a modified Wrapped Normal distribution. The justification for selecting these functions is discussed in Section 5.
Put simply, during the human hand motion the object most likely to be manipulated (and hence the associated action) is the one for which the current distance (and angle) is closest to $\mu_d$ ($\mu_\theta$). More precisely, the applied probability functions deliver the probability of reaching each of the objects of interest, computed for each of them on the basis of the current values of d, θ and the set of parameters $\mu_d$, $\mu_\theta$, $\sigma_d^2$, $\sigma_\theta^2$. This is the action selection: the action $a_k$ is selected among all possible actions $a_i$ (i = 1, 2, ..., N) as the one with the highest probability.
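The selection rule can be sketched as an argmax over the product of the two preference terms. This is a hedged illustration only: it uses the plain wrapped normal density (a Gaussian summed over 2π shifts) rather than the modified version used in the paper, and the parameter values and the names `select_action` and `models` are made up for the example.

```python
import math
from typing import Dict, Tuple

# (mu_d, var_d, mu_theta, var_theta) for each candidate action; toy values.
Params = Tuple[float, float, float, float]


def gaussian_pdf(x: float, mu: float, var: float) -> float:
    """Normal density used here for the distance preference P(DP_i)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def wrapped_normal_pdf(theta: float, mu: float, var: float, terms: int = 10) -> float:
    """Plain wrapped normal density used here for the angular preference P(AP_i)."""
    return sum(gaussian_pdf(theta + 2.0 * math.pi * k, mu, var)
               for k in range(-terms, terms + 1))


def select_action(d: float, theta: float, models: Dict[str, Params]) -> str:
    """Return the action a_k with the largest product P(DP_i) * P(AP_i)."""
    scores = {
        name: gaussian_pdf(d, mu_d, var_d) * wrapped_normal_pdf(theta, mu_t, var_t)
        for name, (mu_d, var_d, mu_t, var_t) in models.items()
    }
    return max(scores, key=scores.get)


models = {
    "reaching": (0.40, 0.01, 0.50, 0.20),
    "opening": (0.90, 0.02, 1.50, 0.30),
}
predicted = select_action(d=0.42, theta=0.6, models=models)
```

With these toy parameters the observation (d = 0.42, θ = 0.6) lies close to the "reaching" means, so that action is selected.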

Probability Functions
The probability of an action is naturally related to the object of interest and the "easiness" of reaching or manipulating it. We therefore call it the object affordance. The object affordance in our case results from the angular and distance preferences and is expressed as a product of two probability functions justified by experiments.

Distance Preference
The distance preference is described by the normal (Gaussian) distribution parameterized by the mean $\mu_d$ and the variance $\sigma_d^2$.
A standard statistical test was applied to check whether the data are consistent with the selected distribution. A common test in such a case is the Shapiro-Wilk normality test, which performs well for small sample sizes, as in our case. The normal distribution plot given in Fig. 8 indicates that the distance features follow a normal distribution. Figure 8 also visualizes the probability distribution.

Angular Preference
The angular position of the human with respect to an object is very relevant in certain actions. For example, the reaching action covers a wider range of angles than the drinking action. This is a case of "circular" statistics [33], where the data are expressed on an angular scale, typically around the circle. Here we applied the Wrapped Normal (W N) distribution introduced in [19]. The W N distribution is obtained by wrapping the probability density function of a linear random variable around the circumference of a (unit) circle. The von Mises and the W N distributions are very similar; both are circular analogs of the normal distribution. However, the Wrapped Normal distribution is more convenient for reasoning and is well explored in various activity recognition approaches [1,7,11]. Therefore, in our work, the probability function of the angular preference is expressed by the W N distribution. At the very beginning, we considered the von Mises distribution to capture the angular data [9]. However, due to its poor performance for larger values of σ, we moved to a distribution that retains the normality feature for larger values of σ. In such a case, the modified version of the W N distribution, which is expressed in terms of the Jacobi theta function, is an appropriate choice.
Fig. 8 From left to right: 2D Gaussian distribution (both side view and top view) for reaching an object; x and y represent the coordinates of the points (as described in the text) marking the hand position (the figure is taken from [10]). On the right, the histogram plot of the data justifying the normal distribution

The corresponding angular probability distribution function integrates to unity over [0, 2π]. The justification of the angular preference probability function was made using goodness-of-fit tests based on Watson's $U^2$ statistic [37]. A goodness-of-fit test makes it possible to determine whether or not more complex models need to be considered. The advantage of Watson's $U^2$ statistic is that it is location invariant and thus does not depend on how the starting direction is assigned to the circle. A circular plot (Fig. 9) of the chosen statistical test for different values of the parameter σ shows that the data successfully follow the considered distribution (i.e., $U^2 < U^2_{critical}$).
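The normalization property and the link to the Jacobi theta function can be checked numerically. The sketch below uses the standard wrapped normal density, not the paper's modified version, whose exact form is not reproduced here. It evaluates the density in two equivalent ways: as a sum of shifted Gaussians, and as the Fourier series $\frac{1}{2\pi}\left[1 + 2\sum_{n\ge 1}\rho^{n^2}\cos(n(\theta-\mu))\right]$ with $\rho = e^{-\sigma^2/2}$, which is $\frac{1}{2\pi}\vartheta_3\!\left(\frac{\theta-\mu}{2}, \rho\right)$. The function names are illustrative.

```python
import math
from typing import Callable


def wn_series(theta: float, mu: float, sigma: float, terms: int = 20) -> float:
    """Wrapped normal density as a sum of Gaussians shifted by multiples of 2*pi."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-((theta - mu + 2.0 * math.pi * k) ** 2) / (2.0 * sigma ** 2))
        for k in range(-terms, terms + 1)
    )


def wn_fourier(theta: float, mu: float, sigma: float, terms: int = 50) -> float:
    """The same density through its Fourier (Jacobi theta type) expansion."""
    rho = math.exp(-sigma ** 2 / 2.0)
    s = sum(rho ** (n * n) * math.cos(n * (theta - mu)) for n in range(1, terms + 1))
    return (1.0 + 2.0 * s) / (2.0 * math.pi)


def integrate_over_circle(pdf: Callable[[float], float], steps: int = 2000) -> float:
    """Rectangle-rule integral of a 2*pi-periodic function over [0, 2*pi)."""
    h = 2.0 * math.pi / steps
    return h * sum(pdf(i * h) for i in range(steps))


total = integrate_over_circle(lambda t: wn_series(t, mu=0.3, sigma=1.0))
```

The Fourier form converges quickly for large σ, while the Gaussian sum converges quickly for small σ, which is one practical reason to keep both at hand.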

Motion Trajectories
Prediction
An object can be approached by various types of motion trajectories depending on the action that is going to be performed [10]. Once a target location is estimated on the basis of the probability function given in Eq. 2, we generate a nominal future trajectory from the current location (depending on the situation, this can be the location of the hand, the object or a considered joint) to the predicted target location.
Fig. 10 Graphical representation of cubic Bezier curve

A nominal future trajectory of the human hand is produced using the parametric cubic equation of the Bezier curve (see Fig. 10). The advantage of this equation is that it does not generate a fragment that lies outside the outline of the so-called control points (commonly called the "hull" of the curve). In fact, we can control how the relevant points contribute to the value generated by the function, so we can influence how important the points are to the curve. The Bezier curve is a polynomial of p, with the Bezier parameter p in the range [0, 1]:
$$B(p) = (1-p)^3\, t_0 + 3(1-p)^2 p\, t_1 + 3(1-p)\, p^2\, t_2 + p^3\, t_3,$$
where $t_i = \{x_i, y_i, z_i\}$ and i = 0, 1, 2, 3. Such a cubic Bezier curve is parameterized by a set of four points: the start and end points of the trajectory ($t_0$ and $t_3$), and two control points ($t_1$ and $t_2$) which define the shape of the curve. In our case, $t_0$ is the current position of the hand. The point $t_3$ is the end point of the action indicated by the probability function. The control points $t_1$ and $t_2$ are produced using the training data: they are taken from the previously recorded trajectory whose beginning is closest to $t_0$.
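A nominal trajectory of this kind can be sampled with a few lines of code. The sketch below evaluates the standard cubic Bezier polynomial through its Bernstein weights; the point values are arbitrary and the function names are illustrative, not taken from the authors' software.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def cubic_bezier(t0: Point, t1: Point, t2: Point, t3: Point, p: float) -> Point:
    """Evaluate the cubic Bezier curve at parameter p in [0, 1]."""
    q = 1.0 - p
    weights = (q ** 3, 3.0 * q * q * p, 3.0 * q * p * p, p ** 3)  # Bernstein weights
    x, y, z = (
        sum(w * pt[axis] for w, pt in zip(weights, (t0, t1, t2, t3)))
        for axis in range(3)
    )
    return (x, y, z)


def sample_trajectory(t0: Point, t1: Point, t2: Point, t3: Point,
                      n: int = 20) -> List[Point]:
    """Return n points from the current hand position t0 to the target t3."""
    return [cubic_bezier(t0, t1, t2, t3, i / (n - 1)) for i in range(n)]


hand = (0.0, 0.0, 0.0)        # t0: current hand position
target = (1.0, 0.0, 0.0)      # t3: predicted target location
trajectory = sample_trajectory(hand, (0.0, 1.0, 0.0), (1.0, 1.0, 0.0), target)
```

Because the weights are non-negative and sum to one for every p in [0, 1], each sampled point is a convex combination of the four points, which is the hull property mentioned above.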

Heat Map Around Trajectory
We defined a potential function that allows us to visualize the possible motion area. The heat-map visualization was implemented in a software module using an exponential Gaussian kernel function $f(h_m)$. The map visualizes the active region around the trajectories and the target location when the corresponding affordance is active.
Here $d_T$ represents the point in question (e.g., the current position of a human hand) and $\mu_{d_T}$ is the point of the anticipated trajectory or the target location. The parameter σ denotes the radius of the Gaussian kernel (the value is adjustable in this work). With the above Gaussian kernel estimate, we can visually represent the expected regions by marking greater heat with "warmer" colors. Accordingly, the red color denotes the maximum likelihood region.
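A heat value around a predicted trajectory can be sketched as below. The kernel form $\exp\left(-\lVert d_T - \mu_{d_T}\rVert^2 / (2\sigma^2)\right)$ and the rule of taking the maximum response over the trajectory points are assumptions made for illustration; the exact expression of $f(h_m)$ used in the paper is not reproduced here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def gaussian_kernel(d_t: Point, mu: Point, sigma: float) -> float:
    """Assumed kernel: 1 at the trajectory point, decaying with distance."""
    return math.exp(-math.dist(d_t, mu) ** 2 / (2.0 * sigma ** 2))


def heat(d_t: Point, trajectory: Sequence[Point], sigma: float = 0.1) -> float:
    """Heat in [0, 1]: the largest kernel response over the trajectory points."""
    return max(gaussian_kernel(d_t, mu, sigma) for mu in trajectory)


trajectory = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0)]
on_path = heat((0.1, 0.0, 0.0), trajectory)     # a point lying on the trajectory
far_away = heat((5.0, 0.0, 0.0), trajectory)    # a point far from the trajectory
```

Mapping the heat value to a color scale (with values near 1 drawn in red) then gives the visualization described above.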

Implementation
During implementation, the collected sets of d and θ were grouped with respect to the objects of interest. When the person starts the motion and there are several objects of interest in the environment (objects which can be used when performing the activity), it is not clear which object the person will focus on. To make the reasoning process simpler, we introduced a limiting condition f(R) for selecting which set of objects (and hence which set of actions) should be considered. The condition is defined in terms of $T^{o_t}_{near}$ and $T^{o_t}_{far}$, which represent the near and far distance limits to the object. Using this condition, only the objects which are relatively near the human hand are considered. For example, in real-time experiments, the following objects of interest were identified: (a) a glass, (b) a bottle, and (c) the door. The distance between the human torso and the door is relatively smaller than the distance between the human torso and the glass or the bottle. In our software system, we also defined the associations between the objects and the types of actions which can be performed on them [15,18]. For example, if a hand is near the glass, the possible action is grasping. But if we consider the same situation and the object is a computer monitor, the possible action is turning on/off instead of grasping. This does not require any sophisticated algorithms.
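The filtering of candidate objects and the object-action associations can be sketched as follows. The form of the limiting condition (keep an object when its distance to the hand lies between the near and far limits) is an assumption, because the original expression of f(R) is not reproduced here; the table `OBJECT_ACTIONS`, the limits and the positions are hypothetical and only follow the examples given in the text.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]

# Hypothetical object-to-action associations, following the examples in the text.
OBJECT_ACTIONS: Dict[str, List[str]] = {
    "glass": ["grasping"],
    "monitor": ["turning on/off"],
    "door": ["opening"],
}


def within_limits(hand: Point, obj: Point, t_near: float, t_far: float) -> bool:
    """Assumed limiting condition: t_near <= hand-object distance <= t_far."""
    return t_near <= math.dist(hand, obj) <= t_far


def candidate_actions(hand: Point, objects: Dict[str, Point],
                      t_near: float = 0.0, t_far: float = 1.0) -> Dict[str, List[str]]:
    """Objects that pass the limiting condition, with their possible actions."""
    return {
        name: OBJECT_ACTIONS.get(name, [])
        for name, pos in objects.items()
        if within_limits(hand, pos, t_near, t_far)
    }


scene = {
    "glass": (0.3, 0.0, 0.0),
    "monitor": (0.8, 0.1, 0.0),
    "door": (2.5, 0.0, 0.0),
}
candidates = candidate_actions((0.0, 0.0, 0.0), scene)
```

Only the actions of the remaining objects then need to be scored by the probability functions, which keeps the reasoning simple.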

Testing
In this section, we describe the evaluation of the presented method for both: (a) offline data and (b) real-time settings.We first give the details of the dataset in Section 8.1.We then present the experimental results in Section 8.2, and the performance analysis of the proposed approach is discussed in Section 8.3.
Fig. 13 Action prediction accuracy results. The comparisons of the proposed method against other methods on both WUT and CAD datasets (the figures are taken from [10]). The figure is best viewed in color
Fig. 14 Error matrices of action prediction on the test video records of both WUT and CAD datasets. Figure 14a shows the confusion matrix of prediction accuracy for the WUT dataset. The confusion matrix of prediction accuracy for the CAD test dataset is shown in Fig. 14b
We created a publicly available dataset (named as WUT-ZTMiR) of human activities recorded in the office

Experimental Results
We conducted an experimental evaluation comparing our method to other methods: (a) the so-called "chance model", which randomly selects the time moments and makes the prediction for that time; we used its published code and followed the settings given in [17], (b) the method using a Hidden Markov Model (HMM), in which the hidden state sequences corresponding to the observations are considered, (c) the Linear Support Vector Machine (LSVM) method, which focuses on the transitions between the actions [42]. All methods require the ground truth progress to be known in the testing phase. We followed the settings given in [42] and tuned the parameters according to our needs. We achieved performance comparable to that reported in [17,42]. The proposed method was evaluated using testing video records. The observation ratio is defined as the ratio of the number of frames considered as observed to the total number of frames. Figure 13 compares our method with the other baselines using the two datasets described in Section 8.1. We applied test videos with different combinations of actions. The accuracy (prediction rate) is defined by Eq. 8 and the results are shown in Fig. 13.
In order to evaluate the interpretability of our method, we used the error matrix, also known as a confusion matrix (see Fig. 14). Note that Fig. 14b reveals a few errors; for example, closing was sometimes predicted as opening, the reason being that the range of both movements is small. Moreover, the placing action was predicted as pouring due to the problem with light-sensitive object recognition.
In the experiments, we found that the proposed method is generally capable of improving the high-level detection using joint reasoning. For example, a "closing microwave" video has an input action prediction accuracy of 48.9%. After joint reasoning, the output action prediction accuracy rose to 64.0%.

Performance Analysis
The evaluation of the proposed method was made using binary scores:
- $c_1$ is true when the system successfully identified an action that matches the real situation (ground truth),
- $c_2$ is true when the system rejected an action although in reality it is the actual action, i.e., the ground truth,
- $c_3$ is true when the system identified an action which does not match the real scenario.
In Eq. 9, the global scores $t_p$, $t_n$, $f_p$, and $f_n$ are evaluated as follows.
Here $t_p$, $t_n$, $f_p$, and $f_n$ are known as true positive, true negative, false positive and false negative, respectively. The above binary scores were used to evaluate the recognition accuracy of an unfinished action and its limitations. Following [10], the precision $P_r = \frac{t_p}{t_p + f_p}$, the recall $R_e = \frac{t_p}{t_p + f_n}$, and the F-score $F_c = \frac{2 P_r R_e}{P_r + R_e}$ were calculated; the results are summarized in Table 1.
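The three measures can be computed directly from the global counts, as in the following minimal sketch. The function name `precision_recall_fscore` and the sample counts are assumptions made for illustration; returning 0.0 when a denominator is zero is also a choice made here, not something prescribed by the definitions above.

```python
def precision_recall_fscore(tp, fp, fn):
    """Compute precision, recall and F-score from global counts.

    tp, fp and fn are the numbers of true positives, false positives
    and false negatives. A zero denominator yields 0.0 (assumption).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        fscore = 2 * precision * recall / (precision + recall)
    else:
        fscore = 0.0
    return precision, recall, fscore


if __name__ == "__main__":
    # Illustrative counts only, not taken from Table 1.
    p, r, f = precision_recall_fscore(tp=6, fp=2, fn=4)
    print(f"precision={p:.3f} recall={r:.3f} F-score={f:.3f}")
```

Note that the true negative count does not enter any of the three measures, which is why it is not an argument of the function.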
We analyzed the prediction of all the selected actions in the CAD-60 and WUT-ZTMiR datasets and observed at what stage each action was predicted. We defined two categories of action prediction according to the prediction stage: early prediction (EP) and late prediction (LP). Early prediction means that the action was predicted when no more than 30% of the video had been observed. Late prediction means that the action was predicted when more than 30% but less than 60% of the video had been observed. The results are shown in Table 2.
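The two categories can be expressed as a simple rule on the observation ratio at which the action was predicted. The sketch below is an illustration only; the function name `prediction_stage` is hypothetical, and returning `None` for ratios of 60% or more is an assumption, since the definitions above do not name that case.

```python
def prediction_stage(observation_ratio):
    """Classify a prediction by the fraction of the video observed so far.

    Returns "EP" (early prediction) for ratios up to 0.30, "LP" (late
    prediction) for ratios above 0.30 and below 0.60, and None otherwise.
    """
    if not 0.0 <= observation_ratio <= 1.0:
        raise ValueError("observation_ratio must be in [0, 1]")
    if observation_ratio <= 0.30:
        return "EP"
    if observation_ratio < 0.60:
        return "LP"
    return None  # neither early nor late according to the definitions


if __name__ == "__main__":
    for ratio in (0.20, 0.30, 0.45, 0.60):
        print(ratio, prediction_stage(ratio))
```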
Based on the results, it can be concluded that the proposed method performs well with partial observation (up to 60%) and is capable of making real-time predictions with our equipment. The data were collected at 60 frames per second using a 3.7 GHz Intel Core i7 computer with 16 GB of RAM and a 64-bit Linux operating system. The average prediction time varies from 0.18 s to 0.32 s, which is acceptable for real-time applications.
Figures 15 and 16 show the visual output of the human action prediction for both the offline and online datasets. A blue circle indicates a moving hand, and a yellow circle indicates that the hand is stationary. The red curve with a green and yellow outline around the black trajectory describes the possible future action. Figure 17 shows the predicted trajectories in 3D space (blue) against the ground truth trajectories (black) of a particular action, together with the training sample trajectories (violet).

Conclusions
In this paper, we addressed the problem of human action prediction. A scheme illustrating the main components of the method is given in Fig. 18. The concept of object affordance was applied to predict future actions. The most probable motion trajectory was used as the "kernel" for producing the heat-map representation of the area of expected trajectory disparity. The method was tested on both offline and online data. The results were quantified, and the method was validated as satisfactory. To select the possible actions, we considered probability functions based on normal distributions. The choice of such functions was justified; however, it would be interesting to investigate other possible distributions in the future. We also showed that it is important to model different properties (object affordances, temporal interactions, appropriate segmentation, etc.) in order to achieve good performance. In the future, we intend to study a wider range of actions in different environments and to extend the prediction process from selecting one action among several alternatives to predicting the chain of actions that constitute an activity.

Fig. 1 Selected pictures illustrating the action prediction: a available observation, b the final action (recorded afterwards; this is what should be predicted)

Fig. 2 General framework for human action prediction

Fig. 3 Camera settings for video recording: a top view, b side view

Fig. 4 Graphical illustration of the segmentation of an activity into actions. A human expert produced the groups of actions ($a_i$: the i-th action)

Fig. 5 Example of a temporal segmentation of an action

Fig. 6 Pictorial representation of human pose. The left image illustrates the RGB image (ground truth) with the detected "skeleton", and the right image shows the extracted sketch diagram representing the human body

Fig. 9 Circular plot of the proposed angular distribution for reaching an object. The figure is best viewed in color

Fig. Example images of drinking, placing, reaching actions from the CAD-60 dataset

environment under RGB-D settings, i.e. color plus depth, as shown in Fig. 11. The following activities are part of the dataset: drinking water, opening a door, object placing, etc. The Cornell Activity Dataset (CAD-60) is composed of 12 different activities (see Fig. 12), performed in 5 different environments: (a) office, (b) kitchen, (c) bedroom, (d) bathroom, and (e) living room. The activities are performed by 4 people: 2 males and 2 females. The dataset is a

$$\text{Accuracy} = \frac{\text{Number of correctly predicted actions}}{\text{Number of total actions}}$$

Fig. 15 Qualitative results of action prediction on the CAD-60 dataset. The figures show the predicted right-hand trajectories with heat maps. The actions shown are: a reaching, b pouring, and c closing. The figure is best viewed in color

Fig. 17 3D graphs of the ground truth and predicted trajectories of an action while performing a task. The actions are: a reaching, b pouring, and c drinking. The figure is best viewed in color

Table 1 Confirming the correctness of trajectory prediction on the WUT and CAD datasets: average precision, recall and F-score for the actions

Table 2 Early and late predicted actions in both the WUT and CAD datasets