
1 Introduction

The astonishing stature that separates humans from other biological entities is the inherited capability of learning. Our ancestors were able to discover what firewood, coal and other objects afford, not only for lighting fires but also for cooking. In plain terms, the tacit knowledge of the usability of different objects is an integral part of the success story of the human race. In the domains of computer vision and artificial intelligence it is an important topic of research. Although it sounds simple, the research in this field is quite multidimensional and encompasses both fine-level and high-level layers. For example, in practical robotics the primary focus is on identifying the fine-level usability of objects (i.e. rollability, graspability etc.), whereas in computer vision the focus is on higher-level usage, and researchers are more inclined to detect high-level uses of objects (i.e. how a human performs an interaction with a computer). The ways these two streams of research are approached are often disjoint. In this paper we portray the benefits of combining techniques from these two schools. In broader terms, we use an object's basic visual features (SIFT, textons, edges, color histograms etc.) to infer higher-level attributes such as its material, shape, size and visual parts, and from these we infer the object's usability. We also boost the detection of an object's usability by using different contexts, such as human demonstration (i.e. body poses) and ambient objects (i.e. the effects one object induces on others). We believe this integration of attributes and contexts makes the detection process more robust, semantic and general.

1.1 Psychological Perspective of Affordance

The theory of affordances [1] was introduced as a theory of direct perception that can account for findings in developmental psychology. According to [2, 3], an affordance is an intrinsic property of an object; in a broader sense, affordance is the functional classification of objects. Affordance is neither purely subjective nor purely objective: it depends on the object being interacted with, the human who interacts with it and the ambient objects. For example, a chair affords sitting for an adult, but for a toddler it does not have the sitting affordance; rather, it has the climbing affordance. On the other hand, when we see a mug alone we infer its affordance as drinkable, but as soon as we see a pitcher on top of it, its affordance space accumulates the pourable affordance as well. Hence it is evident that the affordance of an object is basically a mapping of these three contexts.

1.2 Ontological Classification of Affordance Detection Techniques

Generally, affordance prediction has been approached from two different perspectives: firstly, methods that learn affordances passively by observing humans interacting with objects [4-6], and secondly, methods that use an object's visual features (appearance) to learn its affordances [7, 8]. Usually the first category of work focuses on high-level affordance detection, whereas the latter concentrates on finer-level affordances. These conventional views of affordance detection are now changing, and there is a new trend in which researchers combine both human actions and object perception for affordance detection [9-13]. These mixed approaches are primarily used for higher-level affordance detection [4, 9-12, 14-16]. Apart from robust affordance detection, these blended techniques are emerging as an adequate tool for solving different problems in the computer vision domain, such as classifying object categories [9, 10, 17], scene understanding [18-20], segmenting sub-activities from a continuous high-level activity [21], robot navigation [22], robot-object placement [23], anticipation of human actions [24-26] etc.

There is also a variant ontological perspective within the fine-level affordance modelling approaches: visual features [7, 8] versus the physical attributes of the objects [27-29]. In the visual-feature based approaches, fine-level features such as corner points (SIFT, HOG), edges, texture (textons), colour (histograms) etc. are extracted from the objects and directly mapped to affordances. In contrast, in the attribute-based approaches the fine-level visual features are used to predict mid-level physical attributes [27] such as size, shape, material and weight. A key advantage of attribute-based detection is the ability to leverage object properties that are shared by multiple affordances, leading to more effective generalization to novel examples and the ability to learn new affordances with limited training data.

1.3 Challenges in Affordance Detection

There are fundamental difficulties in both of the above-mentioned approaches. Regarding the direct perception approaches, there are three major issues. Firstly, affordances are not actually determined, in the physical sense, by visual features but rather by the physical properties of the objects [30]: whether an object can roll depends on its shape, and whether it can be pushed is influenced by its material properties. Secondly, visual features are very vulnerable to different imaging and viewing phenomena. Thirdly, a liability of direct-perception based methods is that there is no form of knowledge transfer between object classes. In contrast, a problem with the approaches based on human demonstration (both with and without considering the objects) is that a human can perform the same action with different objects (mopping a table, for instance, involves a body pose similar to ironing). Moreover, a single object can have multiple affordances, so the system must be trained with each action-object pair; consequently the training process becomes very complex and lacks generalization (systems usually suffer when an unseen object or body pose is considered). Another challenge in demonstration-based affordance detection is that affordance depends on the attributes of the person; it will not remain the same for all humans with the same object. For example, if the height of the human changes, then the possible actions that can be performed with a certain object will vary. Even the attributes of the objects and the ambient environment (other objects nearby) influence the affordance of an object.

2 Overview and Contribution of Our Approach

The core competency of our model is that it takes into account the mutual contexts of the attributes of the object, the human and the ambient environment. We believe the affordance detection process for an object can improve substantially by considering the mutual relations of these contexts and their attributes. For example, a knife has the primary affordance of cutting, but if the knife is made of plastic it does not afford cutting harder objects; here a change in the attributes changes the affordance space of the object. Similarly, if we see a human performing a stirring action with a knife, then the affordance space of the knife changes again and accumulates stirring, showing that the body poses of the human help us to infer the affordance of an object. Again, if we see a knife near a food can or a biscuit tin, it may afford opening them; here the ambient object induces the change in the affordance space of the object. Especially for static images, where, unlike video, no temporal references are available, this process of mutual context analysis helps the system to build a knowledge base and detect affordances more robustly. Furthermore, our approach represents these contexts as sets of different attributes rather than as stand-alone entities, which makes the system more semantic, efficient, dynamic and general. For instance, in line with the previous example, we do not detect/classify the object as a knife; rather we describe it as a rectangular metal object with sharp edges. We describe the objects with different attributes according to [30]. Simultaneously, we also describe the mutual relations of the objects with the human and the other ambient objects by a number of attributes. This attribute-based representation has been used for unseen object class detection [29], and the authors claim that the method possesses a knowledge-transfer mechanism and helps to recognize unseen and untrained objects. We have found that attributes are also shared across the spectrum of different affordance classes. For instance, most of the objects that afford drinking (i.e. mug, cup, bottle, flask etc.) are cylindrical in shape and may have a handle (i.e. a mug handle).

Fig. 1.

Opening a jar (top row), opening a poly (2nd row), opening a drinks can (3rd row), opening a packet (4th row) and opening a tin with a knife (bottom row).

In this paper, we portray the importance of attributes in detecting human-object interactions robustly. For instance, in Fig. 1 we consider object-opening actions. We have multiple opening scenarios: opening a can, opening a packet of potato chips, opening a flask, opening a box etc. We analysed these and inferred that different object classes lead to different opening body poses because of differences in the attributes of the objects. At the same time, the attributes related to the human and the ambient objects are also important. Our work is inspired by the works of [27] and [31]. The main focus of this research is to combine the different notions of affordance modelling in order to achieve a robust affordance model. We use the visual features to predict mid-level physical attributes of the objects, as well as of the human and the environment (the other nearby objects). After that, we use the physical attributes as the features for learning (both the parameters and the structure of) our high-level affordance detection model.

Our novel attribute-based affordance model encompasses two types of features related to the human, the object and the environment in order to model object affordances, namely visual features and physical attributes [27]. The visual features are the basic image features extracted from the images.

The Visual Features that We have Considered are:

  • For the Objects: SURF (speeded-up robust features) features, HOG (histograms of oriented gradients) features, Edges, Textons, Region properties of bounding boxes, Image histograms, Euclidean distances between multiple objects.

  • For the Subjects: Human body joint coordinates from the Kinect, the angles between the shoulder-arm-wrist (for both the left and right hands).

After extracting the visual features, we create multiple classifiers to classify the physical attributes related to the objects, the human and the ambient objects.

The Physical Attributes that We have Considered are:

  • For the Objects: Material, Aspect ratio, Height, Shape, Color, Orientation.

  • For the Human: Body poses, angles of the arms.

  • For Human-Object: The distance of the object(s) from each body joint.

  • For Object-Object: Euclidean distance between multiple objects, the spatial location of objects relative to other objects, relative aspect ratio of multiple objects.

The flow of our system is as follows. First, given images of a human interacting with different objects, we select the bounding boxes of the object(s). We then extract the base features from the selected bounding boxes (objects). We also extract the body joint coordinates of the human and the angles of the arms. Then we use these base features to train mid-level attribute classifiers. Subsequently, we use these mid-level attributes as the features of our overall affordance model. At test time, given the bounding boxes (the user provides the bounding boxes), the system can detect the affordances of the selected objects more semantically and robustly.
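The sketch below illustrates this flow in Python with scikit-learn-style classifiers. The feature choices (HOG plus color histograms) and all function names are illustrative stand-ins rather than the exact implementation, and the human and ambient attributes are assumed to be computed separately and passed in.

```python
# A minimal sketch of the described flow, assuming scikit-learn-style
# classifiers. HOG + color histograms stand in for the full base feature
# set; human and ambient attributes are assumed to be precomputed.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def object_base_features(image, box):
    """Base visual features for one bounding box (illustrative subset)."""
    x, y, w, h = box
    crop = resize(image[y:y + h, x:x + w], (128, 64))   # floats in [0, 1]
    hog_vec = hog(crop, channel_axis=-1)
    hist = np.concatenate([np.histogram(crop[..., c], bins=32, range=(0, 1))[0]
                           for c in range(3)])
    return np.concatenate([hog_vec, hist])

def detect_affordance(image, box, human_attrs, ambient_attrs,
                      attribute_clfs, affordance_model):
    """Bounding box -> base features -> mid-level attributes -> affordance."""
    f = object_base_features(image, box)
    # Mid-level object attributes, integer-coded (material, shape, size, ...).
    obj_attrs = [clf.predict([f])[0] for clf in attribute_clfs]
    # The affordance model sees attributes of object, human and environment.
    evidence = np.concatenate([obj_attrs, human_attrs, ambient_attrs]).astype(float)
    return affordance_model.predict([evidence])[0]
```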

3 Attribute Based Affordance Model

Our affordance model can be formalized by the following statements:

  • The affordance space is \(\lambda \), an \(m\)-dimensional vector.

  • The object's visual features are \(\theta \), a \(t\)-dimensional vector.

  • The object's physical attributes are \(\alpha \), a \(p\)-dimensional vector.

  • The body pose features are \(\beta \), a \(q\)-dimensional vector.

  • The human's physical attributes are \(\gamma \), an \(r\)-dimensional vector.

  • The ambient environment attributes are \(\varepsilon \), an \(s\)-dimensional vector.

Then, we can formalize the model as:

$$\begin{aligned} \lambda = f(\alpha ,\beta ,\gamma , \varepsilon ,\theta ) \end{aligned}$$
(1)

So, if we want to represent the relations of these components as a joint distribution, the chain rule gives:

$$\begin{aligned} p(\lambda ,\alpha ,\beta ,\gamma , \varepsilon ,\theta )&= p(\lambda \mid \alpha ,\beta ,\gamma , \varepsilon ,\theta )p(\alpha \mid \beta ,\gamma , \varepsilon ,\theta ) p(\beta \mid \gamma , \varepsilon ,\theta )p(\gamma \mid \varepsilon ,\theta ) p(\varepsilon \mid \theta )p(\theta )\end{aligned}$$
(2)
which, using the conditional independencies encoded in the Bayesian network of Fig. 2, simplifies to:
$$\begin{aligned} p(\lambda ,\alpha ,\beta ,\gamma , \varepsilon ,\theta )&= p(\lambda \mid \alpha ,\gamma , \varepsilon ) p(\alpha \mid \theta )p(\gamma \mid \beta )p(\varepsilon \mid \beta ,\theta ) p(\beta ) p(\theta ) \end{aligned}$$
(3)

So, to find the affordance we marginalize the joint distribution over the remaining variables (in practice, observed attributes are entered as evidence rather than summed out), which we compute with the variable elimination method:

$$\begin{aligned} p(\lambda )=\sum _{\alpha }\sum _{\beta }\sum _{\gamma }\sum _{\varepsilon }\sum _{\theta }p(\lambda ,\alpha ,\beta ,\gamma , \varepsilon ,\theta ) \end{aligned}$$
(4)
Fig. 2.

The Bayesian network representation of the proposed model.

Currently, we have implemented our attribute-based affordance model as a Bayesian network (Fig. 2). For learning the network structure we have compared two approaches: (1) score-based search using the Bayesian Information Criterion and (2) greedy hill climbing (optimization). For inference we have used the junction tree algorithm. Apart from the Bayesian network, we have also implemented our model with a multi-class SVM and the K-nearest neighbor algorithm; in the case of KNN, we have tested the model with Euclidean, cityblock and Minkowski distances. We have used N-fold cross-validation to validate the model.
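A minimal sketch of this setup is shown below using the pgmpy library (the library and the column/attribute names are assumptions made for illustration; the text does not name a toolkit). It learns the structure by greedy hill climbing scored with BIC, fits the conditional probability tables, and answers MAP queries through belief propagation on the junction tree.

```python
# Structure learning and junction-tree inference sketch with pgmpy
# (library and column names are assumptions made for illustration).
import pandas as pd
from pgmpy.estimators import BicScore, HillClimbSearch, MaximumLikelihoodEstimator
from pgmpy.models import BayesianNetwork
from pgmpy.inference import BeliefPropagation

def learn_affordance_network(data: pd.DataFrame) -> BayesianNetwork:
    """Greedy hill climbing scored with BIC, then CPD estimation."""
    dag = HillClimbSearch(data).estimate(scoring_method=BicScore(data))
    model = BayesianNetwork(dag.edges())
    model.fit(data, estimator=MaximumLikelihoodEstimator)
    return model

def most_likely_affordance(model: BayesianNetwork, evidence: dict):
    """MAP query over the affordance node on the junction tree."""
    bp = BeliefPropagation(model)  # builds and calibrates the junction tree
    return bp.map_query(variables=['affordance'], evidence=evidence)['affordance']

# Hypothetical usage: columns are discretized attribute values + 'affordance'.
# model = learn_affordance_network(train_df)
# most_likely_affordance(model, {'material': 'metal', 'shape': 'cylindrical'})
```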

3.1 Attribute Classifiers

As stated earlier, the mid-level attributes are classified from the base features. We have implemented a separate classifier for each of the mid-level physical attributes.

Parts Classification. We have introduced a novel physical attribute called Parts: distinct image patches of objects that are common to all objects within a single affordance class. Different object classes can share a single affordance, and we argue that although these object classes are dissimilar in visual appearance, they do share some common parts. For example, objects which afford drinking or pouring usually have visual patches of a handle. These parts have proved to be a robust cue in our affordance detection model. For part-class detection, we first manually selected distinct parts of different objects (5 parts per affordance class) that share a common affordance, and then cropped these parts (patches) out of the object images (we have used 750 patches for each part class). In Fig. 3, different selected parts of the sitting affordance class are shown. These cropped patches are finally used as training data for our part-class classifier. We have trained the classifier with the bag-of-features algorithm, testing vocabulary sizes from 1000 to 4000 in steps of 500 and finally setting the vocabulary size to 1500 (1500 clusters), since it gave us the highest accuracy. The patch sizes were set to [64 128 192 256] for optimal efficiency. Finally, a multiclass SVM is used as the classification algorithm. The grid points (SURF points) were selected densely for the bag-of-features algorithm. For the part classification, we have achieved 71 % accuracy.
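A condensed sketch of such a bag-of-features part classifier is given below: densely sampled local descriptors are quantized into a 1500-word vocabulary and a multiclass SVM is trained on the resulting histograms. SIFT descriptors stand in here for SURF (an assumption, since stock OpenCV builds ship SIFT but not SURF), and the grid step and patch handling are illustrative.

```python
# Bag-of-features part classifier sketch: dense descriptors -> 1500-word
# vocabulary -> histogram -> multiclass SVM. SIFT stands in for SURF.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC

VOCAB_SIZE = 1500
sift = cv2.SIFT_create()

def dense_descriptors(gray, step=8, size=16):
    """Descriptors on a dense grid of keypoints (gray: uint8 grayscale patch)."""
    kps = [cv2.KeyPoint(float(x), float(y), float(size))
           for y in range(step, gray.shape[0] - step, step)
           for x in range(step, gray.shape[1] - step, step)]
    _, desc = sift.compute(gray, kps)
    return desc

def fit_vocabulary(patches):
    """Cluster all training descriptors into the visual vocabulary."""
    descs = np.vstack([dense_descriptors(p) for p in patches])
    return MiniBatchKMeans(n_clusters=VOCAB_SIZE, random_state=0).fit(descs)

def bof_histogram(gray, vocab):
    """Normalized histogram of visual-word assignments for one patch."""
    words = vocab.predict(dense_descriptors(gray))
    hist, _ = np.histogram(words, bins=VOCAB_SIZE, range=(0, VOCAB_SIZE))
    return hist / max(hist.sum(), 1)

# Training on the cropped part patches and their part-class labels:
# vocab = fit_vocabulary(patches)
# clf = LinearSVC().fit([bof_histogram(p, vocab) for p in patches], labels)
```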

Fig. 3.

The common parts in diverse objects that afford sitting.

Material Classification. For material detection (what the object is made of), we have extracted SURF points, HOG features, textons [32] and image histograms from the object images. These features are subsequently given as inputs to a K-nearest neighbor classifier to detect the material. We have considered the material types paper, metal, plastic, poly, food, glass and cloth. We have tested and compared our classifier with [32] and [33], where textons and fractals are used, and found that our classifier is more suitable for detecting the materials of real-life objects, which carry a lot of labelling and undesired interest points. Although [32] performs better on basic surface-texture images of different materials, it loses accuracy on real-life objects. For the material classification, we have used images of each object class (1200 images per object class).

Aspect Ratio and Height. The aspect ratio is a popular measurement that gives a cue about the size of an object with some degree of scale invariance. We calculate the aspect ratio as the width over the height of the selected bounding box, while the height is simply the vertical height of the selected bounding box. We use the measurements of the object bounding boxes as features for our aspect ratio and height classifiers, where a multiclass SVM is used for training and classification.

Shape Classification. Shape is a very prominent attribute of an object: most of the time, objects which share the same affordances have their shape in common. We classify shapes as square, cylindrical, round and 3D-boxy. We first extract the edges of the objects with a Prewitt edge-detector filter, then perform some morphological operations on the edges and compute the Hough matrix. Finally, a curve-fitting algorithm is used to find similarities in shape. We have compared our algorithm with [34] and observed that its accuracy is somewhat higher than ours, but since the difference is not substantial and our algorithm is less complex, we have retained it. Currently, our shape-detection classifier's accuracy is 78 %.
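The following sketch mirrors these steps (Prewitt edges, morphological clean-up, Hough transform). As a simplification it summarizes the Hough accumulator into a fixed-length feature vector for a downstream classifier instead of the curve-fitting step, and the edge threshold is a guessed value.

```python
# Shape attribute sketch: Prewitt edges -> morphological closing -> Hough
# transform -> fixed-length feature vector for a classifier. The threshold
# and the accumulator summary replace the curve-fitting step for brevity.
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import prewitt
from skimage.morphology import binary_closing
from skimage.transform import hough_line

SHAPES = ['square', 'cylindrical', 'round', '3d-boxy']

def shape_features(rgb_crop):
    """Edge density plus the distribution of edge energy over line angles."""
    edges = prewitt(rgb2gray(rgb_crop)) > 0.05      # threshold is a guess
    edges = binary_closing(edges)                   # close small contour gaps
    accumulator, _, _ = hough_line(edges)
    angle_profile = accumulator.sum(axis=0).astype(float)
    angle_profile /= max(angle_profile.sum(), 1.0)
    return np.concatenate([[edges.mean()], angle_profile])

# A multiclass SVM (e.g. sklearn.svm.SVC) can then map these vectors to SHAPES.
```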

Fig. 4.

The process of shape detection.

Color and Orientation Classification. For color classification, we use simple histograms of the object images as features; the histograms of the red, green and blue channels are used. The KNN algorithm is used to implement the classifier.

For orientation, we initially used rotated patches of the object images (90° variations) and trained an SVM-based object classifier to distinguish horizontal and vertical versions of the objects. However, this classifier did not perform optimally because of the varied traits of object poses. Moreover, object detection itself is a substantial challenge and increases the complexity of the overall system significantly. For the time being, as the main focus of the current work is to depict the effect of attributes on overall affordance detection, we manually input the orientation values of the objects into our final affordance classifier.

Body Pose Classifier. To implement a robust body pose classifier, we first had to identify and segment the human body in a cluttered scene and then acquire the body joint locations. We first tried a simple part-based method [35], in which weak classifiers are trained with HOG features to detect and track the body parts, but the results were not optimal. We therefore used the Microsoft Kinect sensor to capture RGB-D images and obtained the articulated human skeleton from the Kinect SDK. The Kinect skeleton viewer function, part of the Kinect SDK support package, robustly provides the coordinates of 20 body joints of the detected human body. We consider only the 10 joints of the upper body (shoulder center, head, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hand, right hand) and use these coordinates as the base features for action pose detection. Our novel action pose classifier is inspired by the concept of [34]: we represent the body poses not by the raw coordinates of the body joints but by the distance of each body joint from the head. This representation helps to offset viewpoint and translation variance to some extent. For the classification we use the K-NN algorithm.

For the human action pose classifier, we also use the inner angles of the elbows as base features; vector dot products are used to determine these angles (Fig. 5).
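A small sketch of these pose features is shown below: head-relative joint distances plus the two inner elbow angles computed from dot products. The joint names follow the Kinect SDK skeleton, and the dictionary-based skeleton format is an assumption for illustration.

```python
# Pose feature sketch: head-relative distances of the 10 upper-body joints
# plus both inner elbow angles from vector dot products. `skeleton` is
# assumed to map Kinect joint names to 3-D coordinates.
import numpy as np

UPPER_BODY = ['ShoulderCenter', 'LeftShoulder', 'RightShoulder', 'LeftElbow',
              'RightElbow', 'LeftWrist', 'RightWrist', 'LeftHand', 'RightHand']

def elbow_angle(shoulder, elbow, wrist):
    """Inner elbow angle (radians) between the upper arm and the forearm."""
    u = np.asarray(shoulder, float) - np.asarray(elbow, float)
    v = np.asarray(wrist, float) - np.asarray(elbow, float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def pose_features(skeleton):
    """Head-relative joint distances + elbow angles (offsets view/translation variance)."""
    head = np.asarray(skeleton['Head'], float)
    dists = [np.linalg.norm(np.asarray(skeleton[j], float) - head) for j in UPPER_BODY]
    angles = [elbow_angle(skeleton['LeftShoulder'], skeleton['LeftElbow'],
                          skeleton['LeftWrist']),
              elbow_angle(skeleton['RightShoulder'], skeleton['RightElbow'],
                          skeleton['RightWrist'])]
    return np.array(dists + angles)

# These vectors are then classified with K-NN (e.g. sklearn's KNeighborsClassifier).
```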

Fig. 5.

Detected skeletons in different actions.

Human-Object Distance Classifier. For the human-object distance attribute classifier, we use the Euclidean distances between the object centroid and the human body joints (skeleton joints acquired by the Kinect) as base features. A multiclass SVM is used for the classification.

Relative Aspect Ratio and Relative Spatial Location Classifier. Relative aspect ratio and relative spatial location are attributes used only in the case of multiple objects. The relative aspect ratio compares the aspect ratio of one object with that of another; we have found that it gives a useful insight into an object's affordance in a multiple-object setting. For instance, in a pouring action, most of the time the larger object is the 'pour from' object and the smaller object is the 'pour to' object. For the relative spatial location, we decompose each image frame into nine cells: center, above, bottom, left, right, upper left, upper right, bottom left and bottom right. We index the location of each object by these cells and use the indices as base features.
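The sketch below computes these object-object attributes from (x, y, width, height) bounding boxes: relative aspect ratio, the nine-cell spatial index and the centroid distance. The box format and the cell labels are illustrative assumptions.

```python
# Object-object attribute sketch for (x, y, width, height) bounding boxes:
# relative aspect ratio, the nine-cell spatial index and centroid distance.
import numpy as np

CELLS = ['upper left', 'above', 'upper right',
         'left', 'center', 'right',
         'bottom left', 'bottom', 'bottom right']

def centroid(box):
    x, y, w, h = box
    return np.array([x + w / 2.0, y + h / 2.0])

def relative_aspect_ratio(box_a, box_b):
    """Aspect ratio of object A divided by that of object B."""
    return (box_a[2] / box_a[3]) / (box_b[2] / box_b[3])

def spatial_cell(box, image_shape):
    """Index the object's centroid into one of the nine image cells."""
    h, w = image_shape[:2]
    cx, cy = centroid(box)
    col = min(int(3 * cx / w), 2)
    row = min(int(3 * cy / h), 2)
    return CELLS[3 * row + col]

def object_distance(box_a, box_b):
    """Euclidean distance between the two object centroids."""
    return float(np.linalg.norm(centroid(box_a) - centroid(box_b)))
```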

4 Training and Inference

For training the affordance classifier, we have used 9632 images of different actions being performed. There are 22 action classes performed with 43 objects, and 4 subjects (persons) performed the actions. The action classes are: (1) spraying on the body, (2) chopping, (3) cutting, (4) drinking with both hands, (5) drinking with a single hand, (6) eating snacks, (7) eating fruit, (8) ironing, (9) mopping, (10) opening a poly, (11) opening a box, (12) opening a can, (13) opening a jar, (14) opening a packet, (15) opening a tin, (16) pouring with both hands, (17) pouring with a single hand, (18) spraying in the air, (19) stacking with both hands, (20) stacking with a single hand, (21) waving and (22) answering a mobile phone.

For the training of the attribute classifiers (material, shape, color and parts), we have used features extracted from object images from different datasets, such as Caltech 256 and SHORT-100, as well as from images downloaded from the web. For testing our model, we have used human-object-interaction images from known object classes (the affordance classes on which the model is trained) and also from novel object classes. The test dataset also contains instances where the objects are partially occluded and the human body poses are unknown.

5 Model Evaluation

We have tested our model on a test dataset of 3 subjects performing 22 actions with 18 objects; the test dataset contains 528 instances in total. We initially implemented our model with SVM, KNN and Bayesian networks to find the most suitable algorithm (pilot testing); because it gave the best empirical results, a Bayesian-network based method is used for the final affordance model. For comparing these three algorithms, a prototype test dataset was used which is different from the actual test dataset.

We have compared our model with two baseline models. For baseline (a), we tested models which use only the human body pose as features for human-object-interaction detection; for baseline (b), models which use the mutual contexts of human body poses and detected object classes for affordance detection.

Fig. 6.

The comparison of our model with the baselines.

Figure 6 shows the comparative results of our attribute-based affordance model against the baselines. The overall accuracy of our model is 67.85 %, obtained by testing the model with both known and unseen object classes. It shows that accuracy and generalization improve substantially with our model: the overall accuracies of the baselines are 61.18 % (objects and body poses) and 56.3 % (body pose only).

6 Conclusion

In contrast with current affordance detection models in the computer vision and robotics domains, we have implemented our model by considering the mutual contexts of the human, the object and the ambient environment. Moreover, we have represented each context with a cluster of attributes. Owing to the inclusion of multiple contexts and the knowledge-sharing capability of the attributes, our model proved to perform more efficiently and semantically, and to generalize better.