1 Introduction

Equipping computers with the ability to understand the physical world is an important goal of artificial intelligence [2, 8]. In recent years, we have moved closer to reaching it thanks to the rise of large pre-trained transformer-based models. These models may be trained using the language model objective, which requires them to predict the next word in a given sequence or to guess a masked word in a given text passage. Because they are trained on large textual corpora, these models learn world-related knowledge that helps them choose the right word.

However, a subset of knowledge called commonsense knowledge is not explicitly stated in texts written by humans. Consider, for instance, the presupposition [18] "a trolley is light enough to be pushed by a person", which is related to the statement "somebody pushed a trolley". Because some ideas are obvious to us and we expect everyone to be aware of them, we usually do not write about them. This gap is especially pronounced for commonsense physical knowledge, on which we concentrate in this paper. It could be problematic, e.g., when language models are used in embodied agents that need to interact with the physical world.

Fig. 1. An example of a question requiring affordances. We intuitively know that plugging the device into a socket is not enough to turn it on.

To address this issue, attempts have been made to formalize commonsense knowledge. A promising idea is to inject that formalized knowledge into pre-trained language models so that they can understand our world better. In this work, we utilize the notion of affordances, i.e., relationships between agents and the environment that denote the actions applicable to objects, based on the properties of the objects (e.g., whether an object is edible or climbable) [8] (Fig. 1). We extract knowledge about affordances from a knowledge graph to enrich the knowledge of popular pre-trained models. This paper’s primary research question is whether injecting commonsense knowledge concerning affordances into pre-trained language models improves physical commonsense reasoning.

2 Related Work

Ilievski et al. [11] grouped commonsense knowledge into dimensions to verify which of them actually impact models and concluded that the temporal, goal, and desire dimensions are important for the models tested. On the other hand, the set of actions that a given object can perform in a given environment is used in visual intelligence in the context of classification and labelling [28]. The authors focused on images, unlike our natural language-oriented work.

Commonsense reasoning in this paper is understood as the ability to make assumptions about ordinary situations that humans encounter in their daily lives. Among the datasets that relate to this concept, there are ones that deal with multiple-choice questions [24, 27]. In this paper, we use PIQA [2], which is a recent dataset focused on physical commonsense. The authors of the dataset show that current pre-trained models struggle with answering the questions collected in PIQA, since they cover knowledge that is rarely described explicitly in text (e.g., one has to choose whether a soup should be eaten using a fork or a spoon).

Some popular approaches to solving tasks that require commonsense knowledge use GPT [7] or BERT-like models such as BERT [6], RoBERTa [14], ALBERT [12], or DeBERTa [10]. As they all follow the language model training objective, we expect them to have some world-related knowledge. A fine-tuned GPT model [3] achieved 82.8% accuracy on PIQA. Fine-tuning such a model on another task seems to improve its performance consistently [19]. Recently, however, a DeBERTa-based model took the lead, achieving 83.5% accuracy on the leaderboard. There are also PIQA baselines based on BERT [2], but they score lower than DeBERTa and RoBERTa, which seem to be better in terms of commonsense and overall performance on the aforementioned datasets, especially with highly optimized training hyperparameters [14]. It also appears that attention heads do capture the commonsense knowledge that is encoded in graphs [5]. Moreover, UNICORN, a universal commonsense reasoning model based on T5 (roughly 2 times bigger than BERT) and trained on a new multitask benchmark of which PIQA is a part, achieved 90.1% accuracy [15].

More specialized solutions include external resources that are used for fine-tuning or enriching the model output, such as graphs whose labeled edges represent interactions between actors [22] and relations between causes and effects [20]. Evaluations using such resources include querying a model for additional information [21] or combining data with graph knowledge in BERT models for classification [17]. There are also works that aim to redefine the distance between words using graphs [16], as well as generative data augmentation, which seems to be a kind of adversarial training [26]. Recently, it was shown that adapter-based knowledge injection into a BERT model [13] improves the quality of solutions requiring commonsense knowledge.

3 Affordances

The notion of affordances was introduced by Gibson [9] to describe relations between the environment and its agents (e.g., how humans influence the world). This relationship between the environment and an agent forms the potential for an action (e.g., humans can turn on a computer). Affordances help in studying perception, understood as the awareness of the possibility of performing certain actions, which is tied to the agent’s perception of the world. Being possibilities of action, affordances are very natural for humans. This intuitive knowledge may be underrepresented in internet-based textual corpora, while in some domains, such as robotics [1], one of the key reasoning tasks is inferring the affordances of objects (the possible actions that a robotic agent can accomplish with a given object at hand).

For our use case, we can introduce several restrictions that may help to identify affordances: (i) An affordance must explain some kind of relation between two agents or concepts. This means it needs to touch on how those two items interact with or influence each other. (ii) An affordance cannot be a physical connection. An affordance is a metaphysical concept (a possibility of action) that connects two items. Thus, a cable connecting two computers is not an affordance. (iii) An affordance cannot be a synonym. While synonyms are connected by definition, the goal of an affordance is to explain how an agent connects to its counterpart in our world, not simply to state that they mean the same. (iv) An affordance cannot be a relationship based on negation. There are many concepts in the world that have some sort of relation. However, an affordance must in some way impact or be able to affect one of the agents.

4 Datasets

In this work, we use two datasets: PIQA and ConceptNet. PIQA, or “Physical Interaction - Question Answering”, is a dataset of goals, each provided with two possible answers (hereafter referred to as solutions). Only one of them is correct, and choosing it requires some physical commonsense knowledge. For example, when asked how to eat a soup, our model should know that we want to use a spoon instead of a fork. PIQA is divided into train, validation, and test sets.
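To make the structure of a PIQA item concrete, the following is a minimal sketch of one goal with its two solutions. The class name, the field names (`goal`, `sol1`, `sol2`, `label`), and the example wording are assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiqaExample:
    """One PIQA-style item (field names are illustrative assumptions)."""
    goal: str   # the goal / question text
    sol1: str   # first candidate solution
    sol2: str   # second candidate solution
    label: int  # index of the correct solution: 0 -> sol1, 1 -> sol2

    def correct_solution(self) -> str:
        """Return the text of the solution marked as correct."""
        return self.sol1 if self.label == 0 else self.sol2


# Hypothetical item paraphrasing the soup example from the text.
example = PiqaExample(
    goal="How do I eat soup?",
    sol1="Use a fork.",
    sol2="Use a spoon.",
    label=1,
)
print(example.correct_solution())
```

Storing the label as an index keeps the two solutions symmetric, which matches the description of each goal having exactly one correct answer.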

Fig. 2. Input differences between experiments in the architecture of the solution.

ConceptNet is a knowledge graph proposed to represent the general knowledge involved in understanding language, allowing applications to better understand the meanings behind words [23]. It is based on data sources such as WordNet, OpenCyc, and Wikipedia. Among all the properties provided in the graph, we chose the ones that match the affordance requirements defined in Sect. 3. These are: CapableOf, UsedFor, Causes, MotivatedByGoal, CausesDesire, CreatedBy, ReceivesAction, HasSubevent, HasFirstSubevent, HasLastSubevent, HasPrerequisite, MadeOf, LocatedNear, and AtLocation.
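The property selection above can be sketched as a simple filter over graph edges. This is only an illustration: the representation of an edge as a `(subject, relation, object)` triple, the function name, and the toy edges are assumptions, and the sketch does not query ConceptNet itself.

```python
# Relation names copied from the list of selected properties above.
AFFORDANCE_RELATIONS = frozenset({
    "CapableOf", "UsedFor", "Causes", "MotivatedByGoal", "CausesDesire",
    "CreatedBy", "ReceivesAction", "HasSubevent", "HasFirstSubevent",
    "HasLastSubevent", "HasPrerequisite", "MadeOf", "LocatedNear",
    "AtLocation",
})


def select_affordance_edges(edges):
    """Keep only (subject, relation, object) triples with a selected relation."""
    return [edge for edge in edges if edge[1] in AFFORDANCE_RELATIONS]


# Hypothetical toy edges, not real ConceptNet output.
toy_edges = [
    ("spoon", "UsedFor", "eating soup"),
    ("spoon", "IsA", "cutlery"),
    ("spoon", "AtLocation", "kitchen drawer"),
    ("spoon", "Synonym", "ladle"),
]
selected = select_affordance_edges(toy_edges)
print(selected)
```

In this toy run the `IsA` and `Synonym` edges are dropped, in line with restriction (iii) in Sect. 3 that an affordance cannot be a synonym.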

5 Method

To inject the knowledge extracted from the ConceptNet graph, we need to identify appropriate subjects of the properties listed in Sect. 4 so that the objects related to a given subject via one of the selected properties may serve as affordances. To achieve this goal, we extract keywords for each question and its possible answers from PIQA using YAKE [4]. The keywords found are then linked to ConceptNet. However, if none of the chosen properties is found in the context of a linked entity, we use a definition from Wiktionary [25] as a fallback. The selected affordances are then passed to a model as part of the input representing a question and answer pair. The affordance (or a definition from Wiktionary) is tokenized and placed after the last [SEP] marker, following the input scheme: [CLS] QuestionTokens [SEP] SolutionTokens [SEP] AffordancesOrDefinitionsTokens. Such an approach is in line with the original experiments with PIQA presented in [2], where each question-solution pair is similarly processed independently and the embedding of the [CLS] token, which represents the whole context, is processed by a single feedforward classification layer. We utilize the same approach, simply adding affordances to the input so that the [CLS] token is aware of them (Fig. 2). With such preprocessed input, each of the base models is fine-tuned on the training set, and the results are then obtained on the validation set of PIQA. Preprocessing is done before training begins and is therefore the same for both sets of data.
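The input scheme and the Wiktionary fallback described above can be sketched at the string level as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the special tokens are written out literally (a real tokenizer would normally insert them itself), and keyword extraction and ConceptNet linking are assumed to have already produced the affordance sentences.

```python
from typing import Optional, Sequence


def pick_knowledge(affordances: Sequence[str], definition: Optional[str]) -> str:
    """Use affordances when any were found, else fall back to a definition."""
    if affordances:
        return " ".join(affordances)
    return definition or ""


def build_input(question: str, solution: str, knowledge: str) -> str:
    """Lay out one question-solution pair following the scheme in the text."""
    return f"[CLS] {question} [SEP] {solution} [SEP] {knowledge}".strip()


# Hypothetical values for illustration only.
knowledge = pick_knowledge(["A spoon is used for eating soup."], None)
pair_1 = build_input("How do I eat soup?", "Use a fork.", knowledge)
pair_2 = build_input("How do I eat soup?", "Use a spoon.", knowledge)
print(pair_1)
print(pair_2)
```

Each of the two strings would be encoded and scored independently, mirroring the per-pair processing described in the paragraph above.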

6 Evaluation

We grouped affordances into 4 scenarios: (i) standalone aims to collect as many affordances as possible from all considered properties related to the extracted keywords. These are then connected as sentences and added to the input as text. (ii) just first extracts only the first affordance for a given keyword, i.e., the one that is the most important for the answer (meaning that we iterate over answer keywords first). (iii) definition adds both affordances and Wiktionary definitions to the knowledge part of the input, merging the two approaches. (iv) complementary adds definitions only when no affordances are available, which happens in almost 87.4% of cases. This way, the number of separators in the input always stays the same, and the same position holds either affordances or definitions.
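The four scenarios can be summarized as different ways of assembling the knowledge text. The sketch below is an interpretation, not the published implementation: the function name and scenario keys are hypothetical, and it assumes that the affordance list is already ordered with answer keywords first, so that its first element is the "most important" one.

```python
from typing import Sequence


def knowledge_for(scenario: str,
                  affordances: Sequence[str],
                  definitions: Sequence[str]) -> str:
    """Assemble the knowledge part of the input for one of the 4 scenarios."""
    if scenario == "standalone":
        parts = list(affordances)                      # all affordances
    elif scenario == "just_first":
        parts = list(affordances[:1])                  # only the first one
    elif scenario == "definition":
        parts = list(affordances) + list(definitions)  # both sources merged
    elif scenario == "complementary":
        # Definitions are used only when no affordance was found.
        parts = list(affordances) if affordances else list(definitions)
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    return " ".join(parts)


# Hypothetical toy inputs.
affs = ["A spoon is used for eating soup.", "Soup is located in a bowl."]
defs = ["Spoon: an implement for eating."]
for name in ("standalone", "just_first", "definition", "complementary"):
    print(name, "->", knowledge_for(name, affs, defs))
```

Because every scenario returns a single string, the number of [SEP] separators in the final input does not depend on the scenario.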

As PIQA provides a separate test set, we evaluate our classifiers on this subset using accuracy as a metric, which is a reasonable choice since the dataset is balanced (in 50% of the examples the first solution is correct, and in the remaining ones the second). We compared several popular BERT-based models, as they have proven to be good choices for commonsense reasoning tasks. Some of them, like RoBERTa-large, are available on PIQA’s leaderboard for comparison. However, we did not experiment with the top-ranked models like GPT-2 and DeBERTa, since they consist of over 1.5B parameters, which makes them hard to fit into GPUs. Thus, we limit our research to popular baselines.
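For reference, accuracy on a two-choice task is the fraction of examples for which the predicted solution index matches the gold index. The sketch below uses hypothetical predictions and labels; on a balanced dataset, always choosing the same solution would score 0.5.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of positions where the prediction equals the gold label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical predictions and gold labels (0 = first solution, 1 = second).
gold = [0, 1, 0, 1]
predicted = [0, 1, 1, 1]
always_first = [0, 0, 0, 0]
print(accuracy(predicted, gold))
print(accuracy(always_first, gold))
```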

Table 1 summarizes the accuracy of the various models in the baseline (no affordances), definition, and affordance scenarios. As there are 4 possible affordance scenarios described above, here we report the scores obtained from the best scenario. Because Wikipedia was part of each model's training set, we can draw an interesting conclusion: adding definitions from Wiktionary (already seen in the training phase) impairs the overall performance of each model. Conversely, affordances seem to help the overall results on average, especially in cases of bad performance on the baseline, such as the ALBERT model, which improves by almost 4%. Unlike definitions, which seem to worsen the overall results, affordances might be a good way to inform the model about our physical world. In general, we see that injecting affordances is beneficial: in all tested models the accuracy increased.

Table 1. Model accuracy from three viewpoints: baseline – no additional knowledge, definition – knowledge from Wiktionary, affordance – affordances from ConceptNet.

An in-depth analysis of the different affordance creation methods is summarized in Table 2. We can observe that the methods based on affordances alone seem to be better: for every model, one of the two methods that use only affordances obtains the highest accuracy. This observation supports the hypothesis that language models lack certain knowledge conveyed by affordances.

Table 2. Accuracy for various settings: Standalone – all possible affordances, Just first – only the first affordance found, Definition – all possible affordances and definitions, Complementary – only adding definitions when no affordances have been found.

7 Conclusions

We investigated how language models respond to commonsense physical knowledge and how well they understand the subject. To this end, experiments were conducted to determine how incorporating commonsense knowledge into the input of a language model influences the results. This was contrasted with ordinary encyclopedic definitions and with results obtained without any additional knowledge. To gain commonsense knowledge, this work introduces the concept of affordances, extracted from ConceptNet, to machine learning and question answering.

Different types of affordances were also examined. The paper presents 4 different affordance injection methods with a description and implementation, as well as a comparison between them. Surprisingly, they all lead to the same conclusion: the Wiktionary definition knowledge does not help the models to answer the questions, and it usually even makes the results worse. Of the methods tested in this paper, only those that rely solely on affordances are of value, namely the one that lists all possible affordances and the one that lists only the single most important affordance. These methods turned out to be the most effective in the conducted experiments. We published the source code online (Footnote 1).