1 Introduction

The availability and accessibility of home monitoring technologies have enabled friends, family members, and caregivers to track the activities of older adults [10]. Tracking the activities of daily living (ADLs) of older adults is also important for assessing their safety [14]. Smart homes are equipped with Internet of Things (IoT) devices that monitor the elderly and communicate real-time data to stakeholders [2, 13]. Ambient intelligence applications, such as motion and position detection, fall detection, telemonitoring, cognitive technology assistants, sensor-based wearable systems, and automated home monitoring, are among the tools that provide safety precautions for older adults during their daily activities [9, 10].

Among these ADLs, meal preparation is one of the most important [6, 22], especially when used as part of the home maintenance assessment for people with or without loss of independence. Activities such as meal preparation have recently been explored to exploit the assistive capabilities of ambient assistance solutions to prevent dangerous situations [4, 18, 22]. In the context of meal preparation, we have identified three categories of so-called dangerous situations. The category of omissions covers objects left on the cooking surface, the oven door left open, and burners left on without a pot on them [7, 14]. The category of absence of supervision concerns whether the person is present in front of the stove during the cooking process; it varies greatly depending on the temperature of the burners, the contents of the pans, the temperature of the oven, and the cooking time. The last category is presence in the home during the use of the stove. In the following, we focus on the dangerous situations of the omissions (forgetfulness) category.

Contextual awareness is particularly important in a high-risk activity such as meal preparation, where assistive technologies must be able to identify and assess problems as they arise and manage dangerous situations in real time. However, some elements of context are not observable by motion or contact sensors [3]. The integration of computer vision and deep regression network-based systems into smart stoves would provide more context to recognize certain dangerous situations during meal preparation. Combining a camera, IoT sensors, and real-time processing of their data makes it possible to introduce multiple object tracking techniques [5, 12, 19].

Multiple object tracking (MOT) is a research topic in computer vision [11, 16, 17, 21]. MOT is largely about locating multiple objects, maintaining their identities, and generating their trajectories from an input video. As a mid-level task in computer vision, multiple object tracking underlies high-level tasks such as pose estimation, action recognition, and behavior analysis [11]. It has many practical applications, such as visual surveillance, human-computer interaction, and virtual reality [11].

In this paper, MOT is used for (1) the detection of objects appearing on top of the stove, to improve safety during meal preparation when, for example, a burner is left hot with a cloth on it; and (2) the ability to maintain or abandon the tracking of detected objects to take the evolving context into account. Since a dangerous situation can be caused by different types of objects appearing on the stove, we developed a real-time algorithm to extract multiple perspective views from a single capture device, track objects in real time, and stop tracking an object once it leaves the camera’s field of view over the stove.

The remainder of this paper is organized as follows. Section 2 briefly introduces the COOK system. Section 3 presents our object-tracking pipeline. Section 4 presents how this algorithm is applied to the COOK system. Section 5 presents the results and a brief discussion. Finally, Sect. 6 concludes the work.

2 COOK: Cognitive Orthosis for CoOKing

The Cognitive Orthosis for coOKing (COOK) is an innovative, context-aware, stove-connected smart tablet application designed to optimize the independence of individuals with cognitive deficits during meal preparation [7, 18]. The application was originally designed to specifically target the cognitive deficits of people with moderate or severe traumatic brain injury [6, 14], but the system is now intended for anyone who wants a safe environment for meal preparation. The COOK system addresses non-emergency situations. It assists the person with tasks related to meal preparation (ingredients to prepare, planning, safety rules, etc.) and can detect pre-defined hazardous situations through various sensors, including motion and door sensors, ultrasonic range finders, flame and high-temperature sensors, and smart switches, as shown in Fig. 1.

Fig. 1. Example of some sensors installed on the smart stove.

The main components are (1) a Self-monitoring Security System, (2) a Cognitive Assistance Appliance, and (3) a Configuration System. Only the Self-monitoring Security System is discussed in the rest of this paper.

2.1 Self-monitoring Security System

In autonomous operation, i.e., independently of the other components of the system, the Self-monitoring Security System (SSS) has the role of monitoring all the so-called dangerous or hazardous situations [1]. While the user is cooking, the SSS will, if necessary, progressively indicate the presence of a potentially hazardous situation that must be resolved within a predefined time frame. The SSS is designed to monitor three hazardous situations: supervision, omissions, and presence. The safety system collects information from the sensors, detects critical errors and hazardous situations, and, if necessary, turns off the stove and calls for help. Through the sensor infrastructure, the safety system can detect the temperature of the top of the stove, the presence of a person near the stove, the opening of the oven door, the temperature of each burner, the amount of electric current flowing through each burner, etc.

Several standard and predefined security rules are provided by the safety system. To monitor dangerous situations, the SSS uses sensors such as motion and door sensors, ultrasonic distance sensors, and flame and high-temperature sensors. These sensors are not able to distinguish between the different objects that arrive on the burner surface, such as kitchen towels, hands, and cooking utensils. Therefore, the SSS needs a preventive assistance model that captures each user’s idiosyncrasies, personalizes preventive assistance, and provides context-aware assistance regarding food preparation hazards. The objective of this version of the SSS is to extend its ability to detect new dangerous situations by adding a camera to the existing sensors and a new real-time object-tracking model to the SSS. The following situations could then be detected in conjunction with user context information.

  • Leaving the active hotplate empty for several seconds.

  • Forgetting an object on the cooking surface.

  • Detecting in real time a hand reaching for a hot surface.

  • Following in real time the objects that appear on the cooktop.

In summary, the goal is a module that makes it possible to follow, in real time, the multiple objects that appear on the cooktop during the preparation of a meal.

3 Multi-object Tracking

Object detection is the task of predicting an object’s location (bounding box) together with its class [21]. By locating the potential object positions in each frame, object detection provides the observations for detection-based object tracking. Therefore, every tracking method requires an object detection mechanism, either in every frame or when the object first appears in the video [11].

Object tracking is an application of computer vision in which an object is detected in a video and then followed: it is the process of identifying the same object and keeping track of its location, under a unique label, as it moves around in a video. For example, given a video of someone cooking a meal, one may want to track the location of the frying pan throughout the video in real time by estimating its trajectory. Object tracking can be done using one of the two approaches that exist in the field: Single Object Tracking (SOT) and Multiple Object Tracking (MOT) [3, 16].

Multiple Object Tracking is when several objects are tracked at the same time within the same video or the same set of frames [17]. For example, given a video of a meal preparation activity, one may want to track the location of the pans, spoons, and utensils continuously throughout the video in real time by estimating their trajectories. Proposed real-time MOT systems typically follow the tracking-by-detection paradigm [5, 12], which includes (1) a detection model for target localization and (2) an appearance embedding model for data association. In other words, the proposed MOT approach has to accurately detect the objects in each frame and assign them consistent labels. There are cases in a food preparation activity where the visual appearance of a moving object is not clear; for example, a spoon may be hidden behind a frying pan. In such cases, detection would fail while tracking would succeed. That is why we combine these two modes to provide a safe environment for food preparation.

The proposed MOT algorithm is expected to detect and localize objects in a video in a fraction of a second and with high accuracy. Detection speed can be significantly affected by the variety of stove burners and by backgrounds with colors or textures similar to the object. Another significant difficulty in the context of meal preparation is occlusion, i.e., when the object in question is partially or completely hidden by another object. A further problem is that an object in a video can appear at different sizes and orientations. In food preparation, two objects often cross each other; knowing which is which is the problem known as identity switching. Another problem with MOT, which also matters for detecting and recognizing objects, is motion blur: the object is blurred due to its own motion or the motion of the camera, so it no longer looks the same. A typical example is a quick move of the hand in front of the camera to grab an object on the cooking surface or to stir the soup in the pot.

Two object tracking techniques are used to address these problems: detection-based tracking and detection-free tracking [20]. In detection-based tracking, successive video frames are presented to a pre-trained object detector, which generates detection hypotheses that are used to build tracking trajectories. In detection-free tracking, a fixed number of objects must be manually initialized in the first frame and are then tracked in subsequent frames; this approach cannot deal with new objects appearing in intermediate frames.

Typically, objects move back and forth during meal preparation. For our context, this means that detection-free tracking is not appropriate. In addition, objects that appear and disappear must be handled automatically, i.e., recognized when they appear and forgotten when they disappear. Therefore, detection-based tracking meets our needs: the object detector is run every n frames, and the tracker makes the remaining predictions. For tracking over a long period of time, such as food preparation, this approach is very suitable.

3.1 General Algorithm

To track objects appearing on the cooking surface of the stove during a meal preparation activity in real time, this study proposes a hybrid method. For this purpose, the Kernelized Correlation Filter (KCF) [8], which is particularly fast at tracking, is combined with YOLO [15], an accurate and relatively fast detection model. The workflow of the proposed method is shown in Fig. 2. First, the objects in the video image are detected by the YOLO algorithm, and each target is then continuously tracked by the KCF algorithm. After tracking for a certain number of frames, the detection mechanism is re-introduced; the output bounding boxes are obtained through a learning-based fusion of the detection and tracking results, and when new targets appear in the field of view, they are initialized and tracked.

Fig. 2. General Algorithm for Safe COOK using YOLO and KCF.

To summarize, an input video stream is processed in the pipeline, and the output is the bounding box of each food-preparation-related object detected in the video frame, with its identity obtained from the tracking process, and with the ability to automatically start tracking a new object or stop tracking one. Two main concepts are involved: (1) an estimation model and (2) data association. The estimation model estimates the next bounding box of a detected object, while data association maps the predicted bounding boxes to the actual detected bounding boxes in order to link them to unique identification numbers. The KCF filter updates and predicts the state at each step to complete the tracking process. The measurement taken from each frame is the center of gravity of the bounding box found by YOLO. This process is performed along the entire video stream, with each frame considered a different state and the predicted values updated recursively with the measurements.
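
To make the pipeline concrete, the loop below is a minimal sketch of this detect-then-track scheme under stated assumptions: detect_objects() is a hypothetical helper wrapping the YOLO model that returns (label, (x, y, w, h)) pairs, the input file name is illustrative, and the 30-frame re-detection interval mirrors the trade-off discussed in Sect. 4. This is a sketch, not the exact COOK implementation.

```python
import cv2   # cv2.TrackerKCF_create requires the opencv-contrib-python build
import math

DETECT_EVERY = 30  # re-run the detector every N frames (speed vs. accuracy)

def centroid(box):
    """Center of gravity of an (x, y, w, h) bounding box."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

cap = cv2.VideoCapture("meal_preparation.mp4")  # illustrative input
tracks = []    # list of dicts: {"id", "label", "tracker", "box"}
next_id = 0
frame_idx = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break

    if frame_idx % DETECT_EVERY == 0:
        # Detection step: re-initialise one KCF tracker per detection and
        # keep the identity of the nearest same-label track (naive nearest-
        # centroid association; a real system would forbid duplicate ids).
        new_tracks = []
        for label, box in detect_objects(frame):  # hypothetical YOLO wrapper
            same = [t for t in tracks if t["label"] == label]
            nearest = min(
                same,
                key=lambda t: distance(centroid(t["box"]), centroid(box)),
                default=None,
            )
            if nearest is None:
                obj_id, next_id = next_id, next_id + 1
            else:
                obj_id = nearest["id"]
            tracker = cv2.TrackerKCF_create()
            tracker.init(frame, tuple(box))
            new_tracks.append(
                {"id": obj_id, "label": label, "tracker": tracker, "box": box}
            )
        tracks = new_tracks
    else:
        # Tracking step: update each KCF tracker and drop the ones that
        # fail, e.g. when an object leaves the camera's field of view.
        alive = []
        for t in tracks:
            ok, box = t["tracker"].update(frame)
            if ok:
                t["box"] = box
                alive.append(t)
        tracks = alive

    frame_idx += 1
```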

3.2 YOLO Detector Model

This algorithm is called You Only Look Once (YOLO) because it identifies objects and their locations using bounding boxes by looking at the image only once [15]. YOLO is proposed for real-time object detection in high-speed video using a convolutional neural network architecture. It has several advantages over traditional methods, including high speed, low computational cost, and fewer background errors. Instead of processing each class separately, the YOLO network turns the detection problem into a regression problem where the image is passed through the network only once. This makes it much faster than traditional object detection algorithms at locating detected objects in video frames. The single deep convolutional neural network used by YOLO predicts all the bounding boxes and the class probabilities for those boxes at the same time, using features from the entire image. It is suitable for end-to-end training in real time while maintaining good average accuracy. The YOLO object detector is fast because it works in one step: the image is first divided into an S×S grid, and each grid cell predicts B bounding boxes, including the coordinates, the width, the height, the confidence value of the box, and the conditional class probabilities.
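
As an illustration of such single-pass detection, the snippet below loads a pretrained YOLOv5 model through the public ultralytics/yolov5 torch.hub interface; the image path is illustrative, and this is a generic example rather than the exact model configuration used by COOK.

```python
# Hedged example: one forward pass of YOLOv5 over a whole image.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s")  # pretrained weights
results = model("kitchen_frame.jpg")  # the image is processed only once

# results.xyxy[0] is an N x 6 tensor: x1, y1, x2, y2, confidence, class index
for *box, conf, cls in results.xyxy[0].tolist():
    print(model.names[int(cls)], round(conf, 2), [round(v) for v in box])
```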

3.3 Kernelized Correlation Filter Tracker Model

Correlation measures how similar two patterns are, with more similar patterns being more highly correlated. It is obtained by the pairwise product of corresponding pixels. Correlation filtering, which uses trained sample images that capture the appearance of objects of interest, is a widely used method in visual tracking applications. Unfortunately, the large sample size required by correlation filters is computationally demanding, conflicting with real-time requirements, while limiting the number of samples comes at the expense of performance. A novel idea that significantly increases the tracker’s computational speed is to exploit the circulant structure of the data matrix for larger sample sizes and to extend it using the kernel trick. This is known as the Kernelized Correlation Filter (KCF) tracker [8]. To perform fast tracking, the KCF exploits the relationship between circulant matrices and the discrete Fourier transform. The rows of the circulant matrix consist of the target model and its cyclic shifts. The ridge regression used to learn the image windows is solved in the frequency domain using the Discrete Fourier Transform. The KCF tracking algorithm has two main phases, training and detection. In the training phase, a classifier is trained on a set of samples; features such as color histograms or histograms of oriented gradients serve as feature vectors. In the detection phase, a new sample is evaluated via the Fourier transform: typically, the weights obtained after training are multiplied by the frequency-domain test images to find the likely position of the target. Before any computation can be performed, each tracker requires the extraction of frames. Once the frames have been extracted, the KCF trains a model on the image patch at the initial position of the target.
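
For reference, in the notation of the original KCF paper [8] (a brief recap, with $\hat{\cdot}$ denoting the discrete Fourier transform and $\odot$ element-wise multiplication), the ridge regression has the closed-form frequency-domain solution

$$\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda},$$

where $k^{xx}$ is the kernel correlation of the target template $x$ with itself, $y$ holds the regression targets for all cyclic shifts, and $\lambda$ is the regularization parameter. At detection time, the responses for all cyclic shifts of a candidate patch $z$ are evaluated at once as $f(z) = \mathcal{F}^{-1}\!\left(\hat{k}^{xz} \odot \hat{\alpha}\right)$.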

4 Multi-object Tracking in COOK

In this study, we propose a fast object-tracking pipeline that combines accurate YOLO detection and fast KCF tracking to track objects specific to meal preparation. Typically, several objects appear and disappear quickly on the cooking surface during a meal preparation activity. For this reason, we have added an extra step that quickly drops tracking. This solution was chosen because it is useful in environments where resources are limited, such as embedded systems. The overall architecture of the implemented system is shown in Fig. 3. The system uses deep learning, OpenCV, and Python to detect objects in images and video streams with the YOLO object detector. We created an additional dataset of images with different types of kitchen utensils to train the YOLO models and test the proposed methods.

Fig. 3. Architecture of the Safe COOK system, showing how the MOT can provide more information in the context of hazardous situation detection.

Fig. 4. Sample frames from our training dataset, showing detection of hands, stoves, pans, and spoons.

In this architecture, the camera collects data from the stove and uses the ZeroMQ (ZMQ) protocol to transfer it. ZeroMQ is a high-performance asynchronous messaging library intended for use in distributed or concurrent applications. The frames are then passed to OpenCV, which runs the proposed solution for multiple object detection. The proposed real-time method, through an object detection, localization, and correction process, automatically detects and tracks the utensils appearing in a sequence of images. The central coordinates (x, y) from YOLO are updated every thirty frames, a trade-off between accuracy and speed.

A home kitchen with a stainless steel countertop and a spiral kitchen layout was used to record the meal preparation videos. The videos were recorded with a high-resolution camera connected directly to a Raspberry Pi 4 Model B+ with 4 GB of RAM. Image transport was done using imageZMQ, a set of Python classes that transport OpenCV images between computers using PyZMQ messaging. Experiments were done with the YOLOv5 network. Existing datasets like Open Images provide a large set of annotated images (bounding boxes) covering 600 classes. However, this number of classes is limited, and some classes do not have enough images for proper training.
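
As a sketch of this transport, the snippet below uses the public imagezmq API; the host name and port are illustrative rather than the actual COOK deployment settings.

```python
# Hedged sketch of imageZMQ's request/reply image transport. The two
# functions run on different machines.
import cv2
import imagezmq

def run_sender():
    """Raspberry Pi side: capture frames and ship them to the hub."""
    sender = imagezmq.ImageSender(connect_to="tcp://processing-host:5555")
    cap = cv2.VideoCapture(0)  # camera mounted above the stove
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        sender.send_image("stove-cam", frame)  # blocks until the hub replies

def run_receiver():
    """Processing side: receive frames and feed them to the MOT pipeline."""
    hub = imagezmq.ImageHub()  # binds tcp://*:5555 by default
    while True:
        name, frame = hub.recv_image()
        # ... run the YOLO + KCF pipeline on `frame` here ...
        hub.send_reply(b"OK")
```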

Therefore, data from Open Images and Google Colab were used for training on an Intel® Core™ i7-3770 CPU (3.40 GHz, 4 cores with 8 threads) and a Raspberry Pi 4 with 4 GB of RAM. We mainly labeled the following classes: Pan, Hand, Stove, and Spoon, as shown in Fig. 4. Many datasets exist for testing object tracking methods in videos, but they do not consist of meal preparation videos. Therefore, to create a dataset for our specific Safe COOK application that tracks meal preparation, we tagged 3500 frames. From this dataset, 1000 frames are used for training the YOLOv5 models, while the remaining 2500 frames are used for testing.

5 Results and Discussion

This solution adds further context to the detection of hazardous situations. It provides data that is combined with sensor information to offer greater safety during meal preparation. However, some problems need to be solved to optimize it further.

Foremost, YOLO does not recognize the identity of a given object from one image to the next; for example, there is no way to be sure that the detected pan is the same pan that was detected in the previous image. Our solution is to compare the positions of the objects detected by YOLO with the positions of the objects being tracked: if we are tracking an object of the same type at the position of a YOLO-detected object (within some tolerance depending on the implementation), we consider them the same object. With this approach, we can now recognize both the type and the instance of each object; for example, we can distinguish between pan1, pan2, and so on.
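
Concretely, a matching rule of this kind can be sketched as follows; the 40-pixel tolerance is an illustrative value rather than the one used in COOK, and boxes are (x, y, w, h) tuples.

```python
# Hedged sketch of the instance-matching rule: a YOLO detection and a tracked
# object are the same instance if they share a class label and their centers
# lie within `tol` pixels of each other.
import math

def same_instance(detection, track, tol=40):
    d_label, d_box = detection
    t_label, t_box = track
    if d_label != t_label:
        return False
    dcx, dcy = d_box[0] + d_box[2] / 2.0, d_box[1] + d_box[3] / 2.0
    tcx, tcy = t_box[0] + t_box[2] / 2.0, t_box[1] + t_box[3] / 2.0
    return math.hypot(dcx - tcx, dcy - tcy) <= tol
```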

The second problem is that the tracker may start tracking a similar object instead of the previously detected one. This happens even with KCF, which was specifically chosen for its ability to easily stop tracking when the object is lost, and it is most likely to occur when an object is close to identical objects. For example, how can you track an object that has the same color as the image background? How can you be sure that you are still tracking the right object? The solution chosen for this problem is to run object detection (YOLO) with a very low confidence threshold.

Fig. 5. Performance for tracking and detecting objects in interaction.

The objects detected at this low confidence level are compared to the tracked objects; in other words, does the tracked object have at least a minimal chance of actually being an object of its type? If not, we stop tracking it. Since tracking is usually stopped because of this condition rather than because of a loss of tracking, the overall result is an algorithm that detects objects easily (minimum confidence level) while still identifying them correctly (higher initial confidence level). On the other hand, this solution greatly reduces the efficiency of the tracking. It would therefore be ideal to find another solution that drops tracking just as easily.
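
A minimal sketch of this keep-or-drop check, reusing the hypothetical same_instance helper from the previous sketch; both confidence thresholds are illustrative values, not the ones used in COOK.

```python
# Keep a track only if a low-confidence YOLO pass still finds a matching
# detection near it; otherwise the track is dropped.
HIGH_CONF = 0.5   # confidence required to create a new track
LOW_CONF = 0.05   # minimal confidence required to keep an existing track

def confirm_tracks(tracks, detections):
    """tracks: (label, box) pairs currently being followed;
    detections: (label, box, confidence) triples from a low-threshold pass."""
    kept = []
    for track in tracks:
        supported = any(
            conf >= LOW_CONF and same_instance((label, box), track)
            for label, box, conf in detections
        )
        if supported:
            kept.append(track)
    return kept
```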

Thanks to the modularity of our algorithm, the YOLO and KCF detection and tracking techniques can easily be replaced with different models. We replaced KCF with other tracking algorithms to evaluate the effectiveness of our tracking-loss algorithm, starting from the simple use of YOLO without tracking, then YOLO with single tracking without detecting multiple instances of objects, then YOLO with tracking of multiple object instances, and finally YOLO with tracking of multiple object instances with drop tracking. The sequences obtained are shown in Fig. 5.

Table 1. Performance evaluation of the different methods with and without tracking.

Note that the multiple-instance tracking algorithm without drop tracking has the best performance metric. However, it often tracks the wrong objects, so on the whole it is not optimal. In addition, although YOLO alone has a better performance metric than the full algorithm, the full algorithm (MOT + drop) has better detection and is still fast enough (real time) to be useful in the context of the Safe COOK project. The performance quantified on a video of the experiment is shown in Table 1.

6 Conclusion

SafeCOOK is a real-time system for tracking multiple objects associated with preparing a meal while using the Cognitive Orthosis for CoOKing (COOK). It provides more contextual information to the safety self-monitoring system for the detection of potentially dangerous situations that are not currently managed.

The proposed system is a low-cost monitoring system, requiring no graphics card, based on YOLOv5 and KCF. The main idea is to reduce the limitations of the resulting tracking system by combining a tracker and a detector to take advantage of both methods. To meet the requirements of the fast and frequent appearance and disappearance of cooking utensils on the cooking surface, we propose in this paper an algorithm that fuses detection, object tracking, and fast object drop tracking. The fusion algorithm is designed so that the detection and tracking information are complementary in order to fulfill the task of detecting hazardous situations during food preparation in different scenarios. The findings show that the detection and tracking fusion algorithm can effectively detect and track kitchen tools in different scenarios, with a success rate exceeding that of a single detection or tracking algorithm.