Things in the air: tagging wearable IoT information on drone videos

Drones have been applied to a wide range of security and surveillance applications recently. With drones, the Internet of Things is extending into 3D space. An interesting question is: Can we conduct person identification (PID) in a drone view? Traditional PID technologies such as RFID and fingerprint/iris/face recognition have their limitations or require close contact to specific devices. Hence, these traditional technologies cannot be easily deployed to drones due to the dynamic change of view angle and height. In this work, we demonstrate how to retrieve IoT data from users' wearables and correctly tag them on the human objects captured by a drone camera to identify and track ground human objects. First, we retrieve human objects from videos and conduct coordinate transformation to handle the change of drone positions. Second, a fusion algorithm is applied to measure the correlation of video data and inertial data based on the extracted human motion features. Finally, we can couple human objects with their wearable IoT devices, achieving our goal of tagging wearable device data (such as personal profiles) on human objects in a drone view. Our experimental evaluation shows a recognition rate of 99.5% for varying walking paths, and 98.6% when the drone's camera angle is within 37°. To the best of our knowledge, this is the first work integrating videos from drone cameras and IoT data from inertial sensors.

Introduction

Traditional PID technologies, such as fingerprint and iris recognition, rely on users' biological features and require close contact to specific devices. Hence, these solutions are not applicable to drones with changing height and view angle.
In this work, we present an approach to identifying and tracking human objects in the videos taken from a drone and correctly tagging them with their personal profiles, retrieved through wireless communications from their wearable IoT devices. Figure 1 shows what our system can achieve by integrating a drone camera and wearable devices. To the best of our knowledge, this is the first work correlating IoT data and computer vision from a drone camera. By combining IoT and computer vision, future aerial surveillance systems could be even smarter and more user-friendly. For example, in Fig. 1, the visualized information can even include users' personal profiles and past activities from before they actually entered the camera view. Potential applications also include military training and construction site monitoring, where people are required to wear badges. Whenever needed, the fusion server can also send alert messages to specific persons (for example, when a person enters a dangerous region found by the drone).
We propose a data fusion approach that combines videos with inertial sensor data. Our system has an aerial camera, some wearable devices, and some ground markers. A drone carries a camera to continuously record videos. A crowd of people are in the drone view, and some of them may wear their wearable devices, each of which is an IoT device with some inertial sensors. This allows us to collect user profiles and motion data. Both video data and IoT data are transmitted to a fusion server for correlation analysis. The correlation procedure consists of four steps. First, we retrieve human objects from videos by using a deep learning network. Second, we transform these human objects from the pixel space to a ground coordinate system to handle the change of drone positions. We use some ground ArUco markers [13] for the transformation. Third, we extract human motion features from both video data and inertial sensor data. Finally, a fusion algorithm is applied to measure the similarity of each pair of normalized motion feature series, one from videos and the other from wearable devices. By quantifying all-pair similarity scores, we are able to couple human objects with their IoT devices, achieving our goal of tagging wearable IoT data on drone-recorded videos.
To validate our idea, we conduct a number of experiments by varying people's walking paths, the number of people, and the drone's view angle. Experimental results show over 99.5% accuracy for two people walking along a variety of paths, including circular, non-interleaving, interleaving, and random. The accuracy is 98.6% when the drone's camera angle is within 37°, showing that our system can handle the change of drone positions during PID. Our work makes the following contributions:
• The proposed surveillance system is, to our knowledge, the first to integrate drones with wearable IoT devices.
• Our fusion approach correlates two independent sources, video data and inertial data, thus achieving PID and tracking with less effort and even without seeing human biological features.
• Our system tackles the changing height and view angle of drones, and can work without any data labeling process. However, if certain training effort (such as face labeling) is available, our system can leverage it to further improve its accuracy.

Related work
The PID issue has been extensively studied in the computer vision and IoT fields. It plays a critical role in many security [14], surveillance [15], and business-intelligence [16] applications. The work [14] presents a smart homecare system by utilizing PID and behavior identification methods. The work [15] improves PID performance in surveillance systems by combining deep learning with multiple metric ensembles. Reference [16] proposes a PID method for human-robot interaction by using a simplified fast region-based convolutional network. In this section, we briefly review the state-of-the-art vision-based PID approaches, biometric-based PID approaches, multi-sensory fusion solutions, and drone-based PID solutions.
In the area of computer vision, PID has been widely studied in pedestrian tracking [17][18][19] and face recognition [20,21] recently. The work [17] realizes PID by video ranking using discriminative space-time and appearance feature selection. In [18], the detected pedestrians across a whole video are linked by fusing human pose information. Reference [19] addresses the problem of pedestrian misalignment in a video sequence by combining appearance features and pose features. How to conduct face recognition using deep convolutional neural networks (CNNs) is discussed in [22]. Din and Ta [20] proposed a comprehensive framework based on CNNs for video-based face recognition. Reference [21] addresses low-resolution face recognition in the wild via selective knowledge distillation. However, vision-based PID solutions are sensitive to lighting conditions, occlusion, and viewing direction. The work [23] investigates face recognition performance on drones, including Face++ [24], Rekognition [25], and OpenCV [26]. It points out three obstacles for face recognition: (i) the small-sized facial images taken by drones from long distances, (ii) the pose variances introduced by large angles of depression, and (iii) the large labeled datasets required for model training.
Biometric-based PID approaches identify human objects by measuring their unique physical or behavioral characteristics, mainly including hand geometry, fingerprint, and iris. Reference [11] studies curvature features for 3D fingerprint recognition, while [27] proposes a finger-vein-based biometric identification system under different image quality conditions. Iris recognition based on human-interpretable features is proposed in [12]. Information fusion by combining multiple sources (e.g., face plus fingerprint, and face plus iris) is studied in [28]. However, fingerprint and iris recognition need users' biological features and require close contact to specific devices. Hence, biometric-based solutions are more suitable for indoor applications such as smart home [6], physiology monitoring [7], and behavior analysis [8]. Applying them to drones would be difficult.
Multi-sensory fusion solutions identify and track human objects by fusing data from multiple cameras, inertial measurement units (IMUs), and IoT devices. In general, when partially independent information is available from multiple sources, their combination may improve accuracy over a single-source solution. Reference [29] uses the features from static cameras and IMU sensors to identify and track multiple persons. References [30,31] propose a deep learning multi-modal framework to match silhouette video clips and accelerometer signals to identify subjects. A through-wall PID system based on gait features from WiFi signals and video-based simulations is proposed in [32]. However, these solutions require static cameras with their lenses parallel to the subjects, making it difficult to apply them to highly dynamic drone views and angles.
Drone-based PID solutions aim to identify and track human objects from drone videos. Reference [33] studies multiview human motion capturing with an autonomous swarm of drones. The work [34] presents a large-scale visual object detection and tracking benchmark, named VisDrone2018. The work [35] develops fully convolutional networks for human crowd detection in drone-captured images. However, detecting human objects and tracking them do not imply getting their IDs. Reference [36] presents an introductory study pushing PID onto mobile platforms, such as drones, and introduces some differences and challenges compared to standard fixed cameras. One major challenge is the drone's continually varying view angle and position. Reference [37] uses thermal cameras on UAVs for human tracking. Reference [38] proposes a visual representation for drones to detect persons. The study [39] combines color feature descriptors and frontal face perception for UAVs to track walking persons. Reference [40] proposes a multiple-person detection and tracking method using a CNN and the Hungarian algorithm on drone images. However, these solutions may suffer accuracy degradation due to changing environmental factors and target appearances.

System model
We see a new opportunity in the widely adopted IoT devices, such as smart watches and smart phones, which have become virtually the unique identity of a person. Most wearable devices are equipped with inertial sensors and, more importantly, have access to the owners' personal profiles, which may reveal (with the owners' consent) a lot of critical information (such as identity, age, sex, citizenship, social networks, past activities, purchase records, etc.). Our goal is to study PID on an aerial platform, such as a drone, by integrating an aerial camera and wearable IoT devices. Identifying a specific person in a drone view and knowing his or her personal information becomes much easier with the help of wearable devices. We can visualize not only people's IDs, but also their personal profiles, even under a lot of occlusions. Figure 2 shows an imaginary smart surveillance scenario in a square, which may shed some light on how our system works. Figure 3 shows the general concept of our data fusion model, where both the camera and the wearable devices try to capture common features for fusion purposes. Figure 4 shows our data processing flow, which consists of six main modules.

People detection and tracking
The drone camera provides video streams. The image frame at time index t is denoted as F_t. We then use You Only Look Once (YOLO) [41] to retrieve all objects that are recognized as human. The bounding box of the i-th human object in frame F_t is denoted as B^P_t(i), where P means pixel space. From continuous images, a human object's walking trajectory is derived from continuous bounding boxes by Simple Online and Realtime Tracking (SORT) [42]. We denote by tr(B^P_t(i)) the trajectory of B^P_t(i) that happened before F_t. If B^P_t(i) is a newly appeared object, then tr(B^P_t(i)) = ∅. So the trajectory up to time t is tr(B^P_t(i)) + B^P_t(i) (+ means concatenation). Note that since we are only concerned with a person's location, a bounding box is represented only by its center (in pixel space).
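The per-frame bookkeeping above can be sketched as follows. The helper names `box_center` and `update_trajectories` are our own illustration (not part of the paper's implementation), and the detector/tracker outputs from YOLO and SORT are assumed to arrive as per-frame dictionaries keyed by track id.

```python
def box_center(box):
    """Reduce a bounding box (x1, y1, x2, y2) to its center pixel,
    since only a person's location matters."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def update_trajectories(trajectories, tracked_boxes):
    """Append each tracked box's center to its trajectory tr(B_t^P(i)).

    trajectories : dict mapping track id -> list of center points
    tracked_boxes: dict mapping track id -> (x1, y1, x2, y2), e.g. from SORT
    """
    for tid, box in tracked_boxes.items():
        # A newly appeared object starts with an empty trajectory.
        trajectories.setdefault(tid, []).append(box_center(box))
    return trajectories

# Hypothetical two-frame example: object 1 moves right; object 2 appears later.
trajs = {}
update_trajectories(trajs, {1: (0, 0, 10, 10)})
update_trajectories(trajs, {1: (2, 0, 12, 10), 2: (20, 20, 30, 40)})
print(trajs[1])  # [(5.0, 5.0), (7.0, 5.0)]
```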

Coordinate transformation
Due to mobility, the drone's view angle and location may keep changing, making the camera coordinate system change as well. Therefore, the above trajectory tr(B^P_t(i)) + B^P_t(i) may be inconsistent. To maintain a fixed coordinate system, we place four ArUco markers [13] of a visible size on the ground. An ArUco marker is a synthetic square composed of a wide black border and an inner binary matrix that determines its identifier. The black border facilitates fast detection, and the binary codification enables quick identification with a certain error detection and correction capability. These markers allow us to transform a pixel location to a global ground space with respect to the markers' locations. A transformation matrix M_t for time t will be calculated. The ground coordinate is written as M_t(B^P_t(i)).

Tagging and visualization
Based on the pairing output PAIR, the final step is to integrate the result with a target application. We use tagging IoT information on human objects in a drone view as an example. This allows us to visualize not only people's IDs, but also their personal profiles, which may lead to a lot of interesting applications, such as military training, pedestrian monitoring, homeland security, group tracking, business data analysis, people flow analysis, and so on, as illustrated in Fig. 2.
We remark that the above process is all based on the users' willingness to collaborate with our system. Nevertheless, our fusion system is still able to distinguish those people who collaborate and who do not. For the latter, we can tag them as "unknown persons".

Fusion methodology
We have introduced our fusion architecture above. Below, we provide more technical details of some specific modules.

People detection and tracking
For the drone videos, we retrieve all human objects from each image by YOLO [41], and then establish frame-to-frame associations to connect these human objects by SORT [42]. Let B^P_t(i) denote the bounding box of the i-th human object in frame F_t, represented by its center in the pixel space.

Coordinate transformation
To maintain a consistent coordinate system, we place four ArUco markers [13] at the four corners of the ground area. Note that this requirement can be removed by some modern computer vision technologies. Figure 5 shows some ArUco markers [13]. We use these markers as invariants for projecting a pixel in a video frame onto our viewing plane (i.e., the ground space). Given a bounding box B^P_t(i), represented by its center pixel, the transformation process involves the following steps.

Marker detection
Given an image frame where at least four ArUco markers are visible, we use the function detectMarkers of OpenCV for detection. Note that we only need to detect two markers if the recognition area is of a symmetrical shape. Let the four markers be found at the pixel coordinates P_1(x_1, y_1), P_2(x_2, y_2), P_3(x_3, y_3), and P_4(x_4, y_4). During marker deployment, we also record their coordinates in the ground space, say, G_1, G_2, G_3, and G_4.

Transformation matrix generation
Using the locations of these markers on the ground as invariants, it is possible to compute a transformation matrix from the pixel space to the ground space. SIFT [43] is a feature detection algorithm in computer vision to detect and describe local features in images. It is widely used in applications such as object recognition, robotic mapping and navigation, and image stitching, and is also available in OpenCV. We use findEssentialMat in OpenCV to derive the transformation matrix M_t.

Coordinate transformation
By M_t, we then conduct the coordinate transformation B^G_t(i) = M_t(B^P_t(i)). For example, Fig. 6a is a drone image. After transforming every pixel to the ground space, the ground image looks like Fig. 6b, where the four markers do form a square. Note that at least two markers are needed for the transformation.
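As an illustration of the pixel-to-ground mapping, the sketch below solves the planar projective transform directly from four marker correspondences by Gaussian elimination; in practice, OpenCV routines such as getPerspectiveTransform compute the same thing. The function names are our own and not part of the paper's implementation.

```python
def homography_from_points(pixel_pts, ground_pts):
    """Solve the 8 unknowns of a planar projective transform from four
    pixel -> ground correspondences (e.g. the four ArUco marker positions).
    For each pair (x, y) -> (u, v):
        u = (ax + by + c) / (gx + hy + 1),  v = (dx + ey + f) / (gx + hy + 1)
    which yields an 8x8 linear system, solved here by Gauss-Jordan
    elimination with partial pivoting."""
    A, b = [], []
    for (x, y), (u, v) in zip(pixel_pts, ground_pts):
        A.append([x, y, 1, 0, 0, 0, -x * u, -y * u]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -x * v, -y * v]); b.append(v)
    n = 8
    M = [row + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    # Return the 9 entries of the 3x3 matrix (last entry fixed to 1).
    return [M[i][n] / M[i][i] for i in range(n)] + [1.0]

def to_ground(h, x, y):
    """Map a pixel (x, y), e.g. a bounding-box center, to ground coordinates."""
    w = h[6] * x + h[7] * y + h[8]
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)
```

For instance, with markers detected at the corners of a 100x100-pixel region mapping to a 5 m x 5 m ground square, `to_ground(h, 50, 50)` returns the ground-space center (2.5, 2.5).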

Video feature extraction
Recall that tr(B^G_t(i)) is the trajectory of B^G_t(i) in the ground space before time t. From the trajectory tr(B^G_t(i)) + B^G_t(i), we derive three motion-related features as follows:
1. Acceleration feature F^V_t(i).f1: Since the sampling rate is fixed, the second derivative of tr(B^G_t(i)) + B^G_t(i) is the acceleration. So we set this feature value to d(x_3, x_2) − d(x_2, x_1), where function d() returns the Euclidean distance between two points and x_1, x_2, and x_3 are the last three points in tr(B^G_t(i)) + B^G_t(i), in that order.
2. Orientation feature F^V_t(i).f2: This is the facial direction of a human object. We suggest that this feature can be derived from the last few points of tr(B^G_t(i)) + B^G_t(i) together with the body shape in the bounding box at t.
3. Rotation feature F^V_t(i).f3: This is the change of angle over the last three points in tr(B^G_t(i)) + B^G_t(i). We suggest that this feature can be calculated as a(x_2 x_3) − a(x_1 x_2), where x_1 x_2 and x_2 x_3 are the vectors between consecutive points and function a() returns the angle of a vector.
The above three features are derived for each bounding box B G t (i) . We make some remarks below. First, since a trajectory may not contain enough data points because of occlusion and object detection errors, we will mark the corresponding feature values as "unknown" if this is the case. Second, since people's walking behaviors may not be detectable if the video sampling rate is set to 30 fps, some down-sampling may be conducted on the raw data.
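A minimal sketch of the acceleration and rotation features (f1 and f3) computed from the last three trajectory points. The orientation feature f2 is omitted here because it additionally requires the body shape in the bounding box; trajectories that are too short yield "unknown" (None) values, as remarked above. The function name is our own illustration.

```python
import math

def video_features(traj):
    """Derive the acceleration (f1) and rotation (f3) features from the
    last three ground-space points of tr(B_t^G(i)) + B_t^G(i).

    Returns (None, None) ("unknown") when the trajectory is too short,
    e.g. because of occlusion or detection errors."""
    if len(traj) < 3:
        return None, None
    p1, p2, p3 = traj[-3:]
    d = lambda a, b: math.hypot(b[0] - a[0], b[1] - a[1])      # Euclidean distance
    ang = lambda a, b: math.atan2(b[1] - a[1], b[0] - a[0])    # vector angle
    f1 = d(p2, p3) - d(p1, p2)       # acceleration: change in step length
    f3 = ang(p2, p3) - ang(p1, p2)   # rotation: change of heading angle
    return f1, f3
```

For example, a person moving (0,0) → (1,0) → (3,0) speeds up in a straight line: f1 = 1.0 and f3 = 0.0.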

Sensor feature extraction
We assume that our wearable IoT devices have built-in inertial sensors, including an accelerometer, a gyroscope, and a geomagnetic field sensor. This allows us to monitor and retrieve users' motion-related features. For each wearable device j and its inertial data I_t(j) at time t, we derive the same three features as follows.
1. Acceleration feature F^S_t(j).f1: We obtain the acceleration feature directly from the 3-axis accelerometer after removing the gravity component.
2. Orientation feature F^S_t(j).f2: Device orientation is the position of the device in space relative to the Earth's magnetic north pole. We obtain the orientation feature directly from the geomagnetic field sensor.
3. Rotation feature F^S_t(j).f3: If wearable IoT device j is a newly appeared device or |F^S_t(j).f2 − F^S_{t−1}(j).f2| is too small, the rotation feature F^S_t(j).f3 is set to 0 to eliminate measurement errors of the inertial sensor data. Otherwise, it is set to F^S_t(j).f2 − F^S_{t−1}(j).f2.
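The noise-suppression rule for the sensor-side rotation feature can be sketched as below; the threshold `eps` is a hypothetical value that would be tuned experimentally against the inertial sensor's noise floor.

```python
def rotation_feature(curr_orient, prev_orient, eps=1.0):
    """Rotation feature f3 from two successive orientation readings (degrees).

    A newly appeared device (no previous reading) or a change below the
    eps threshold (an assumed, tunable value) is treated as sensor noise
    and suppressed to 0, as described above."""
    if prev_orient is None or abs(curr_orient - prev_orient) < eps:
        return 0.0
    return curr_orient - prev_orient
```

For instance, a jitter of 0.5° yields 0.0, while a genuine 20° turn is passed through unchanged.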

Data fusion
The data fusion module calculates the correlation of video and sensor data, which consists of two tasks: Similarity Scoring and Pairing. First, we measure the similarity score of each pair of normalized motion feature series, F^V_t(i).f_k and F^S_t(j).f_k, k = 1..r. The scores for all features are then combined. By quantifying all-pair similarity scores, we are able to couple human objects with their IoT information. To achieve this goal, we need to address the asynchronous time and warped data problems caused by distributed sensors.
1. Time synchronization: Cameras and wearable devices have their own clocks and sampling frequencies. Communications may also suffer from errors and delays, making it hard to synchronize all these data.
2. Warped data series: In wireless communications, data loss is inevitable due to channel noise and network congestion. In the worst case, some data may be corrupted and cannot be recovered.
DTW [44] is an algorithm for measuring similarity between two time series, i.e., sequences of observations measured at successive times, spaced by time intervals. With DTW, we can compare our time-specific feature points from videos and sensors regardless of the above two problems. With this, the Similarity Scoring sub-module generates a similarity matrix indexed by each pair of F^V_t(i) and F^S_t(j). To evaluate the difference between two time-series data X(i) and Y(j), i = 1, …, |X| and j = 1, …, |Y|, DTW is derived based on dynamic programming:

D(i, j) = d(X(i), Y(j)) + min{D(i − 1, j), D(i, j − 1), D(i − 1, j − 1)},

where D(i, j) is the distance between X[1 : i] and Y[1 : j], with the initial condition D(1, 1) = d(X(1), Y(1)), and d(X(i), Y(j)) is the distance function between two data points. The goal of DTW is to find a mapping path between X[1 : i] and Y[1 : j] such that the total distance on this mapping path is minimized.
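A minimal sketch of the DTW distance under the standard dynamic-programming recurrence, with an absolute-difference point distance as an assumed default d():

```python
def dtw_distance(X, Y, d=lambda a, b: abs(a - b)):
    """DTW distance between series X and Y via dynamic programming:
    each cell adds the point distance to the cheapest of the three
    predecessor cells, allowing one series to stretch against the other."""
    INF = float("inf")
    n, m = len(X), len(Y)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0  # so that D(1,1) = d(X(1), Y(1)), the initial condition
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = d(X[i - 1], Y[j - 1]) + min(
                D[i - 1][j],       # X advances, Y repeats a point
                D[i][j - 1],       # Y advances, X repeats a point
                D[i - 1][j - 1])   # both advance
    return D[n][m]
```

Because of warping, a series with a duplicated sample still matches perfectly: `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0, which is exactly what makes DTW robust to the asynchrony and data-loss problems above.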
In our case, we evaluate the similarity for each feature between every pair of F^V_t(i) and F^S_t(j). These similarity scores must be normalized first, so that each feature is treated fairly; the normalization factors are usually derived from experiments. Let us denote by D(F^V_t(i), F^S_t(j)) the combined similarity score (over all r features) between F^V_t(i) and F^S_t(j), where a smaller value represents higher similarity. The Pairing sub-module uses these similarity scores and matches human objects with their wearable devices by a greedy approach. Specifically, for each human object i, we find its best matched wearable device as

Match(i) = argmin_j D(F^V_t(i), F^S_t(j)).

Then, the matched pair (i, Match(i)) is added to a set PAIR. This process is repeated to find the next pair (already paired objects and devices are excluded from further consideration). The final set PAIR is the output. The pairing results couple human objects with their IoT devices, achieving our goal of tagging wearable device data on human objects in drone-recorded videos.
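The greedy pairing can be sketched as below. This version repeatedly takes the globally smallest unmatched score, which is one reasonable reading of the greedy approach described above; `score` is assumed to be a dict mapping (human object, device) pairs to combined similarity scores, where smaller means more similar.

```python
def greedy_pairing(score):
    """Greedy matching on the all-pair similarity matrix.

    score: dict mapping (i, j) -> D(F_t^V(i), F_t^S(j)), smaller = more similar.
    Returns the list of matched (human object, device) pairs."""
    pairs, used_i, used_j = [], set(), set()
    # Visit candidate pairs from most to least similar; each human object
    # and each device can be matched at most once.
    for (i, j) in sorted(score, key=score.get):
        if i not in used_i and j not in used_j:
            pairs.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return pairs

# Hypothetical scores for two human objects and two wearables.
scores = {("h1", "w1"): 0.2, ("h1", "w2"): 0.9,
          ("h2", "w1"): 0.5, ("h2", "w2"): 0.3}
print(greedy_pairing(scores))  # [('h1', 'w1'), ('h2', 'w2')]
```

Note that once ("h1", "w1") is taken, "w1" is no longer available to "h2", so "h2" falls through to its next-best device, matching the exclusion rule described above.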

Performance evaluation
We implement a prototype aerial surveillance system with the following components: (i) a DJI Spark drone, which has a camera with a 1/2.3-inch CMOS sensor of 11.8-megapixel resolution, a viewing angle of 81.9°, and a sample rate of 30 fps, (ii) wearable devices simulated by smartphones (HTC 10) running the Android platform with their sensors' sample rate set to 30 Hz, and (iii) a server equipped with an Intel Core i7-8700HQ processor, 32 GB RAM, and a GeForce GTX 1080 Ti GPU to conduct data fusion.
The experimental scene is shown in Fig. 7. We place four ArUco markers of size 42 cm × 42 cm on a square ground of 5 m × 5 m. A number of people walk in the square, all wearing wearable devices on their chests. An aerial camera keeps monitoring the square and continuously recording videos. Figure 8 shows the eight experimental cases.

Experimental results

Figure 9 shows the PID accuracy for the above eight cases. Figure 10 shows some visualization results, where (a) contains two people walking in interleaving paths, (b) contains two people walking in random paths, (c) contains four people walking in random paths, and (d) is the case of the drone flying within a 67° angle.

Varying walking paths
As shown in Fig. 8, we design four types of walking paths: circular (Case 1), non-interleaving (Case 2), interleaving (Case 3), and random (Case 4). In the interleaving (Fig. 10a) and random walk (Fig. 10b) experiments, the walking paths of volunteers sometimes interleave, causing occlusion in the videos from time to time. However, since our data fusion model considers long-term inertial data, it is still possible to obtain correct pairing. As shown in Fig. 9, the average MOTA reaches 99.5% for Cases 1, 2, 3, and 4.

Varying number of people
We also change the number of people walking in the field and analyze the fusion results. As shown in Fig. 9 (Cases 4, 5, and 6), the number of people walking in the field can affect the accuracy of our system. The accuracy of our system is 93.8% for four people walking randomly (Case 6). Figure 10c shows that when people walk in a crowd and cross each other, it is difficult to identify them by video only. But our system is still able to get correct pairing by taking advantage of sensor data.

Varying drone view angle and position
Further, we design different drone view angles to validate the robustness of the system under the following flight conditions: (i) the drone stays at the center position and continuously records videos for 1 min (Case 7), (ii) the drone flies within a 37° angle range (Case 4), that is, the drone flies 5 m horizontally while maintaining a height of 5 m, and (iii) the drone flies within a 67° angle range (Case 8), that is, the drone flies 10 m horizontally while maintaining a height of 5 m. Figure 10d shows some snapshots.

Conclusions
To conclude, this work presents a novel approach to fusing visual data from a drone camera and inertial data from wearable devices for person identification and tracking. We build a prototype to test the feasibility of the proposed system in a practical setting. To the best of our knowledge, this is the first work addressing tagging wearable IoT information on drone videos. We have tested the changing position and view angle of a drone. Our approach requires no cumbersome data labeling process and does not rely on tracking human biological features. Our experiments show that even under occlusions or appearance changes, PID is still possible. The proposed PID and tracking system requires authorization to access IoT data, and thus has potential applications in military training and construction site monitoring, where people are required to wear badges. In the future, we would like to explore multiple intelligent IoT devices in larger surveillance regions with multiple drones under more complex environments.