Benchmarking YOLOv5 and YOLOv7 models with DeepSORT for droplet tracking applications

Tracking droplets in microfluidics is a challenging task. The difficulty arises in choosing a tool to analyze general microfluidic videos to infer physical quantities. The state-of-the-art object detector algorithm You Only Look Once (YOLO) and the object tracking algorithm Simple Online and Realtime Tracking with a Deep Association Metric (DeepSORT) are customizable for droplet identification and tracking. The customization includes training YOLO and DeepSORT networks to identify and track the objects of interest. We trained several YOLOv5 and YOLOv7 models and the DeepSORT network for droplet identification and tracking from microfluidic experimental videos. We compare the performance of the droplet tracking applications with YOLOv5 and YOLOv7 in terms of training time and time to analyze a given video across various hardware configurations. Although the latest YOLOv7 is about 10% faster, real-time tracking is achieved only by the lighter YOLO models on an RTX 3070 Ti GPU machine, because the DeepSORT algorithm adds a significant droplet tracking cost. This work is a benchmark study of the YOLOv5 and YOLOv7 networks with DeepSORT in terms of the training time and inference time for a custom dataset of microfluidic droplets.


I. INTRODUCTION
A subset of machine learning-based tools, called computer vision tools, deals with object identification, classification, and tracking in images or videos. State-of-the-art computer vision tools can read handwritten text [1][2][3][4], find objects in images [5][6][7][8], find product defects 9,10, make medical diagnoses from medical images with accuracy surpassing humans 11,12, and track objects 13,14, just to name a few capabilities. In the last few years, they have been increasingly consolidating their place across scientific fields and industries as reliable and fast analysis methods.
Computer vision tools have shown remarkable success in studying microfluidic systems. Artificial neural networks, for example, can predict physical observables, such as flow rate, chemical composition, etc., from images of microfluidic systems with high accuracy, thus reducing the hardware requirements to measure these quantities in a microfluidics experiment 15,16. More recently, a convolutional autoencoder model was trained to predict stable versus unstable droplets from their shapes within a concentrated emulsion 17.
Another application of computer vision tools in microfluidics is tracking droplets in experiments and simulation studies, such as the ones in Ref. [18][19][20][21]. Droplet recognition and tracking can yield rich information without needing human intervention. For example, counting droplet numbers, measuring the flow rate, observing the droplet size distribution and computing statistical quantities are cumbersome when droplets must be marked manually across several frames. Two natural questions, while using computer vision tools for image analysis, are i) how accurate the application is in terms of finding and tracking the objects, and ii) how fast the application is in analyzing each image. A typical digital camera operates at 30 frames per second (fps), thus one challenge is to analyze the images at the same or a higher rate for real-time applications.
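As an illustration of the kind of post-processing that automated tracking enables, the following minimal Python sketch derives a droplet count per frame and an approximate size distribution from per-frame bounding boxes. The data layout (a list of (track_id, x, y, w, h) tuples per frame) and the numbers are hypothetical stand-ins for the output of a detector/tracker pipeline, not the analysis code used in this work.

```python
import statistics
from collections import defaultdict

# Hypothetical tracker output: one list per frame of (track_id, x, y, w, h)
# tuples, with box positions and sizes in pixels (illustrative values only).
frames = [
    [(1, 102, 55, 38, 36), (2, 180, 57, 40, 39)],
    [(1, 110, 55, 38, 37), (2, 188, 56, 41, 39), (3, 260, 58, 37, 36)],
]

# Droplet count per frame.
droplets_per_frame = [len(frame) for frame in frames]

# Approximate each droplet's diameter by the mean of its bounding box width
# and height, averaged over all frames in which the track appears.
sizes = defaultdict(list)
for frame in frames:
    for track_id, _, _, w, h in frame:
        sizes[track_id].append(0.5 * (w + h))

diameters = [statistics.mean(values) for values in sizes.values()]

print("droplets per frame:", droplets_per_frame)
print("mean droplet diameter (px): %.1f" % statistics.mean(diameters))
```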
Along with a few other algorithms, You Only Look Once (YOLO) has the capability to analyze images at a few hundred frames per second 22,23 and is designed to detect 80 classes of objects in a given image. The very first version of YOLO was introduced back in 2015, and the subsequent versions have focused on making the algorithm faster and more accurate at detecting objects. The latest release of YOLO is its 7th version 24, with a reported significant gain in speed and accuracy for object detection in standard datasets containing several objects in realistic scenes. In our previous study, we trained YOLO version 5 and DeepSORT for real-time droplet identification and tracking in microfluidic experiments and simulations 25,26, and we reported the image analysis speed for various YOLOv5 models. In this work, we train the latest YOLOv7 models along with DeepSORT and compare their performance and image analysis speed with those of the YOLOv5 models.
In particular, this paper studies and compares training time, droplet detection accuracy and inference time for an application that combines YOLOv5/YOLOv7 with DeepSORT for droplet recognition and tracking.

II. EXPERIMENTAL METHODS
The images analyzed in this study were obtained from a microfluidic device for the generation of droplets exploiting a flow-focusing configuration (scheme of the device in Fig. 1).
The device has two inlets for oil flow, one inlet for the flow of an aqueous solution, a Y-shaped junction for droplet generation and an expansion channel. The latter is connected to an outlet for collecting the two-phase emulsion. The device was realized by using a stereolithography system (Envisiontec, Micro Plus HD) and the E-shell® 600 (Envisiontec) as pre-polymer. The continuous phase consists of silicone oil (Sigma Aldrich, oil viscosity 350 cSt at 25°C), while an aqueous solution constitutes the dispersed phase. The latter was made by dissolving 7 mg of a black pigment (Sigma Aldrich, Brilliant Black BN) in 1 mL of distilled water. Both phases were injected through the inlets at constant flow rates by a programmable syringe pump with two independent channels (Harvard Apparatus, model 33). The images analyzed in this study were obtained by using flow rates of 10 µl/min and 150 µl/min for the dispersed phase and the continuous phase, respectively. The droplet formation is imaged by using a stereomicroscope (Leica, MZ 16 FA) and a fast camera (Photron, Fastcam APX RS). The fast camera acquired the images at 3000 frames per second (fps). This image capture rate is far higher than any present algorithm's real-time object detection capabilities. The image playback rate is set to 30 fps. The sequences of images were stored as AVI video files. Later, images from the video were used to train the YOLO and DeepSORT models as described in the following section.

III. TRAINING YOLOV5 AND YOLOV7 MODELS
The steps required to train YOLOv5 and YOLOv7 are identical. First, a training dataset is prepared by manually annotating 1000 images taken from a microfluidics experiment as described in Sec. II. Each image in this dataset contains approximately 13 to 14 droplets. One example from the training dataset is shown in Fig. 2. The droplets in these images are identified, and the dimensions of a rectangle that fully covers each droplet are noted in a separate text file called the label file. We used the PyTorch implementations of YOLOv5 27 and YOLOv7 28 to train several YOLO models on a single node of an HPC system. During the training phase, the quality of the YOLOv5 and YOLOv7 models is measured with the well-known mean average precision (mAP), calculated with an Intersection over Union (IoU) threshold of 0.5 (see Fig. 4). For both versions, the mAP value quickly saturates to unity after training for 20 epochs. Similarly, the average of mAP calculated with IoU thresholds from 0.5 to 0.95 in steps of 0.05 shows no significant differences between the YOLOv5 and YOLOv7 models.
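For reference, label files in the YOLO format typically store one line per droplet of the form `class x_center y_center width height`, with coordinates normalised by the image dimensions, and the mAP at an IoU threshold of 0.5 counts a prediction as correct when it overlaps a ground-truth box with IoU of at least 0.5. The sketch below illustrates that IoU computation for axis-aligned boxes; it is a minimal illustration of the criterion, not the evaluation code behind the reported mAP values.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted droplet box counts as a true positive at the mAP@0.5 threshold
# when it overlaps a ground-truth box with IoU >= 0.5 (illustrative boxes).
pred, truth = (100, 50, 140, 88), (102, 52, 141, 90)
print(iou(pred, truth) >= 0.5)  # True, IoU ~ 0.84
```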

IV. INFERENCE WITH YOLO AND DEEPSORT
After the models are trained, they can be deployed for real-world applications. One challenging milestone for any computer vision application is to use it in real time, i.e. when the image analysis speed exceeds 30 fps. YOLO models on their own do deliver real-time performance. In Tables II and III, we show the total time for droplet identification and tracking, combining YOLOv5/YOLOv7 with DeepSORT, on two hardware configurations.
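A minimal sketch of how such a combined detection-and-tracking pipeline can be assembled is shown below. It loads custom droplet weights through torch.hub and feeds the detections to the DeepSort tracker from the deep-sort-realtime package; the weight file name, video path, and tracker settings are illustrative assumptions, and the original application may use a different DeepSORT implementation.

```python
import cv2
import torch
from deep_sort_realtime.deepsort_tracker import DeepSort

# Load custom droplet weights (the file name is a placeholder).
model = torch.hub.load("ultralytics/yolov5", "custom", path="droplet_yolov5s.pt")
tracker = DeepSort(max_age=30)

cap = cv2.VideoCapture("droplets.avi")  # placeholder video path
while True:
    ok, frame = cap.read()
    if not ok:
        break

    # Detection: each row of results.xyxy[0] is (x1, y1, x2, y2, confidence, class).
    results = model(frame)
    detections = [
        ([x1, y1, x2 - x1, y2 - y1], conf, int(cls))
        for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist()
    ]

    # Tracking: DeepSORT assigns a persistent ID to each droplet across frames.
    tracks = tracker.update_tracks(detections, frame=frame)
    for track in tracks:
        if track.is_confirmed():
            print(track.track_id, track.to_ltrb())

cap.release()
```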
The benchmarking study was carried out on an MSI GS77 Stealth laptop with an i7-12700H CPU, 32 GB RAM, and an NVIDIA RTX 3070 Ti GPU with 8 GB VRAM. Running on the GPU, we observe an improvement of approximately 10% in inference speed for YOLOv7 models over their YOLOv5 counterparts, as one would expect. Moreover, we report detailed computational costs for the object detection and object tracking routines and the overall performance of the combined application. The lighter YOLO models identify objects in far less time than DeepSORT takes to track them. However, the object identification time increases with the increasing complexity of the object detection networks. Thus, real-time tracking requires choosing the right YOLO network and hardware configuration, at the cost of some bounding box accuracy for the lighter models.
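To separate the two costs, the detection and tracking stages can be timed independently for each frame, for example with time.perf_counter. The sketch below assumes the model and tracker objects from the previous snippet and is only meant to illustrate how such per-stage timings and the effective frame rate can be obtained; it is not the benchmarking code itself.

```python
import time

det_times, trk_times = [], []

def process(frame):
    """Process one frame, timing the detection and tracking stages separately."""
    t0 = time.perf_counter()
    results = model(frame)  # YOLO detection
    t1 = time.perf_counter()
    detections = [
        ([x1, y1, x2 - x1, y2 - y1], conf, int(cls))
        for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist()
    ]
    tracks = tracker.update_tracks(detections, frame=frame)  # DeepSORT tracking
    t2 = time.perf_counter()

    det_times.append(t1 - t0)
    trk_times.append(t2 - t1)
    return tracks

def report():
    """Report the mean per-stage cost and the effective frame rate;
    real-time operation requires the total per-frame time to stay below 1/30 s."""
    det = sum(det_times) / len(det_times)
    trk = sum(trk_times) / len(trk_times)
    print(f"detection {det * 1e3:.1f} ms, tracking {trk * 1e3:.1f} ms, "
          f"{1.0 / (det + trk):.1f} fps")
```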

FIG. 1: Schematic representation of the microfluidic device used for the droplet generation.
FIG. 3: Loss function for YOLOv5 and YOLOv7 as the training progresses. Figure legends are the same as in Fig. 4.

TABLE I: YOLO models' training time on the same machine with an identical training dataset. The YOLO model descriptions can be found in Ref. 27 for v5 and in Ref. 24 for v7.

TABLE II: Inference time per frame - CPU.

TABLE III: Inference time per frame - GPU.