1 Introduction

Replenishing stock in precise quantities is key in the medical business. To keep customers using and reordering goods, a salesperson needs to visit each hospital and count stock frequently; on the other hand, traveling to hospitals in some areas is inconvenient. Even when there is a central system through which hospitals can report usage to the company, the usage is sometimes not reported immediately. The COVID-19 situation has introduced further limitations, as many strict rules imposed by hospitals must be followed. According to a medical supply company, the lead time to replenish goods during COVID-19 is longer than in a normal situation. This problem affects the company in many ways, such as losing the chance to sell goods to the hospital again or leaving the hospital dissatisfied. Solving this problem requires a system that can count quantities and detect lot numbers from images of the remaining stock (taken by the hospital's officers or by salespeople), reducing the need to visit hospitals and allowing the sales company to learn the remaining stock quickly.

In this paper, our contribution is to use high-accuracy text detection and recognition to precisely recognize the quantity and lot number of goods remaining in a hospital, helping salespeople visit hospitals less often but more efficiently. We first process an input image to detect the word "LOT" enclosed in a rectangular box. The regions containing lot numbers are then inferred relative to the positions of the detected "LOT" rectangles. After these lot number regions are cropped out of the input image, Optical Character Recognition (OCR) is applied to read the lot numbers and output them as text sequences. Regarding OCR for reading text from images of textual documents, previous works include OCR for historical documents [1] and automatic detection of books [2]. However, these works have purposes different from ours: our goal is to recognize the quantity and lot number of goods in the hospital context so that users can compare our results with the sales company's database.

2 Related work

Apart from demand forecasting [3], stock or inventory counting is another important problem in Supply Chain Management (SCM). Because each inventory setting is unique, previous solutions span widely, from using a multi-robot system [4], to using an automatic camera to record visual inventory for later manual counting by humans [5], to using vision-based template matching to locate and count target objects [6, 7]. Our work differs from these in that the counting is done on stock of small medical products stored in a hospital's stock room, requiring neither robots nor cameras as in large-scale inventories. Also, our goal is to detect and recognize lot numbers printed on each box package so that different lots of medical products can be counted accurately. The fixed pattern in which the lot number is printed on the box (as shown in Fig. 1) suggests that general-purpose visual template matching techniques such as [6, 7] may not be necessary.

According to the surveys [8, 9], solutions for obtaining text from natural images can be categorized into two approaches. The first is a step-wise method consisting of a series of processing steps including detection, segmentation, and recognition; the other is an integrated method that combines several steps into a unified framework. In this paper, the step-wise approach is applied, as it lets us design our own processing steps and assemble several existing methods to suit our requirements.

In fact, several previous works have recognized objects using computer vision, such as hand motion recognition [10], human motion analysis for recognition from 3D gait signatures [11], and color recognition using a Bayesian classifier [12]. To recognize lot numbers printed on medical product packages, logo or object detection is one possible solution. For complicated images with many visual variations, such as perspective distortion, multi-colored text, artistic fonts, uneven lighting, or heavy shadows, machine learning techniques and deep neural networks (i.e., deep learning) are reported to be more effective than handcrafted or rule-based techniques. Yufeng and Bo [13] proposed a solution to detect logos on bicycles using a Haar classifier and AdaBoost; despite its high recognition rate, its precision is low compared to the local binary pattern algorithm. Since 2012, deep learning methods have gained much attention [14]. Works such as [15, 16] used vision-based deep learning to obtain logos or text from an image with very promising results. There is also the work of [17], which compared several deep learning-based object detector architectures and concluded that Swin Transformer achieves the highest average precision (AP) at 57.70%, followed by Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution (DetectoRS) (53.30%), EfficientDet-D2 (43.00%), and YOLOv4 (43.00%).

However, as deep learning methods often need huge datasets, they are not suitable for our medical stock situation, where images are scarce. Unlike deep learning, traditional methods do not require large datasets and provide good results in some situations. Kuznetsov and Savchenko [18] compared a custom deep learning algorithm based on DEep Local Features (DELF) with traditional methods (neither machine learning nor deep learning) such as the scale-invariant feature transform (SIFT), accelerated KAZE (AKAZE), and binary robust invariant scalable keypoints (BRISK) on a logo detection task. The results reveal that SIFT achieves the highest precision (0.89), BRISK achieves the highest recall (0.68), and AKAZE achieves a good overall result (0.62 precision and 0.64 recall). In conclusion, our work experiments with rule-based techniques to detect lot numbers from images and feed them to OCR.

3 Proposed method

In this work, extracting lot numbers from an input image is done by Algorithm 1. First, RotateImage rotates the input image to horizontal orientation using the orientation detected by Tesseract [19], one of the popular open-source OCR tools for converting an image into textual information. Then, in the rotated image, all squares are detected by FindSquares, as shown in Fig. 2; this process includes converting the rotated image to grayscale, thresholding with Otsu's method, applying morphological operations, performing Canny edge detection, and finding contours. All detected squares are then classified by FilterCandidate into red squares (without the word "LOT") or green squares (with the word "LOT"), as shown in Fig. 3. To get the lot numbers written next to the detected green squares, each green square is extended (stretched) to cover its corresponding lot number, as shown in Fig. 4 (FindTextRegionWithLotNumber); the extended square is then cropped out of the image and straightened using image warping. Finally, Tesseract OCR is applied to each resulting region, and the output lot numbers are read as shown in Fig. 5. A minimal sketch of these steps is given after Algorithm 1 below.

Algorithm 1 The lot number extraction pipeline
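To make the pipeline concrete, the following is a minimal sketch of Algorithm 1, assuming OpenCV and the pytesseract wrapper for Tesseract. The function names mirror the steps above, but all thresholds and kernel sizes are illustrative, the lot number is assumed to be printed to the right of its "LOT" box, and the perspective warp is simplified to an axis-aligned crop; this is not our exact production implementation.

```python
# A minimal sketch of Algorithm 1 (illustrative parameters, not production code).
import cv2
import pytesseract

def rotate_image(image_bgr):
    """RotateImage: level the image using Tesseract's orientation detection (OSD)."""
    osd = pytesseract.image_to_osd(image_bgr, output_type=pytesseract.Output.DICT)
    angle = osd["rotate"]  # clockwise degrees needed to make the text upright
    codes = {90: cv2.ROTATE_90_CLOCKWISE, 180: cv2.ROTATE_180,
             270: cv2.ROTATE_90_COUNTERCLOCKWISE}
    return cv2.rotate(image_bgr, codes[angle]) if angle in codes else image_bgr

def find_squares(image_bgr):
    """FindSquares: grayscale -> Otsu -> morphology -> Canny -> contours."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    edges = cv2.Canny(closed, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    squares = []
    for c in contours:
        approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
        # Keep convex quadrilaterals as candidate "LOT" rectangles.
        if len(approx) == 4 and cv2.isContourConvex(approx):
            squares.append(approx.reshape(4, 2))
    return squares

def filter_candidates(image_bgr, squares):
    """FilterCandidate: keep only squares whose cropped patch reads as 'LOT'."""
    green = []
    for quad in squares:
        x, y, w, h = cv2.boundingRect(quad)
        patch = image_bgr[y:y + h, x:x + w]
        text = pytesseract.image_to_string(patch, config="--psm 7").strip().upper()
        if "LOT" in text:
            green.append((x, y, w, h))
    return green

def read_lot_numbers(image_bgr, green_boxes, stretch=4.0):
    """FindTextRegionWithLotNumber + OCR: extend each 'LOT' box sideways
    (assumed printing direction: to the right) and read the lot number."""
    results = []
    for x, y, w, h in green_boxes:
        region = image_bgr[y:y + h, x + w:x + w + int(stretch * w)]
        results.append(pytesseract.image_to_string(region, config="--psm 7").strip())
    return results
```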

4 Experimental results and discussion

To evaluate our prototype system in real production, we host it on the Amazon Web Services (AWS) cloud so that users can access it through our website. On the website, users can upload an input image; after all processing finishes, the result is displayed, and users can export it as a CSV (Comma-Separated Values) file for further use in their stock counting database. Example results are shown in Fig. 6.
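As an illustration of the export step, the following is a minimal sketch that writes recognized lot numbers and quantities to a CSV file; the column layout and function name are assumptions, since the exact file format is not described here.

```python
# A minimal sketch of the CSV export step (column layout is an assumption).
import csv

def export_results(lot_counts, path="stock_counts.csv"):
    """Write {lot_number: quantity} pairs to a CSV file for the stock database."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["lot_number", "quantity"])
        for lot, qty in sorted(lot_counts.items()):
            writer.writerow([lot, qty])

export_results({"A1234": 3, "B5678": 1})  # example lot numbers (made up)
```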

Fig. 1 An example of an input image; the goal is to detect and recognize all lot numbers written next to the "LOT" rectangles

Fig. 2 Text localization: all candidate contours found after rotating the image

Fig. 3 The filtering process separates candidates with and without the word "LOT" using two different colors

Fig. 4 Text segmentation keeps only squares containing the word "LOT" and extends each to include its lot number

Fig. 5 Text recognition uses the Tesseract library to convert the image into text

Fig. 6 The whole process demonstrated on our website hosted on the AWS cloud; recognized lot numbers are shown as summarized text and can be exported to a CSV file

The proposed system was evaluated on 43 images containing 240 lot numbers, and the overall accuracy was 84.17%. In addition, we divided the evaluation into two parts. (1) Detecting the word "LOT": accuracy here is the number of detected "LOT" words divided by the total number of "LOT" words. (2) Detecting lot numbers: accuracy here is measured at the word level; for example, if even one character in a lot number is wrongly recognized, that lot number is counted as a false prediction. We chose this approach because a lot number must be read exactly to be matched against the lot numbers in the database. The accuracy of our proposed method is 91.67% for detecting the word "LOT," 91.82% for detecting lot numbers, and 84.17% overall; the overall figure is consistent with the product of the two stage accuracies (91.67% × 91.82% ≈ 84.17%), since a lot number can be read only after its "LOT" box is detected. A sketch of the word-level metric follows.
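The following is a small sketch of the word-level metric; the helper name is illustrative, and predictions are assumed to be paired one-to-one with ground-truth lot numbers.

```python
# A sketch of the word-level accuracy used in part (2) of the evaluation.
def word_level_accuracy(predicted, ground_truth):
    # A lot number counts as correct only on an exact full-string match,
    # since an exact match is required to look it up in the stock database.
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)

# The overall accuracy is roughly the product of the two stage accuracies,
# because a lot number can only be read after its "LOT" box is detected:
assert abs(0.9167 * 0.9182 - 0.8417) < 1e-3
```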

After evaluation, we categorized the errors into four types. The first is the inability to detect some contours that contain the "LOT" text. Because we currently use the RETR_EXTERNAL contour retrieval mode in OpenCV (Open Source Computer Vision), the program ignores any inner contours, causing it to miss some regions of interest (ROIs) with the "LOT" text. We experimented with other retrieval modes such as RETR_LIST and RETR_CCOMP to include all contours, but they produced too many contours, making the program 9-10x slower than RETR_EXTERNAL despite a slight improvement in accuracy. To address this error further, we have to reduce the number of ROIs so that the program checks fewer regions for the text "LOT," for example by using an area threshold to filter out regions that are too big or too small (see the sketch below), by trying a binarization method designed specifically for document images [20], or by applying image denoising for better character recognition [21].
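A minimal sketch of that area-based filtering is shown below; the relative bounds (0.01% to 5% of the image area) are illustrative assumptions.

```python
# A sketch of area-threshold ROI filtering (bounds are assumptions).
import cv2

def filter_rois_by_area(contours, image_shape, lo=1e-4, hi=5e-2):
    """Drop contours that are too small or too large to be a 'LOT' box."""
    img_area = image_shape[0] * image_shape[1]
    return [c for c in contours
            if lo * img_area <= cv2.contourArea(c) <= hi * img_area]
```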

The second type of error is when OCR cannot read some "LOT" text. This error occurs after we obtain the list of ROIs, warp them, and feed the warped ROIs to Tesseract: it turns out that Tesseract cannot read some "LOT" text properly. The third type is when Tesseract OCR cannot detect the proper text orientation for some images; this happens when an image contains so little textual information that Tesseract cannot determine the orientation. The fourth type is when Tesseract OCR cannot read the lot number accurately, which is caused by low-quality input images, such as low resolution, noise, and poor lighting conditions. To resolve this, we may have to improve our preprocessing methods to enhance the input images, as sketched below.
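One possible preprocessing chain is sketched below, combining upscaling, denoising, and contrast normalization before OCR; the specific operations and parameter values are assumptions rather than an evaluated configuration.

```python
# A sketch of preprocessing to ease errors 2-4 (parameters are assumptions).
import cv2

def enhance_for_ocr(image_bgr, scale=2.0):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Upscale low-resolution crops so character strokes span more pixels.
    up = cv2.resize(gray, None, fx=scale, fy=scale,
                    interpolation=cv2.INTER_CUBIC)
    # Non-local means denoising removes sensor noise while keeping edges.
    den = cv2.fastNlMeansDenoising(up, h=10)
    # CLAHE evens out poor or uneven lighting before binarization.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(den)
```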

For the last three types of errors (types 2, 3, and 4), trying alternative OCR engines may help resolve or reduce them. According to [22], the top performing OCR engines are Google Cloud Vision and AWS Textract; however, this study uses Tesseract because it is a common, publicly available tool. For the error in detecting the word "LOT," we found that it was caused by the inability to detect some contours that contain the text (70%) and the inability of OCR to read the text "LOT" (30%). Hence, improving the contour detection with a more robust solution, such as deep learning-based object detection or a text-specific contour detection method [23], may significantly reduce this error. As for the errors in detecting the lot numbers, we could not discover any specific error pattern; for example, W → V?? (1 occurrence), W → VWW (1 occurrence), W → \A (1 occurrence), and W → \AJ (2 occurrences). Trying alternative OCR engines may help clarify this.

5 Conclusion and future work

This paper presents an approach to text detection and text recognition in the specific use case of medical stock counting. Our prototype system achieves an overall accuracy of 84.17%. However, some lot numbers still cannot be detected accurately due to low image quality. Our future work will focus on improving the algorithms to achieve higher accuracy and faster computation. One interesting alternative is to detect unstructured text in an input image using deep learning-based information extraction techniques such as Named-Entity Recognition (NER). Another future direction is to develop a GUI (Graphical User Interface) application and design a complete workflow for actual production deployment.