Abstract
Detecting text within medical images presents a formidable challenge in computer vision due to intricate backgrounds, dense text, and the possible presence of extreme aspect ratios. This paper introduces an effective and precise text detection system tailored to these challenges. The system incorporates an optimized segmentation module and a trainable post-processing method, and leverages a vision-language pre-training model (oCLIP). Specifically, our segmentation head integrates three essential components: the Feature Pyramid Network (FPN) module, which combines a residual structure and a channel attention mechanism; the Efficient Feature Enhancement Module (EFEM); and the Multi-Scale Feature Fusion with RSEConv (MSFM-RSE) module, designed for multi-scale feature fusion based on RSEConv. In the FPN module, the convolutional layers are replaced with RSEConv layers, which add a residual structure and a channel attention mechanism, further augmenting the representational capacity of the feature maps. The EFEM, designed as a cascaded U-shaped module, incorporates a spatial attention mechanism to introduce multi-level information, thereby enhancing segmentation performance. Subsequently, the MSFM-RSE fuses features from various depths and scales of the EFEM to generate the final features for segmentation. Additionally, a post-processing module employs a differentiable binarization strategy, allowing the segmentation network to determine the binarization threshold dynamically. Building on these improvements, we introduce a vision-language pre-training model that undergoes extensive training on various visual language understanding tasks. This pre-trained model acquires detailed visual and semantic representations, further reinforcing both accuracy and robustness in text detection when integrated with the segmentation module. The proposed model was evaluated through experiments on medical text image datasets, demonstrating excellent results. Multiple benchmark experiments validate its superior performance in comparison to existing methods. Code is available at: https://github.com/csworkcode/VLDBNet.
Introduction
The rapid advancement of computer technology has transformed information accessibility and produced an exponential surge of digital content. Textual information plays a pivotal role in understanding visual data, which makes text detection in images a critical task in computer vision with a wide array of applications. These applications span from text recognition and license plate identification to autonomous driving, digital archiving, and intelligent healthcare. Notably, recent years have witnessed significant progress driven by the rapid evolution of Convolutional Neural Networks (CNNs) [5, 12, 43]. Particularly significant is the precise and efficient extraction of textual information from medical images, which has profound practical implications. Text detection, which targets the localization of text instances within images, is an indispensable component of text reading. Despite substantial progress in recent years, text detection remains challenging due to the varied scales, irregular shapes, and extreme aspect ratios of text instances.
The field of text detection has witnessed notable progress due to recent advancements in CNN-based object detection [7, 9, 11] and segmentation techniques [8, 58]. These advancements have significantly enhanced the detection of scene text [27, 28, 32, 41, 45, 46] in general applications. However, specific text environments present unique challenges that necessitate further attention. For instance, images with dense text distribution, diverse content, and irregular shapes often pose difficulties for existing algorithms in accurately locating and segmenting all text instances. Medical images encompass a variety of healthcare records, including outpatient invoices, hospital invoices, drug tax receipts, and discharge summaries. Detecting text within medical images poses distinct challenges that require tailored approaches. Challenges, such as skewed text caused by non-flat surfaces during image scanning or capturing, dense text distribution, incomplete boundary detection due to the presence of different languages and symbols, and the introduction of irrelevant noise from external environmental factors, must be addressed.
To address these challenges, there has been extensive research into the joint learning of visual and textual representations within numerous Vision-Language Pre-training (VLP) techniques. This development significantly advances various Vision-Language (VL) tasks, including Visual Question Answering (VQA), Image–Text Retrieval, among others. Given its linguistic nature, Optical Character Recognition (OCR) inherently stands to benefit from the implementation of these VLP techniques [2, 18, 37]. The OCR Contrastive Language-Image Pre-training (oCLIP) [49] technique leverages textual information to learn effective visual text representations, aiming to enhance scene text detection and spotting.
It is pertinent to note that our previous work, ‘Multi-level Feature Enhancement Method For Medical Text Detection’, has been accepted but not yet formally published. This paper extends the methodology of that forthcoming paper in two respects. First, we enhance the FPN network by introducing residual structures and channel attention mechanisms, further augmenting the representational capability of the feature maps. Second, we conduct a more comprehensive experimental analysis of the proposed text detection network, evaluating the impact of the number of cascaded EFEM modules on the detection results.
In summary, our contributions are threefold:
- We have introduced two highly efficient modules, the Efficient Feature Enhancement Module (EFEM) and the Multi-Scale Feature Fusion Module with RSEConv (MSFM-RSE), based on residual structures and spatial channel attention mechanisms. These efficient modules significantly bolster the network’s feature representation.
- By leveraging vision-language pre-training models obtained through large-scale visual language understanding tasks, our system enhances the representation capabilities, leading to improved accuracy and robustness in text detection.
- Using the proposed method, we achieved competitive results in terms of efficiency, accuracy, F-score, and robustness across five publicly available scene text detection datasets. Particularly, our method demonstrated superior performance on medical image text datasets, outperforming others in this domain.
The rest of the paper is organized as follows: Sect. 2 reviews the relevant text detection methods. Our proposed approach is described in Sect. 3. The experiments are discussed and analyzed in Sect. 4. The conclusions are summarized in Sect. 5.
Related works
Text detection
In recent years, scene text detection has witnessed significant advancements due to the rapid progress in deep learning techniques. The field mainly explores two prominent approaches: segmentation-based methods and regression-based methods, extensively researched for their capacity to represent text instances with diverse shapes.
Regression-based text detectors
Regression-based methods for text detection are often influenced by general object detection frameworks like Faster R-CNN [38] and SSD [29]. TextBoxes [22] directly adjusts the anchor scales and convolution kernel shapes of SSD to manage text with extreme aspect ratios. TextBoxes++ [23] further refines by regressing quadrangles instead of horizontal bounding boxes for the detection of multi-oriented text. EAST [59] and Deep-Reg [16] are anchor-free methods employing pixel-level regression to detect multi-oriented text instances. RRPN [36] employs Faster R-CNN and innovates the rotation proposals in the RPN part to detect arbitrarily oriented text. RRD [25] extracts feature maps for text classification and regression using two separate branches to enhance long text detection. These methods have shown promising results in various text detection tasks. However, accurately representing irregular shapes, such as curved text, with precise bounding boxes remains a challenge for regression-based methods.
Segmentation-based text detectors
Segmentation-based methods draw significant inspiration from fully convolutional networks (FCN) [31]. Zhang et al. [57] were the first to employ FCN for extracting text blocks and detecting character candidates through MSER. Mask TextSpotter [34] detects arbitrary-shaped text instances following an instance segmentation approach derived from Mask R-CNN [13]. PSENet [45] introduced a progressive scale expansion technique by segmenting text instances using kernels of varying scales. This approach allows for better handling of text instances with varying sizes and aspect ratios. Tian et al. [42] proposed pixel embedding, which groups pixels based on the segmentation results, enabling more accurate text localization. Post-processing algorithms, such as geometric verification and text-line construction, are often applied to refine the segmentation results. While these methods achieve high accuracy, they may suffer from reduced inference speed due to the complexity of the post-processing steps.
Vision-language pre-training
In recent years, the integration of vision-language pre-training models [49] has shown great promise in various computer vision tasks, including text detection. Vision-language pre-training involves training models on large-scale visual language understanding tasks, enabling them to acquire rich visual and semantic representations. ViLBERT [33] and LXMERT [40] introduce a two-stream framework incorporating a vision-language co-attention module for cross-modal feature fusion. Conversely, VisualBERT [21], Unicoder-VL [20], and VL-BERT [39] adopt a single-stream framework. These pre-training models capture the contextual information and semantic understanding necessary for accurate text detection. By leveraging the power of vision-language pre-training, text detection systems can benefit from enhanced representation capabilities, leading to improved accuracy and robustness.
Our proposed approach exhibits superior performance in both speed and the detection of medical text and irregular shapes compared to existing rapid scene text detectors. Integrating an efficient segmentation module that merges the strengths of regression and segmentation-based methods, our method also incorporates a learnable post-processing technique, uniting traditional post-processing advantages with the efficiency of deep learning. Additionally, through the utilization of vision-language pre-training models, our approach attains heightened accuracy and robustness in text detection. To assess our approach, we conducted comprehensive experiments on diverse benchmark datasets, including challenging medical text datasets and datasets featuring arbitrary-shaped text. Results showcase our method’s outperformance of existing state-of-the-art text detectors in accuracy, speed, and robustness. This proposed approach displays significant potential for practical applications in medical image analysis, document processing, and intelligent information retrieval.
Methodology
Overall network architecture
The architecture of our proposed model, depicted in Fig. 1, integrates a vision-language pre-training model alongside four primary components: a feature extraction backbone, a feature enhancement backbone, a multi-scale feature fusion model, and a post-processing procedure. In the initial stage, the input image is processed by the vision-language pre-training backbone, which was previously trained on large-scale visual language understanding tasks. We first pre-train the vision-language model on SynthText with full annotations, and then transfer the backbone weights to fine-tune our proposed model on real datasets. This backbone acquires detailed visual and semantic representations, serving as a robust basis for subsequent stages (refer to Fig. 1a, b). Next, the input image is fed into an FPN-RSEConv structure to obtain multi-level features while adjusting the channel number of the feature maps (refer to Fig. 1c, d). The pyramid features are then up-sampled to a uniform scale and fed into the Efficient Feature Enhancement Module (EFEM). The EFEM is structured in a cascaded manner and has a low computational cost. Positioned behind the backbone network, it enhances the expressive capability of features across various scales (as illustrated in Fig. 1e, f). Following this, the MSFM-RSE fuses the features produced by the EFEMs at different depths into a final feature representation for segmentation (depicted in Fig. 1g). The resulting feature F is then used to predict both the probability map (P) and the threshold map (T). Subsequently, the approximate binary map (\({{\hat{B}}}\)) is computed from the probability map and feature F (as depicted in Fig. 1h–k).
Throughout the training phase, supervision is applied to the probability map, threshold map, and approximate binary map. Remarkably, the probability map and approximate binary map undergo identical supervision. During the inference phase, acquiring the bounding boxes is seamless, achievable from either the approximate binary map or the probability map through a dedicated box formulation module.
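The differentiable binarization step can be illustrated with a minimal sketch. It assumes the standard soft step function \({\hat{B}} = 1 / (1 + e^{-k(P - T)})\) from the differentiable binarization literature; the amplifying factor k = 50 and the toy map values are assumptions for illustration and are not stated in this paper.

```python
import math

def approximate_binary_map(prob_map, thresh_map, k=50.0):
    """Element-wise soft binarization: B_hat = 1 / (1 + exp(-k * (P - T))).

    prob_map and thresh_map are 2D lists of the same shape with values in [0, 1].
    k is an assumed amplifying factor (hypothetical default, not from the paper).
    """
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]

# Toy 2x2 maps (made-up values).
P = [[0.9, 0.1], [0.5, 0.3]]
T = [[0.3, 0.3], [0.5, 0.3]]
B_hat = approximate_binary_map(P, T)
```

Pixels whose probability clearly exceeds the local threshold are pushed towards 1, pixels clearly below it towards 0, and the function stays differentiable in both P and T, which is what allows the threshold to be learned jointly with the segmentation network.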
Efficient feature enhancement module
The EFEM, illustrated as a U-shaped module (see Fig. 2), is instrumental in augmenting features of varied scales within our proposed text detection system. It comprises three crucial phases working in unison to achieve optimal performance.
In the up-scale enhancement phase, the feature maps undergo iterative refinement at strides of 32, 16, 8, and 4. This iterative process gradually enriches the features at each scale with more detailed information. It enables the capture of fine-grained details, thereby enhancing the discriminative capability of the features.
A fundamental component within the EFEM is the Pyramid Squeeze Attention (PSA) module [54], functioning as an efficient attention block. This module adeptly extracts multi-scale spatial information, enabling a nuanced understanding of the intricate relationships between various feature map regions. By establishing extensive channel dependencies, it captures both global context and local details, significantly boosting the features’ discriminative strength and overall text detection accuracy.
In the down-scale phase, the EFEM leverages the feature pyramid generated during up-scale enhancement. Starting from a 4-stride, this iterative process continues until it reaches a 32-stride. This down-scale enhancement further refines the features by amalgamating multi-level details from lower to higher scales. By amalgamating low-level specifics with high-level semantic data, the EFEM ensures that the final feature representation is comprehensive, informative, and proficient in capturing diverse text instance characteristics.
Compared to existing methods, the EFEM offers two key advantages. First, its ability to be cascaded multiple times for feature fusion and expanded receptive fields allows for effective handling of text instances of various sizes and aspect ratios, enhancing adaptability to diverse text scenarios. Second, its computational efficiency makes it suitable for real-world applications with limited resources. This efficiency ensures real-time functionality, facilitating practical deployment in various scenarios.
Multi-scale feature fusion with RSEConv
The ‘Multi-Scale Feature Fusion with RSEConv’ (MSFM-RSE) serves as a pivotal module in our proposed text detection system. It is designed to fuse features extracted from diverse scales. By employing RSEConv, a specialized convolutional layer incorporating channel attention mechanisms and residual architecture, this module harmonizes the high-level, multi-scale feature representations derived from the Efficient Feature Enhancement Module (EFEM). The MSFM-RSE efficiently merges and refines these features, creating a comprehensive, multi-scale feature representation suited to text detection. This enhancement augments the system’s capability to identify diverse text characteristics across different complexities and scales. Figure 3 illustrates the schematic diagram of RSEConv.
Through the integration of RSEConv layers, the MSFM-RSE skillfully merges high-level features produced by the EFEM, ensuring a cohesive and informative feature set that substantially strengthens the system’s capability to discern complex text instances across various scales. This fusion, facilitated by the residual structures and channel attention mechanisms, optimizes feature refinement for accurate text detection across different complexities and scales, showcasing the system’s adaptability in addressing multifaceted challenges encountered in text detection across diverse contexts.
Figure 4 illustrates the schematic diagram of this module. Specifically, the process begins with an element-wise addition to merge the feature maps from corresponding scales. This step combines local details and global context, capturing both fine-grained information and overall scene understanding. Subsequently, the resulting output is processed through RSEConv, which also adjusts the channel number of the feature maps. Following this, the resulting feature maps undergo upsampling and concatenation, producing the final feature map.
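The fusion order described above (add same-scale maps, refine, upsample, concatenate) can be sketched on plain Python lists. This is only an illustration of the data flow under simplifying assumptions: each feature map has a single channel, upsampling is nearest-neighbour, and the `refine` argument is a hypothetical placeholder standing in for RSEConv (identity by default). The function names are made up for this sketch.

```python
def add_maps(a, b):
    """Element-wise addition of two same-sized 2D feature maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2D map by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def fuse(pyramids, refine=lambda m: m):
    """Fuse the outputs of several cascaded enhancement modules.

    pyramids: one pyramid per module, each a list of 2D maps ordered from the
    finest scale to the coarsest. `refine` is a placeholder for RSEConv.
    Returns a list of equally sized maps, i.e. a channel-wise concatenation.
    """
    merged = []
    for s in range(len(pyramids[0])):
        acc = pyramids[0][s]
        for pyr in pyramids[1:]:
            acc = add_maps(acc, pyr[s])  # merge corresponding scales
        merged.append(refine(acc))
    target = len(merged[0])  # height of the finest scale
    return [upsample_nearest(m, target // len(m)) for m in merged]

# Two toy pyramids, each with a 4x4 map and a 2x2 map (made-up values).
pyr_a = [[[1] * 4 for _ in range(4)], [[2] * 2 for _ in range(2)]]
pyr_b = [[[10] * 4 for _ in range(4)], [[20] * 2 for _ in range(2)]]
fused = fuse([pyr_a, pyr_b])
```

In a real network the maps would be multi-channel tensors and `refine` would be a learned layer; the sketch only shows why all scales end up at the same resolution before concatenation.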
Deformable convolution and label generation
In our approach, we exploit the advantages of deformable convolutions [61] to improve the model’s receptive field, particularly beneficial for text instances with extreme aspect ratios. Following the methodology detailed in [62], we integrate modulated deformable convolutions into all the 3\(\times \)3 convolutional layers across the conv3, conv4, and conv5 stages of the ResNet-50 backbone.
Deformable convolution offers an adaptable and adjustable receptive field, enabling the model to effectively detect text instances of diverse shapes and sizes. By integrating deformable convolutions into the network architecture, the model can dynamically modify its receptive field, focusing on the most pertinent regions within the input image. This adaptable mechanism is particularly beneficial for text detection, as it assists the model in handling the complexities presented by text instances with extreme aspect ratios. The incorporation of modulated deformable convolutions within the ResNet-50 backbone strengthens the model’s capacity to capture intricate details, efficiently representing the intricate structures of text instances. This inclusion further enhances the overall performance and resilience of our text detection system, enabling accurate detection of text instances with varying aspect ratios and challenging visual attributes.
The label generation for the probability map is inspired by PSENet [45]. Each annotated text region is a polygon described by a set of line segments:
\[G = \{S_k\}_{k=1}^{n}\]
where n is the number of vertices, which typically varies with the labeling rules of different datasets, G denotes the set of line segments describing a text region, and \(S_k\) is its k-th segment. Mathematically, the offset D used to shrink each polygon can be calculated as follows:
\[D = \frac{\mathrm{Area}(P)\,(1 - r^{2})}{\mathrm{Perimeter}(P)}\]
In this context, Area(\(\cdot \)) refers to the calculation of the polygon’s area, and Perimeter(\(\cdot \)) denotes the calculation of the polygon’s perimeter. P refers to the original polygonal text region. The shrink ratio r is empirically set to 0.4. By employing graphics-related operations, the shrunken polygons can be derived from the original ground truth, serving as the fundamental building block for each text region. During the inference phase, we can use either the probability map or the approximate binary map to generate text bounding boxes, and the two yield nearly identical results. To enhance efficiency, we opt for the probability map, which allows us to eliminate the threshold branch. The process of forming the bounding boxes involves three steps: (1) the probability map or the approximate binary map is first binarized with a fixed threshold (0.2) to produce the binary map; (2) connected regions, representing the shrunken text regions, are identified from the binary map; (3) the shrunken regions are dilated with an offset \(D'\), calculated as follows:
\[D' = \frac{A' \times r'}{L'}\]
where \(A'\) is the area of the shrunk polygon; \(L'\) is the perimeter of the shrunk polygon; \(r'\) is set to 1.5 empirically.
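As a worked example of the two offsets, the sketch below computes the shrink offset with r = 0.4 and the dilation offset with r' = 1.5 for an axis-aligned rectangle. It assumes the area-over-perimeter form of both offsets; the helper names and the rectangle are hypothetical. Offsetting an arbitrary polygon needs a polygon clipping library, so the sketch only insets a rectangle, for which the shrink is exact.

```python
import math

def polygon_area(points):
    """Shoelace formula; points is an ordered list of (x, y) vertices."""
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def polygon_perimeter(points):
    """Sum of edge lengths of a closed polygon."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))

def shrink_offset(points, r=0.4):
    """D = Area(P) * (1 - r^2) / Perimeter(P), used for label generation."""
    return polygon_area(points) * (1.0 - r * r) / polygon_perimeter(points)

def expand_offset(points, r_prime=1.5):
    """D' = A' * r' / L', used to dilate a detected shrunken region."""
    return polygon_area(points) * r_prime / polygon_perimeter(points)

def rect_polygon(x0, y0, x1, y1):
    """Axis-aligned rectangle as a 4-vertex polygon."""
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]

# Hypothetical 100 x 20 text box: area 2000, perimeter 240.
text_box = rect_polygon(0, 0, 100, 20)
d = shrink_offset(text_box)
# For a rectangle, shrinking by d is an inset of d on every side.
kernel = rect_polygon(d, d, 100 - d, 20 - d)
d_prime = expand_offset(kernel)
```

For this box the shrink offset is 2000 × 0.84 / 240 = 7 pixels. Note that the dilation offset computed from the shrunken region is an empirical approximation and does not exactly invert the shrink.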
Experimental results
Datasets
In addition to the medical text dataset, our experiments encompass several prevalent public scene text detection datasets, including SynthText, Total-Text, CTW1500, ICDAR 2015, and MSRA-TD500. These datasets serve to evaluate the robustness and generalization abilities of our proposed model. By assessing the model across diverse datasets, we aim to validate its performance in various text detection scenarios and its ability to handle an array of text types and shapes beyond medical text.
MEBI-2000 dataset is derived from the public dataset of Ali Tianchi competition [4]. Specifically created for medical ticket text detection, it includes discharge summaries, prescription invoices, outpatient invoices, and inpatient invoices. Representing scenarios encountered in medical insurance, this dataset includes challenges like mixed text and images, camera shake, skewed images, dense text, and table data. Privacy is maintained through anonymization. MEBI-2000 serves as a practical dataset for medical insurance image analysis.
SynthText dataset [10] is a synthetic dataset comprising over 800k synthetic images. It is exclusively utilized for pre-training our model.
Total-Text dataset [3] comprises 1255 training images and 300 testing images, emphasizing curved text. This dataset covers various text types, including horizontal, multi-oriented, and curved text, and includes word-level annotations, enabling precise evaluation and analysis.
CTW1500 dataset [52] includes 1000 training images and 500 testing images, focusing on curved text. The dataset contains both English and Chinese texts, annotated at the text-line level using polygonal shapes. It serves as a valuable resource for developing and evaluating algorithms designed for curved text detection and recognition tasks.
ICDAR 2015 dataset [19] is a frequently used dataset for text detection, containing 1000 training images and 500 testing images captured using Google Glass. It includes text instances that are often heavily distorted or blurred, annotated by their quadrangle’s four vertices.
MSRA-TD500 dataset [50] includes 300 training images and 200 test images with text-line level annotations. It is a dataset with multi-lingual, arbitrary-oriented and long text lines. Because the training set is rather small, we follow the previous works [32, 35, 60] to include the 400 images from HUST-TR400 [51] as training data.
Implementation details
We initially pre-train the vision-language model on SynthText, incorporating full annotations, and then transfer the backbone weights to fine-tune our proposed model using real datasets (MEBI, Total-Text, CTW1500, ICDAR2015, and MSRA-TD500). We proceed with fine-tuning the models for 1200 epochs on the corresponding real-world datasets. During training, our primary data augmentation techniques encompass random rotation, random cropping, and random horizontal and vertical flipping. Additionally, we resize all images to 640 \(\times \) 640 to enhance training efficiency. For all datasets, the training batch size is set to 16, and we adhere to a "poly" learning rate policy to facilitate gradual decay of the learning rate. Initially, the learning rate is set to 0.007, accompanied by an attenuation coefficient of 0.9. Our framework employs stochastic gradient descent (SGD) as the optimization algorithm, with weight decay and momentum values set to 0.0001 and 0.9, respectively.
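The "poly" learning rate policy can be sketched as follows. The sketch assumes the common form lr = base_lr × (1 − step / max_steps)^power and interprets the attenuation coefficient of 0.9 as the exponent; the function name and the step counts are hypothetical.

```python
def poly_lr(step, max_steps, base_lr=0.007, power=0.9):
    """Polynomial decay: base_lr * (1 - step / max_steps) ** power.

    base_lr and power follow the values quoted in the text; treating 0.9
    as the exponent is an assumption about how the policy is applied.
    """
    return base_lr * (1.0 - step / max_steps) ** power

# Learning rate at the start, midpoint, and end of a hypothetical schedule.
schedule = [poly_lr(s, 1000) for s in (0, 500, 1000)]
```

The rate starts at 0.007, decays slightly faster than linearly in the early steps, and reaches zero at the final step.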
Ablation study
To illustrate the effectiveness of pivotal modules, specifically the deformable convolution, Efficient Feature Enhancement Module, and Multi-Scale Feature Fusion Module with RSEConv, we conducted an ablation study using the MEBI dataset and the ICDAR2015 dataset. Detailed insights into each module’s performance are presented in Table 1. This evaluation allows an assessment of the individual modules’ contributions to the entire text detection system, validating their role in enhancing accuracy and robustness.
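The ablation results are reported in terms of F-measure, the harmonic mean of precision and recall. A minimal sketch of the computation follows; the precision and recall values are made up for illustration and are not taken from the tables in this paper.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Hypothetical detector with 90% precision and 80% recall.
score = f_measure(0.90, 0.80)
```

Because the harmonic mean is dominated by the smaller of the two values, a gain in F-measure indicates that precision and recall improved together rather than one at the expense of the other.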
The efficacy of Deformable Convolution is evident from the outcomes displayed in Table 1. For the ICDAR2015 dataset, integrating deformable convolution yields a noteworthy 3.4% increase in the F-measure. Correspondingly, in the case of the MEBI dataset, deformable convolution contributes to an improvement of 1.8%. These results underscore how deformable convolution endows the backbone network with an adaptable receptive field, facilitating the effective capture of spatial relationships and enhancing the performance of the text detection system. Significantly, the incurred additional computational cost is relatively marginal, rendering it a practical and efficient enhancement for the model.
The effectiveness of the EFEM is prominently displayed in Table 1. In the case of the ICDAR2015 dataset, the EFEM substantially boosts the F-measure by 5.7%. Furthermore, when utilized alongside deformable convolution (EFEM+DConv), the F-measure enhancement reaches 7.8%. Similarly, for the MEBI dataset, the EFEM demonstrates a 6.6% increase in the F-measure, while the combined EFEM+DConv indicates an even higher improvement of 9.5%. Although this combination might experience a slight reduction in inference speed, the notable enhancements in the F-measure unequivocally portray the EFEM’s improved accuracy in detecting text instances.
The effectiveness of MSFM-RSE is evident from the outcomes in Table 1. For the ICDAR2015 dataset, the MSFM-RSE contributes to a 2.8% enhancement in the F-measure. When utilized alongside deformable convolution (MSFM-RSE+DConv), this improvement increases to 5.4%. Similarly, in the MEBI dataset, the MSFM-RSE demonstrates its effectiveness by raising the F-measure by 5%. Moreover, the combined MSFM-RSE+DConv approach achieves an even more notable increase of 7.9% in the F-measure. While there might be a slight reduction in inference speed, the remarkable improvements in the F-measure highlight the MSFM-RSE’s superior performance in accurately detecting text instances.
The influence of the number of cascaded EFEMs
We analyzed the impact of the number of cascaded EFEMs by varying n from 0 to 4 (Table 2). The F-measures on the test sets increased with n and began to plateau once n reached 2. A larger n also slows down the model, despite the low computational cost of each EFEM. To balance performance and speed, n is set to 2 by default in subsequent experiments.
Comparisons with previous methods
We assess the performance of our proposed method by conducting a comparative analysis against previous approaches on five established benchmarks. These benchmarks encompass various text scenarios, including medical text, curved text, multi-oriented text, and long text lines in multiple languages. For a comprehensive evaluation, we employ both quantitative and qualitative methods. Quantitative results are presented in terms of benchmark scores, while qualitative findings are visualized in Figs. 5, 6, and 7. This analysis allows us to showcase the performance of our proposed method and its efficacy in handling diverse text scenarios.
Medical text detection
The MEBI dataset is designed for medical text and encompasses a diverse array of textual instances with varying scales, irregular shapes, and extreme aspect ratios. In Table 3, we compare our model against previous methods in terms of accuracy, F-measure, recall, and frames per second (FPS); our model surpasses existing approaches on medical text detection. Figures 5 and 6 visualize our model’s detections on medical text instances from the MEBI dataset, showing that it accurately detects and segments text within challenging medical images.
To verify the robustness and adaptability of our model, we conducted experiments on publicly available scene text datasets. Figure 8 displays qualitative results from these supplementary experiments, depicting our model’s proficiency in addressing various text scenarios beyond the medical domain. These outcomes further emphasize the flexibility and efficacy of our model across diverse text detection tasks.
Curved text detection
The performance of our approach on two curved text datasets, Total-Text and CTW1500, is reported in Tables 4 and 5. Compared with DBNet++, which is recognized for its strong performance on these datasets, our method achieves equivalent or even superior F-measure scores while requiring only about 80% of the inference time. This comparison highlights both the accuracy and the efficiency of the proposed approach on curved text instances, making it a highly competitive solution for curved text detection tasks.
Multi-oriented and multi-language text detection
The ICDAR 2015 dataset poses various obstacles, such as multi-oriented text, small text instances, and low-resolution images. Despite these difficulties, our method surpasses the established DBNet++, as detailed in Table 6, with a 2.6% improvement in F-measure. Similarly, Table 7 shows that our method achieves comparable or even superior F-measure relative to DBNet++ on the MSRA-TD500 dataset, coupled with faster inference speed. These results underscore the favorable balance our method strikes between detection performance and inference speed, making it an attractive option for practical scene text detection applications.
Conclusion
In summary, we have introduced an efficient framework for real-time detection of medical and arbitrary-shaped text that capitalizes on a vision-language pre-training model. Our approach integrates the Efficient Feature Enhancement Module (EFEM) and the Multi-Scale Feature Fusion Module with RSEConv (MSFM-RSE) to improve feature extraction without incurring significant computational overhead. In extensive experiments on multiple datasets, our method showed significant improvements in both speed and accuracy over prior state-of-the-art text detectors.
Future endeavors could concentrate on further refining the computational efficiency of our framework to enable real-time text detection in resource-constrained environments. Exploring advanced vision-language pre-training models and leveraging more extensive training data may enhance the model’s performance and generalization capabilities. Additionally, expanding the framework to address more challenging text detection scenarios, such as scene text in complex backgrounds or multi-lingual text, presents an intriguing avenue for future research.
Data availability
The data sets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
Baek Y, Lee B, Han D, et al (2019) Character region awareness for text detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9365–9374
Chen YC, Li L, Yu L, et al (2020) Uniter: Universal image-text representation learning. In: European conference on computer vision, Springer, pp 104–120
Ch’ng CK, Chan CS (2017) Total-text: A comprehensive dataset for scene text detection and recognition. In: 2017 14th IAPR international conference on document analysis and recognition (ICDAR), IEEE, pp 935–942
CMedOCR (2022) Medical Inventory Invoice OCR Element Extraction Task. https://tianchi.aliyun.com/dataset/131815
Dai J, Li Y, He K, et al (2016) R-fcn: Object detection via region-based fully convolutional networks. Advances in neural information processing systems 29
Deng D, Liu H, Li X, et al (2018) Pixellink: Detecting scene text via instance segmentation. In: Proceedings of the AAAI conference on artificial intelligence
Fan DP, Cheng MM, Liu JJ, et al (2018a) Salient objects in clutter: Bringing salient object detection to the foreground. In: Proceedings of the European conference on computer vision (ECCV), pp 186–202
Fan DP, Gong C, Cao Y, et al (2018b) Enhanced-alignment measure for binary foreground map evaluation. arXiv preprint arXiv:1805.10421
Fan DP, Wang W, Cheng MM, et al (2019) Shifting more attention to video salient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8554–8564
Gupta A, Vedaldi A, Zisserman A (2016) Synthetic data for text localisation in natural images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2315–2324
Han Q, Yin Q, Zheng X, et al (2021) Remote sensing image building detection method based on mask r-cnn. Complex & Intelligent Systems pp 1–9
He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
He K, Gkioxari G, Dollár P, et al (2017a) Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision, pp 2961–2969
He M, Liao M, Yang Z, et al (2021) Most: A multi-oriented scene text detector with localization refinement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 8813–8822
He P, Huang W, He T, et al (2017b) Single shot text detector with regional attention. IEEE
He W, Zhang XY, Yin F, et al (2017c) Deep direct regression for multi-oriented scene text detection. In: Proceedings of the IEEE international conference on computer vision, pp 745–753
Hu H, Zhang C, Luo Y, et al (2017) Wordsup: Exploiting word annotations for character based text detection. arXiv e-prints
Jia C, Yang Y, Xia Y, et al (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning, PMLR, pp 4904–4916
Karatzas D, Gomez-Bigorda L, Nicolaou A, et al (2015) Icdar 2015 competition on robust reading. In: 2015 13th international conference on document analysis and recognition (ICDAR), IEEE, pp 1156–1160
Li G, Duan N, Fang Y, et al (2020) Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In: Proceedings of the AAAI conference on artificial intelligence, pp 11336–11344
Li LH, Yatskar M, Yin D, et al (2019) Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557
Liao M, Shi B, Bai X, et al (2017) Textboxes: A fast text detector with a single deep neural network. In: Proceedings of the AAAI conference on artificial intelligence
Liao M, Shi B, Bai X (2018) Textboxes++: A single-shot oriented scene text detector. IEEE Trans Image Process 27(8):3676–3690
Liao M, Zhu Z, Shi B, et al (2018c) Rotation-sensitive regression for oriented scene text detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5909–5918
Liao M, Wan Z, Yao C, et al (2020) Real-time scene text detection with differentiable binarization. In: Proceedings of the AAAI conference on artificial intelligence, pp 11474–11481
Liao M, Zou Z, Wan Z, et al (2022) Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(1):919–931
Lin J, Jiang J, Yan Y, et al (2022) Dptnet: A dual-path transformer architecture for scene text detection. arXiv preprint arXiv:2208.09878
Liu W, Anguelov D, Erhan D, et al (2016) Ssd: Single shot multibox detector. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, Springer, pp 21–37
Liu Z, Lin G, Yang S, et al (2018) Learning markov clustering networks for scene text detection. arXiv preprint arXiv:1805.08365
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
Long S, Ruan J, Zhang W, et al (2018) Textsnake: A flexible representation for detecting text of arbitrary shapes. In: Proceedings of the European conference on computer vision (ECCV), pp 20–36
Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32
Lyu P, Liao M, Yao C, et al (2018a) Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In: Proceedings of the European conference on computer vision (ECCV), pp 67–83
Lyu P, Yao C, Wu W, et al (2018b) Multi-oriented scene text detection via corner localization and region segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7553–7563
Ma J, Shao W, Ye H, et al (2018) Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans Multimedia 20(11):3111–3122
Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning, PMLR, pp 8748–8763
Ren S, He K, Girshick R, et al (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28
Su W, Zhu X, Cao Y, et al (2019) Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530
Tan H, Bansal M (2019) Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490
Tian Z, Huang W, He T, et al (2016) Detecting text in natural image with connectionist text proposal network. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14, Springer, pp 56–72
Tian Z, Shu M, Lyu P, et al (2019) Learning shape-aware embedding for scene text detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4234–4243
Wang F, Jiang M, Qian C, et al (2017) Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164
Wang P, Zhang C, Qi F, et al (2019a) A single-shot arbitrarily-shaped text detector based on context attended multi-task learning. Proceedings of the 27th ACM International Conference on Multimedia
Wang W, Xie E, Li X, et al (2019b) Shape robust text detection with progressive scale expansion network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9336–9345
Wang W, Xie E, Song X, et al (2019c) Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 8440–8449
Wang Y, Xie H, Zha ZJ, et al (2020) Contournet: Taking a further step toward accurate arbitrary-shaped scene text detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11753–11762
Xie E, Zang Y, Shao S, et al (2019) Scene text detection with supervised pyramid context network. In: Proceedings of the AAAI conference on artificial intelligence, pp 9038–9045
Xue C, Zhang W, Hao Y, et al (2022) Language matters: A weakly supervised vision-language pre-training approach for scene text detection and spotting. Springer, Cham
Yao C, Bai X, Liu W, et al (2012) Detecting texts of arbitrary orientations in natural images. In: 2012 IEEE conference on computer vision and pattern recognition, IEEE, pp 1083–1090
Yao C, Bai X, Liu W (2014) A unified framework for multioriented text detection and recognition. IEEE Trans Image Process 23(11):4737–4749
Yuliang L, Lianwen J, Shuaitao Z, et al (2017) Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170
Zhang C, Liang B, Huang Z, et al (2019) Look more than once: An accurate detector for text of arbitrary shapes. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10552–10561
Zhang H, Zu K, Lu J, et al (2022) Epsanet: An efficient pyramid squeeze attention block on convolutional neural network. In: Proceedings of the Asian Conference on Computer Vision, pp 1161–1177
Zhang SX, Zhu X, Hou JB, et al (2020) Deep relational reasoning graph network for arbitrary shape text detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 9699–9708
Zhang SX, Zhu X, Yang C, et al (2021) Adaptive boundary proposal network for arbitrary shape text detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1305–1314
Zhang Z, Zhang C, Shen W, et al (2016) Multi-oriented text detection with fully convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4159–4167
Zhao H, Shi J, Qi X, et al (2017) Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2881–2890
Zhou X, Yao C, Wen H, et al (2017a) East: an efficient and accurate scene text detector. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp 5551–5560
Zhou X, Yao C, Wen H, et al (2017b) East: an efficient and accurate scene text detector. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp 5551–5560
Zhu X, Hu H, Lin S, et al (2019a) Deformable convnets v2: More deformable, better results. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9308–9316
Zhu X, Hu H, Lin S, et al (2019b) Deformable convnets v2: More deformable, better results. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Acknowledgements
This work was supported by the Scientific Research Funds of Northeast Electric Power University (No. BSZT07202107).
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, T., Bai, J. & Wang, Q. Enhancing medical text detection with vision-language pre-training and efficient segmentation. Complex Intell. Syst. 10, 3995–4007 (2024). https://doi.org/10.1007/s40747-024-01378-3