1 Introduction

Matching abbreviated names with their full names plays a key role in many tasks, such as data integration, address matching, and information retrieval [1]. In Chinese, many common nouns have established shortened forms, commonly known as abbreviations. Due to differences in data collection standards and sources, the same object is frequently represented by various abbreviations and full names in big data applications. During data cleaning and integration, it is essential to match these abbreviations and full names effectively to ensure accurate data merging.

With the continuous development of information technology, information retrieval has become a common application scenario, and bridging the gap between abbreviations and their full names is crucial for improving retrieval efficiency and accuracy. Matching abbreviations with their expansions (full names) faces numerous challenges, as traditional methods often consider only exact acronyms and ignore near-homophone and near-homoglyph variation in Chinese.

This paper presents a method for Chinese full-abbr matching that combines phonetic and glyph information. The Chinese near-homophone full-abbr matching model, based on SimBert and VGG, can effectively and accurately match near-homophone full names. Meanwhile, the Chinese near-homoglyph full-abbr matching model, which integrates multiple information sources into DenseNet, can thoroughly learn Chinese glyph information, enabling efficient matching of Chinese near-homoglyph full names and abbreviations. By combining these matching models and using expert knowledge to rank the matched full names precisely, the approach mimics a human-like cognitive process and effectively improves the accuracy of this task.

Compared to previous methods, our approach differs in several aspects: (1) We construct a comprehensive matching model by integrating SimBert, VGG, and DenseNet, considering the combined impact of phonetic, glyph, and multi-modal information. (2) We rank the matched full names precisely using expert knowledge, further enhancing accuracy. (3) This comprehensive and efficient approach enables our method to address challenges in full-abbr matching more effectively, resembling human cognitive processes.

Overall, our method is more holistic compared to previous approaches as it comprehensively considers near-homophone and multi-modal information. This effectively addresses challenges in full-abbr matching, significantly improving the accuracy of the task. Figure 1 illustrates the overall workflow of the proposed CFNAM-PG model.

Fig. 1 Overall workflow of the CFNAM-PG model

2 Related Works

For the near-homophone full-abbr matching task, Liangwu et al. [3] proposed a method that combines inverted indexing and rules, building a phonetic index to tackle near-homophone matching for abbreviations. However, this method faces challenges such as recall explosion and the absence of unified standards for expert knowledge. Microsoft introduced the Unilm model [4], based on the BERT architecture, which employs BERT for Seq2Seq tasks to generate vector representations of the input. An improved version called SimBert was subsequently developed, but it primarily emphasizes the construction of semantic vectors and does not extensively model phonetic information. Sun et al. [5] used ChineseBERT to integrate Chinese phonetic and glyph information, achieving substantial improvements in downstream tasks through large-scale Chinese pre-training. Lu et al. [6] introduced ViLBERT, a model that processes inputs from different modalities independently and subsequently integrates visual and textual features in a transformer model.

For the task of near-homoglyph full-abbr matching, Zhan [7] introduced a Chinese character generation technique using Generative Adversarial Networks (GAN): a GAN first generates offline Chinese character samples, which are then fed into a Chinese character recognition model for identification. Zengxian [8] proposed an improved edit distance algorithm utilizing information such as Wubi codes and character structure. As deep learning technology advances, an increasing number of researchers are turning to deep learning methods to address the recognition of similar Chinese characters. Qi et al. [9] proposed a glyph similarity algorithm that uses image processing and Wubi encoding to convert Chinese characters into feature vectors and stroke order codes, calculating similarity through binary differences and an improved Jaro-Winkler distance. Bhagyasree et al. [10] introduced a deep learning technique, the Directed Acyclic Graph Convolutional Neural Network (DAG-CNN), which improved the accuracy of handwritten character recognition beyond traditional methods and other deep learning techniques. Al Janabi [11] achieved multi-level community discovery and in-depth analysis on a complex and vast citation database by constructing an initial network, extracting and cleaning keywords, building a classification model, and adding topics to the initial network. Wang et al. [12] proposed a method for recognizing similar Chinese characters based on deep convolutional neural networks, achieving high recognition accuracy by extracting features with convolutional neural networks.

3 Chinese Near-homophone Full-Abbreviation Matching Model

VGG is used as the speech vector learning model, with spectrograms as its input. A spectrogram is obtained by dividing the speech signal into short time windows, applying a Fourier transformation to each window, and then applying a logarithmic transformation, yielding a two-dimensional time-frequency matrix. In Chinese speech recognition, deep learning models are typically employed to learn speech vectors: Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) perform convolution operations on spectrograms to extract local features, followed by pooling layers that reduce feature dimensions. Multiple convolution and pooling layers form a convolutional network, and the convolutional features are finally mapped to the speech vector space through fully connected layers.
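As a concrete illustration, the following minimal Python sketch produces the kind of log spectrogram used as CNN input; the sample rate and window/hop sizes are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch: waveform -> log spectrogram as 2-D CNN input.
# Sample rate and window/hop sizes are illustrative assumptions.
import numpy as np
from scipy.signal import spectrogram

def log_spectrogram(waveform, sample_rate=16000, window_ms=25, hop_ms=10):
    nperseg = int(sample_rate * window_ms / 1000)            # samples per window
    noverlap = nperseg - int(sample_rate * hop_ms / 1000)    # window overlap
    _, _, sxx = spectrogram(waveform, fs=sample_rate,
                            nperseg=nperseg, noverlap=noverlap)
    return np.log(sxx + 1e-10)  # log transform compresses the dynamic range
```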

We pre-trained the combined VGG16 and CTC (Connectionist Temporal Classification) [2] model on existing speech recognition datasets, fine-tuned the VGG16 parameters, and used the output of its fully connected layer as the speech vector. Through pre-training, the model captures the speech characteristics of the text. Using the pre-trained VGG16 output vectors, we transform the text's speech information into vectors and fuse them with the neural network model, effectively enhancing the understanding of textual speech information. With the improved SimBert model, built upon SimBert and VGG, as the framework of the Chinese near-homophone full-abbr matching model that combines character and speech features, the model learns vector representations of textual speech, ultimately improving the accuracy of near-homophone full-abbr matching.

The basic framework of the Chinese near-homophone full-abbr matching model that combines character and speech features is illustrated in Fig. 2.

Fig. 2 The flow chart of Chinese near-homophone full-abbr matching

3.1 Chinese Near-homophone Abbreviation Matching Model Architecture

First, we incorporate speech information into SimBERT to learn joint statistical features of Chinese speech and characters from a large training corpus, addressing issues such as recall explosion in traditional models and the absence of unified standards for defining and combining expert knowledge. As shown in Fig. 3, the Chinese near-homophone full name and abbreviation are concatenated using "[CLS]" and "[SEP]" to form the "Input token". For example, '江苏科技大学' (meaning 'Jiangsu University of Science and Technology') is concatenated with its abbreviation '江科大'. The Speech Representation Model (SRM), which consists of a pre-trained VGG16 model, processes the full name and abbreviation separately to obtain character speech vectors. The encoded vectors are used as "Speech embeddings" and pass through the "Fusion Layer" to obtain character vectors with fused speech features. These fused features are combined with the positional and identification features of the entity names and serve as input to SimBERT. Finally, the output at the [CLS] position of SimBERT represents the embedding of the first piece of text.

Fig. 3 The Chinese near-homophone full-abbr matching model integrating character and speech features

The pre-trained speech recognition model takes spectrograms as input to the VGG16 model. The output of the VGG16 model is then fed into a CTC decoding model, which transcribes the audio waveform signal directly into Chinese pinyin. For example, if we convert the speech of the character '科' into a spectrogram and feed it to the speech recognition model, the result is the pinyin 'ke'.

For model training, the THCHS30 dataset and the ST-CMDS dataset from Tsinghua University are used as the training set for Chinese speech data. As shown in formulas (1)-(4), for an entity name of length \(l\), the input and output are represented as \(X=\{{X}_{1}, {X}_{2}, \cdots ,{X}_{i}, \cdots , {X}_{l}\}\) and \(O=\{{O}_{1}, {O}_{2}, \cdots , {O}_{i}, \cdots , {O}_{l}\}\). The intermediate layer of the model is represented as \(H=\{{H}_{1}, {H}_{2}, \cdots , {H}_{i}, \cdots , {H}_{l}\}\). Here, \({X}_{i}\) denotes a Chinese character, and \({O}_{i}\) denotes the output pinyin for that character. Open-source spectrogram generation tools such as Wave are used to convert the speech of Chinese characters into spectrograms, which are then used as input to the speech recognition model. The pre-trained model maps the spectrograms to the recognized pinyin \({O}_{i}\) for \({X}_{i}\). After pre-training, the model's optimal weights are fixed, and we select \(\text{VGG}^{\prime}+{\text{Dense}}_{128}\) as the Speech Representation Model (SRM). The pre-trained model achieves end-to-end training of the entire speech recognition model on speech data, enabling representation learning for speech: the speech of a Chinese character is the input to the SRM, which outputs a speech vector representing that character. The specific details of the model are given in the following formulas:

$${X}_{i}^{\prime}=\text{Wave}({X}_{i})$$
(1)
$$\text{VGG16}=\text{VGG}^{\prime}+{\text{Dense}}_{1000}$$
(2)
$${H}_{i}={\text{Dense}}_{128}(\text{VGG}^{\prime}({X}_{i}^{\prime}))$$
(3)
$${O}_{i}=\text{CTC}(\text{Softmax}({\text{Dense}}_{1428}({H}_{i})))$$
(4)

In these formulas, "\({\text{Dense}}_{128}\)" and "\({\text{Dense}}_{1428}\)" denote fully connected layers. "\({\text{Dense}}_{1428}\)" is followed by a Softmax layer that yields the probability distribution over speech categories, which is used to train and optimize the speech recognition model. The Speech Representation Model (SRM) is optimized against the CTC loss computed on the real labels of the training dataset. To obtain speech vectors for text sequences, the VGG16 parameters are modified: the final fully connected layer, originally of size 1000, is resized to 128. Spectrograms are fed into the VGG16 network, and after 13 convolutional layers and 3 fully connected layers, a 128-dimensional vector representation, denoted \({H}_{i}\), is obtained.
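A hedged Keras sketch of this SRM follows; the input shape, weight initialization, and layer names are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of the SRM in Eqs. (2)-(4): the VGG16 convolutional trunk (VGG'),
# a 128-d layer replacing the original 1000-way classifier, and a 1428-way
# softmax head whose outputs would be decoded by CTC during pre-training.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_srm(input_shape=(64, 64, 3), num_pinyin_classes=1428):
    trunk = tf.keras.applications.VGG16(include_top=False, weights=None,
                                        input_shape=input_shape)
    x = layers.Flatten()(trunk.output)
    h = layers.Dense(128, name="speech_vector")(x)    # H_i, Eq. (3)
    probs = layers.Dense(num_pinyin_classes,
                         activation="softmax")(h)     # fed to CTC, Eq. (4)
    return Model(trunk.input, [h, probs])

srm = build_srm()  # after CTC training, the "speech_vector" output is H_i
```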

Furthermore, it is necessary to process the full name and abbreviation from the dataset as a continuous text sequence. This sequence needs to be transformed into three parts: “Char embedding,” “Position embedding,” and “Segment embedding.”

First, the text sequence undergoes WordPiece tokenization, breaking it into basic units, which are then converted into an index sequence. This sequence is input into an embedding layer to obtain the "Char embedding" vector, which carries text identity information.

Next, following WordPiece tokenization, the text sequence is converted into a sequence of numbers representing each unit's position within the text, and this sequence is input into an embedding layer to obtain the "Position embedding" vector, which contains the relative positional information of the basic characters.

Furthermore, to distinguish between different parts of the text sequence based on full name and abbreviated text, 0 and 1 are used to represent the first and second parts of the basic characters in the text sequence. The resulting numerical sequence is then passed through an embedding layer to obtain the “Segment embedding” vector.

After completing these steps, three vectors are obtained for the text sequence: "Char embedding", "Position embedding" and "Segment embedding". These vectors serve as inputs to the Fusion Layer, along with the "Speech embedding", to fuse character information and speech information. The Fusion Layer combines these vectors and feeds them into a fully connected layer, ultimately producing a 128-dimensional Fusion embedding vector. The Fusion Layer formulas are as follows: for a text sequence \(p\) of length \({l}_{p}\), where \(p=\{{p}_{1},{p}_{2},\cdots ,{p}_{i},\cdots ,{p}_{{l}_{p}}\}\) and \({p}_{i}\) is a Chinese character, the Char embedding and Speech embedding of \({p}_{i}\) are \({C}_{i}=\left\{{v}_{1},{v}_{2},\cdots ,{v}_{k},\cdots ,{v}_{128}\right\}\) and \({H}_{i}=\{{v}_{1}^{\prime},{v}_{2}^{\prime},\cdots ,{v}_{k}^{\prime},\cdots ,{v}_{128}^{\prime}\}\), respectively, and the output is \({F}_{i}\). As shown in formulas (5) and (6), '\(\text{Concat}\)' denotes the concatenation of the input vectors, followed by a fully connected layer that produces a 128-dimensional fusion vector. This vector is the fused representation combining character and speech information.

$${F}_{i}={\text{Dense}}_{128}(\text{Concat}({C}_{i},{H}_{i}))$$
(5)
$${\text{Dense}}_{128}(x)={W}_{F}*x+b$$
(6)
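A minimal sketch of this Fusion Layer follows; the tensor shapes and layer names are illustrative assumptions.

```python
# Eqs. (5)-(6): concatenate the 128-d Char embedding C_i and the 128-d
# Speech embedding H_i, then project back to 128 dimensions to get F_i.
import tensorflow as tf
from tensorflow.keras import layers

char_emb = layers.Input(shape=(None, 128))    # C_i for each token
speech_emb = layers.Input(shape=(None, 128))  # H_i for each token
concat = layers.Concatenate(axis=-1)([char_emb, speech_emb])
fused = layers.Dense(128)(concat)             # F_i = Dense_128(Concat(C_i, H_i))
fusion_layer = tf.keras.Model([char_emb, speech_emb], fused)
```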

Following the Fusion Layer, the obtained "Fusion embedding" is combined with the "Position embedding" and "Segment embedding" as input to SimBert for model training.

Ultimately, the trained Chinese near-homophone full-abbr matching model based on SimBert and VGG produces output vectors, which contain the fused character and speech information. These vector representations are then utilized for subsequent matching tasks.

3.2 Similarity Calculation

The vector at the first position ([CLS]) serves as the output of the SimBert model and represents the text between [CLS] and [SEP]. Therefore, for retrieval, the full name and the abbreviation are each placed between [CLS] and [SEP] and fed to the model as two separate inputs, producing one output vector for the full name and one for the abbreviation. Cosine similarity is used to measure the distance between these vectors, enabling the matching of near-homophone full names and abbreviations. The calculation formula is as follows:

$$\text{Score}=\text{cos}\theta =\frac{X\cdot Y}{\Vert X\Vert \hspace{0.33em}\Vert Y\Vert }=\frac{{\sum }_{i=1}^{n}{x}_{i}{y}_{i}}{\sqrt{{\sum }_{i=1}^{n}{x}_{i}^{2}}\times \sqrt{{\sum }_{i=1}^{n}{y}_{i}^{2}}}$$

By calculating the angle between the vectors \(X\) and \(Y\) representing the full name and abbreviation, we determine the degree of similarity between them. When the full name and abbreviation are more similar, the angle between their corresponding vectors is smaller, and the cosine value is larger.
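In code, the score reduces to a standard cosine similarity between the two [CLS] vectors:

```python
# Cosine similarity between a full-name vector and an abbreviation vector.
import numpy as np

def cosine_score(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```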

4 Chinese Near-homoglyph Full-abbr Matching Model

4.1 Chinese Near-homoglyph Full-abbr Matching Model Architecture Based on DenseNet

Fusing Chinese glyph information for near-homoglyph full-abbr matching involves two main tasks: statistical feature extraction and image text recognition. By extracting features such as stroke count, Wubi encoding, and character structure from the text, we can effectively capture glyph information, which enhances the matching of near-homoglyph full names and abbreviations.

Leveraging DenseNet and CTC models for image text recognition takes advantage of DenseNet's deep, densely connected structure, enabling more accurate extraction of textual information from images while alleviating gradient problems in deep networks. By combining textual statistical features with DenseNet, we designed a Chinese full name and abbreviation recognition method that fuses phonetic and glyph information. This approach incorporates a multi-modal fusion process with human-like cognitive mechanisms, effectively enhancing the model's recognition accuracy.

First, it is necessary to obtain a dataset of text and its corresponding image data, process the input images into sequential data, and then utilize DenseNet for feature extraction and sequence modeling. Next, the sequential data is input into the CTC model to obtain the corresponding text recognition results. The DenseNet model serves as the feature extraction model for text glyph information, extracting the output vectors of the neural network's intermediate layers, known as glyph embeddings. These output vectors from the trained DenseNet model contain both textual and image information. Simultaneously, through the feature extractor, Wubi features, stroke features, structural features, and other textual statistical information are extracted. These statistical features are then combined with glyph embeddings to obtain the stroke vector information of the text, which can be used for full name and abbreviation retrieval tasks. The basic framework of the fusion multi-modal DenseNet model is illustrated in Fig. 4 below.

Fig. 4 A DenseNet-based Chinese near-homoglyph full-abbr matching model that integrates font structure and image features

The primary objective of the Chinese near-homoglyph full-abbr matching model based on the fusion multi-modal DenseNet is to achieve matching for Chinese near-homoglyph full names and abbreviations. To address this challenge, the model adopts a fusion multi-modal approach, incorporating both textual and image information. DenseNet is employed for image feature extraction. By leveraging this fusion multi-modal approach, the model can effectively harness the complementarity of textual and image information, thereby enhancing accuracy and robustness.
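The following sketch illustrates one plausible way to assemble the final glyph representation described above; the helper names and binary feature strings are assumptions, with the statistical encodings defined in Sect. 4.2.

```python
# Assumed assembly of the glyph representation: the DenseNet glyph
# embedding is concatenated with binary statistical features
# (stroke, Wubi, and structure codes as 0/1 strings, see Sect. 4.2).
import numpy as np

def glyph_representation(dense_embedding, stroke_bits, wubi_bits, struct_bits):
    stats = np.array([int(b) for b in stroke_bits + wubi_bits + struct_bits],
                     dtype=float)
    return np.concatenate([np.asarray(dense_embedding, dtype=float), stats])
```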

4.2 Statistical Feature Extractor for Glyph Information

The statistical feature extractor is a technique for extracting Chinese character features based on the text's structure. It extracts statistical features of the text, including stroke features, Wubi features, and structural features, to achieve the recognition of Chinese characters and obtain the text glyph information. The following sections will introduce the extraction of these various statistical features separately.

4.2.1 Stroke Features

The number of strokes in a character is essential information that represents the character's complexity. Characters with more strokes indicate greater complexity and convey richer information. Therefore, considering the number of strokes as a factor in near-homoglyph full-abbr matching tasks is advantageous. For example, characters like '犬' (quǎn), '大' (dà), and '术' (shù) have 4, 3, and 5 strokes, respectively. This means that the character '大' is more similar to '犬' in terms of stroke count, indicating a higher similarity between '大' and '犬'.

Incorporating the stroke count of characters as a crucial feature into near-homoglyph full-abbr matching tasks helps capture the shape similarity between Chinese characters, thereby enhancing matching accuracy. Using DenseNet, a vector is generated for each Chinese character, and the stroke count information for each character is fused into the vector. Specifically, the vector for each character is weighted by its corresponding stroke count and then summed to create a new vector that contains stroke count information. This vector is subsequently used to generate sentence vectors. By utilizing stroke count information, a more accurate description of character features can be achieved, resulting in improved matching results. Furthermore, the relative complexity between characters, i.e., the difference in stroke count, is also taken into account to further enhance matching accuracy. In summary, the introduction of stroke count information improves the effectiveness of Chinese character matching tasks and holds theoretical and practical value. Similarity calculations for full-abbr matching are carried out separately using vectors generated by DenseNet and vectors weighted by stroke count.
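A minimal sketch of this stroke-count weighting, assuming the per-character DenseNet vectors and stroke counts are given:

```python
# Scale each character's glyph vector by its stroke count, then sum
# into a sentence vector (Sec. 4.2.1). Inputs are assumed precomputed.
import numpy as np

def stroke_weighted_sentence_vector(char_vectors, stroke_counts):
    vectors = np.asarray(char_vectors, dtype=float)   # (n_chars, dim)
    weights = np.asarray(stroke_counts, dtype=float)  # (n_chars,)
    return (vectors * weights[:, None]).sum(axis=0)
```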

4.2.2 Wubi Features

The Wubi input method is based on the strokes and shapes of Chinese characters. Its coding rules decompose Chinese characters according to their strokes and shapes, converting them into corresponding letter codes. As shown in Fig. 5, the Wubi input method standard covers a total of 6,757 characters from the basic Chinese character set. Chinese characters can be deconstructed into various basic components, so their glyph information can be converted into Wubi encoding following specific rules. This encoding can represent all Chinese characters, and a character form database for commonly used Chinese characters is generated from their Wubi encodings. Each Chinese character is represented by at most 4 letters in its Wubi encoding. For example, the Wubi encoding of '抱' (bào) is 'rqn', and the Wubi encoding of '孢' (bāo) is 'bqn'.

Fig. 5 The coding rule of Wubi

To incorporate Wubi encoding into the vector generated by the DenseNet model, the Wubi code is converted into a 20-bit binary vector, with each letter of the code occupying 5 binary digits. Codes shorter than 20 bits are padded with leading zeros to 20 bits. For example, 'a' is encoded as 00001 and 'y' as 11001. The binary vector of the Wubi encoding is then combined with the glyph vector, and similarity scores for similar characters are calculated using cosine similarity.
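A minimal sketch of this encoding; the alphabet-index mapping follows the examples above ('a' → 00001, 'y' → 11001):

```python
# Encode a Wubi code (up to 4 letters) as a 20-bit binary string:
# each letter takes 5 bits (its 1-based alphabet index), and shorter
# codes are left-padded with zeros.
def wubi_to_bits(wubi_code):
    bits = "".join(format(ord(c) - ord("a") + 1, "05b") for c in wubi_code)
    return bits.zfill(20)

assert wubi_to_bits("rqn") == "00000" + "10010" + "10001" + "01110"  # '抱'
```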

4.2.3 Structural Features

The structural information of Chinese characters includes details about their glyphs and compositional structure. The basic glyph information of Chinese characters consists primarily of 32 stroke types, as listed in Table 1, and each character is composed of these stroke types in a specific order. Hence, the glyph information is encoded using a 5-bit binary code for each stroke type, and these binary codes are combined with the glyph information vector generated by DenseNet.

Table 1 The glyph stroke type of Chinese characters

For example, the composition of the glyph of the character '湖' (lake, Pinyin: hu) and its corresponding vector are shown in Table 2. As '湖' is composed of 12 glyph components, its glyph vector spans 60 dimensions.

Table 2 The glyph vector construction of the Chinese character '湖'(lake, Pinyin: hu)

Chinese characters are categorized into 13 types based on their structural information, and these 13 categories are encoded using 4-bit binary representation, as illustrated in Table 3 with examples.

Table 3 The typical structures and encodings of Chinese characters

The structural information of Chinese characters is a powerful tool for differentiation in Chinese text. By combining the glyph and structural composition information, we establish a structure profile for each Chinese character. This encoded structural information is subsequently integrated with the information generated by DenseNet, enhancing its utility in near-homoglyph full-abbr matching. Chinese characters that share similar structural compositions exhibit closer vector distances.

From Table 4, it is evident that when weighting by stroke count, vectors with similar strokes have higher similarity, whereas vectors with dissimilar strokes display lower similarity, resulting in increased distinctiveness. By incorporating Wubi encoding information into the vector generated by the DenseNet model, the fused vector demonstrates higher similarity compared to the vector lacking Wubi information. Similarly, when the encoded structural information is combined with the vector information generated by DenseNet, the fused vector showcases higher similarity compared to the vector without structural information. This enhancement underscores the effectiveness of these fusion approaches in improving vector similarity and ultimately contributing to the model's performance.

Table 4 Comparison of vector similarity calculation results

5 CFNAM-PG: Bridging Phonetic and Glyphic Information for Chinese Full-abbr Matching

In the realm of deep learning, various models possess distinct learning capabilities and dimensions. These dissimilarities among models can result in variations in the retrieved documents during retrieval tasks, potentially leading to retrieval errors. The fusion of multiple models allows us to establish more reasonable classification boundaries and achieve improved performance. Combining different models is a common strategy to enhance model performance. Voting, a straightforward yet effective ensemble learning method, can enhance model accuracy and robustness by consolidating predictions from multiple base models through voting or averaging. Therefore, we adopt a Voting approach to combine expert knowledge, glyph information, and phonetic information.

Initially, we extract the glyphic and phonetic information from the full names and abbreviations. The Chinese near-homophone model, based on SimBert and VGG, generates phonetic vectors, while the Chinese near-homoglyph model, which uses the multi-information fusion DenseNet, generates glyph vectors. The main idea behind our model fusion is to match the phonetic vectors of abbreviations against those of candidate full names, and likewise to match their glyph vectors. The candidate full names for this task are determined from the retrieval results of both matching methods. For the final matching score, we assign different weights to the glyph, phonetic, and expert knowledge features. We incorporate expert knowledge and a voting strategy as follows:

(1) Knowledge about the Length of the Full Name and Abbreviation.

The closer the length of the candidate's full name and abbreviation, the higher its priority. The formal definition and calculation of expert knowledge are as follows:

$${f}_{1}(An, Can) = \frac{\text{length}(An)}{\text{length}(Can)}$$
(7)

Here, \(An\) represents the abbreviation, \(Can\) represents the candidate full name, and length denotes the length operation. For example, if \(An\) = "建行" (Pinyin: Jian hang), and we have \(Can1\) = "建设银行江苏分行" (Construction Bank Jiangsu Branch, Pinyin: Jian she yin hang Jiang su fen hang) and \(Can2\) = "建设银行" (Construction Bank, Pinyin: Jian she yin hang) as candidate full names, then based on this expert knowledge, \({f}_{1}\left(An, Can1\right)<{f}_{1}\left(An, Can2\right)\), indicating that the priority of \(Can1\) is lower than that of \(Can2\).

(2) The Ratio of Characters in the Full Name Derived from the Abbreviation.

The candidate full name is more likely to be the best match when it contains all the characters from the abbreviation. The higher the ratio of abbreviation characters that appear in the candidate full name, the higher that candidate's priority. Formally, with \(na\_\text{word}(An, Can)\) counting the characters of the abbreviation that appear in the candidate full name, the expert knowledge is calculated as follows:

$${f}_{2}(An, Can)=\frac{na\_\text{word}(An, Can)}{\text{length}(An)}$$
(8)

For instance, if \(An\) = "建行" (an abbreviation of 'Construction Bank', Pinyin: Jian hang), and we have \(Can1\) = "工商银行" (Industrial and Commercial Bank, Pinyin: Gong shang yin hang) and \(Can2\) = "建设银行" (Construction Bank, Pinyin: Jian she yin hang) as candidate full names, then based on this expert knowledge, \({f}_{2}(An, Can1)<{f}_{2}(An, Can2)\), indicating that the priority of \(Can1\) is lower than that of \(Can2\).

(3) Voting Strategy.

Using the Chinese near-homophone full-abbr matching model based on SimBert and VGG, we obtain phonetic similarity scores between the full name and abbreviation. Similarly, the Chinese near-homoglyph full-abbr matching model, which fuses multiple types of information from DenseNet, provides glyph similarity scores between the full name and abbreviation. Expert Knowledge 1 and 2 deliver additional similarity scores for the full name and abbreviation, respectively.

By assigning weights to these four components and averaging them based on their respective similarity scores, we calculate a comprehensive similarity score. This score serves as the final matching result, enabling us to rank and filter high-similarity matches between full names and abbreviations. This approach facilitates the matching of Chinese near homophones and near homoglyphs in full name and abbreviation contexts.

In this study, the weights of the four components (phonetic similarity, glyph similarity, and Expert Knowledge 1 and 2) were set to 2, 2, 1, and 1, respectively, through extensive testing.
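A hedged sketch of this combination; the character-counting helper and the candidate scores are assumptions consistent with Eqs. (7)-(8).

```python
# Weighted voting over the phonetic score, glyph score, and the two
# expert-knowledge scores, using the paper's 2:2:1:1 weights.
def f1(abbr, cand):
    return len(abbr) / len(cand)                 # Eq. (7): length ratio

def f2(abbr, cand):
    hits = sum(1 for ch in abbr if ch in cand)   # na_word(An, Can)
    return hits / len(abbr)                      # Eq. (8): coverage ratio

def final_score(phonetic, glyph, abbr, cand, weights=(2, 2, 1, 1)):
    scores = (phonetic, glyph, f1(abbr, cand), f2(abbr, cand))
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```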

6 Experiments and Analysis

6.1 Dataset

Training Data: For the Chinese near-homophone full-abbr matching model based on SimBert and VGG, the training corpus is derived from the Sogou Input dictionary (https://shurufa.sogou.com/) and data provided by QiChaCha (https://www.qcc.com/). The dataset for the Chinese near-homoglyph full-abbr matching model primarily comprises Chinese document text recognition data. This dataset contains around 3.64 million images, divided into a training set and a validation set in a 99:1 ratio. The image data was generated with random variations in font, size, grayscale, blur, perspective, and stretching, using Chinese language corpora (news and classical Chinese). All images were standardized to a resolution of 280 × 32 pixels. The text data includes 5990 characters, comprising Chinese characters, English letters, digits, and punctuation. Each sample consists of 10 characters randomly extracted from sentences in the corpus.

Test Data: Following statistical analysis, data cleaning, and other preparatory procedures, we compiled a corpus known as “docCorpus”. This effort culminated in a parallel corpus containing approximately 40 million instances of near-homophone and near-homoglyph full names and abbreviations. From this corpus, we selected 100,000 samples to use as the test set. Additionally, we integrated 300 manually annotated samples of near-homoglyph full names and abbreviations to serve as test data for the experiment. The remaining corpus, constituting the training set, was employed for the model's training process. Some examples of manual annotations of near-homoglyph full names and abbreviations are shown in Table 5.

Table 5 Corpus example

Algorithm 1 outlines the process of generating a training corpus for near-homophone characters, as depicted in Fig. 6. In the Output section, the variable queryCorpus is defined to store the newly generated corpus of near-homophone abbreviations. In line 7, a random number between 0 and 1 is generated so that 50% of the characters are masked with near-homophones. The function Random_choose() in line 6 randomly selects near-homophone characters from the set. Line 9 represents the full-abbreviation corpus after applying the random near-homophone mask. The variable defined in line 12 receives the abbreviated corpus after the random near-homophone mask. Repeating the process 10 times in line 14 ensures that each full name yields 10 abbreviation pairs. Finally, the function Random_select() in line 16 randomly selects characters, preserving their order, to form abbreviations.

Fig. 6 Algorithm for generating the near-homophone training corpus
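A hedged Python reconstruction of Algorithm 1 as described above; near_homophones is an assumed dictionary mapping each character to its set of near-homophone characters.

```python
# Reconstruction of Algorithm 1: mask ~50% of characters with random
# near-homophones, repeat 10 times per full name, and randomly select
# characters in order to form abbreviations.
import random

def generate_query_corpus(full_names, near_homophones, repeats=10):
    query_corpus = []
    for full in full_names:
        for _ in range(repeats):
            masked = "".join(
                random.choice(near_homophones[ch])          # Random_choose()
                if ch in near_homophones and random.random() < 0.5
                else ch
                for ch in full)
            k = random.randint(2, max(2, len(masked) - 1))  # abbr length
            idx = sorted(random.sample(range(len(masked)), k))  # Random_select()
            query_corpus.append((full, "".join(masked[i] for i in idx)))
    return query_corpus
```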

Parameters and Experimental Settings: The SimBert model uses the SimBert Tiny architecture, adjusted to 4 layers, each with an input dimension of 128 and 12 attention heads. The batch size for the training dataset is set to 128, and the learning rate is configured as 0.0005 × 0.4^epoch.

SimBert uses a special attention mask mechanism that also gives the model the ability to generate text. Therefore, two loss functions optimize the model simultaneously: the seq2seq cross-entropy loss and the full-abbr matching cross-entropy loss. The two losses are added to obtain the model's joint loss function.
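As a small illustration, the learning-rate schedule and joint loss described above can be expressed as follows; the loss tensors are assumed to be computed elsewhere.

```python
# Learning rate 0.0005 * 0.4**epoch and the joint loss that adds the
# seq2seq cross-entropy to the full-abbr matching cross-entropy.
def lr_schedule(epoch, base_lr=0.0005, decay=0.4):
    return base_lr * decay ** epoch

def joint_loss(seq2seq_ce, matching_ce):
    return seq2seq_ce + matching_ce
```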

6.2 Experimental Results

To validate the model’s effectiveness in the Chinese near-homophone full-abbr matching task, we conducted comparative experiments and ablation experiments. Bold text in the tables shows the best results for the corresponding indicators. The experimental results are shown in Tables 6 and 7.

Table 6 BM weight experiments
Table 7 The experimental results of near-homophone full-abbr matching

We employed the Combined BM algorithm as a comparative approach, which combines the character-based BM (BM stands for 'Boyer-Moore') and pinyin-based BM algorithms. The pinyin-based BM algorithm is an enhancement of the character-based BM algorithm that enables pinyin-based fuzzy matching. These two algorithms were combined for comparison, and experiments were carried out on the manually annotated test set of near-homophone full names and abbreviations to investigate how the weight distribution between the pinyin-based and character-based BM algorithms influences the results. The weight experiments are reported in Table 6.

Results in Table 6 indicate that, for the near-homophone full abbreviation dataset, the importance of the Pinyin BM algorithm surpasses that of the character-level BM algorithm in terms of matching accuracy. With a weight ratio of 18.4:1.6, the combined performance of the Pinyin BM algorithm and the character BM algorithm achieves the highest matching accuracy for near-homophone full-abbr matching. In this table, “Acc” represents the matching accuracy, and Recall indicates the rate at which the algorithm successfully recalls the full name of the target when each abbreviation recalls 10 full names.

Through the analysis of the experimental results in Table 7, it is observed that the model proposed in this paper significantly outperforms other models in the task of near-homophone full-abbr matching. Its matching accuracy (0.752) is notably higher than the Joint BM algorithm, Bert, and Simbert, while achieving superior recall (0.864) and F1 score (0.804). In comparison to other models, the average processing time of the proposed model (25.08) is relatively shorter, demonstrating comprehensive advantages in terms of accuracy, completeness, and efficiency.

As shown in Table 8, in the ablation experiments, the Simbert + VGG model performs the best in the task of full-abbr matching, with high matching accuracy (0.752), recall (0.864), and F1 score (0.804). Although the Character BM model demonstrates better processing efficiency, its matching performance is slightly inferior. The (Simbert + VGG)-[Voice] model, after removing voice information, experiences a slight decline in matching performance but shows an improvement in processing efficiency. Therefore, in balancing accuracy and efficiency, the Simbert + VGG model is a preferable choice.

Table 8 Ablation experiments of near-homophone full-abbr matching

To investigate the impact of different models on image text recognition accuracy and the influence of the proposed statistical features on Chinese near-homoglyph full-abbr matching, we conducted comparative and ablation experiments. The comparative experiments primarily involved different image recognition models, including ResNet, CRNN, CNN, and DenseNet models. The results of the comparative experiments are shown in Table 9.

Table 9 Experimental results of different models on image text recognition

From the comparative experimental results in Table 9, it is evident that the proposed DenseNet + CTC model demonstrates higher accuracy compared to other models. This suggests that it is more effective in extracting character glyph information in image text recognition tasks than the other models.

To validate the performance of the statistical feature model in extracting text glyph information, ablation experiments were conducted by individually removing the stroke feature f1, Wubi feature f2, and structural feature f3. The results of these ablation experiments are shown in Table 10.

Table 10 Ablation Experimental results of near-homoglyph full-abbr matching

Based on the ablation results in Table 10, removing the Wubi encoding information degrades accuracy and recall more than removing the structural information, indicating that Wubi encoding has a more significant impact on text recognition. The results also show that using all three types of information improves the model's ability to extract text glyph information, thereby increasing the accuracy, recall, and F1 score for the Chinese near-homoglyph full-abbr matching task.

After integrating the two models trained in this paper with expert knowledge, we obtained the experimental results of the integrated model, followed by a brief analysis of its results.

From the integrated model results in Table 11, it can be concluded that the weighted-sum fusion method achieves the best results. The results indicate that the proposed SimBert-and-VGG-based Chinese near-homophone full-abbr matching model and the DenseNet-based model that fuses glyph structure and image features both contribute to different dimensions of Chinese full-abbr matching. Because the two models learn different content and dimensions, the two retrieval tasks produce different retrieval boundaries, and expert knowledge further enhances the integrated model's performance. The fusion of models optimizes the retrieval boundary for Chinese full-abbr matching and further improves retrieval performance, giving the integrated model higher accuracy and F1 scores than any single model.

Table 11 Experimental results of integrated models of near-homoglyph full-abbr matching

6.3 Method Analysis

The paper proposes a near-homophone full-abbr matching model that integrates both character and voice features, as well as a near-homoglyph full-abbr matching model based on DenseNet, incorporating character structure and image features. It thoroughly explores the integration of character and statistical feature knowledge into neural network models, demonstrating its feasibility through experiments. The significant advantages are as follows:

(1) Multimodal fusion: Two models are designed in this paper, one fusing character and voice features and the other fusing character structure and image features. The multi-modal fusion strategy fully utilizes information from different modalities, contributing to the overall performance improvement of the full-abbr matching task.

(2) Integration of expert knowledge: Additional expert knowledge is incorporated into the integrated model. Using a weighted-sum fusion strategy optimizes the matching process, further enhancing accuracy and recall. This integration method helps fully leverage the strengths of each model.

(3) Comprehensive consideration of character knowledge: By incorporating Wubi features, stroke features, and structural features of characters, the model demonstrates good performance in the near-homoglyph full-abbr matching task. This comprehensive consideration of character knowledge enhances the reliability and effectiveness of the model in full-abbr matching.

7 Limitations of the Method

This proposed model is primarily designed for the Chinese near-homophone and near-homoglyph full-abbr matching task, which may pose limitations when applied to other languages or tasks. Future research could consider expanding the applicability to a broader range of linguistic contexts and tasks.

Although we have integrated two models to handle character and image information, the modal information remains relatively limited. Future research could explore the incorporation of additional modalities, such as voice, context, etc., to enhance the diversity and generalization of the model.

8 Conclusion

This paper first presents a Chinese near-homophone full-abbr matching model that integrates character and phonetic features, and then introduces a DenseNet-based Chinese near-homoglyph full-abbr matching model that combines character structure and image features. Building upon these two models and leveraging domain expertise, we employed a fusion strategy based on weighted summation to enhance the Chinese full-abbr matching process, improving both accuracy and recall.

In future research, this work can be extended and deepened in the following two aspects:

(1) This paper utilizes two models to extract information from different modalities. In future research, it would be valuable to explore large-scale generative models capable of handling multiple text modalities. This could enable the acquisition of diverse modality knowledge, ultimately enhancing the process of matching Chinese full names and abbreviations.

(2) This paper focuses solely on glyph and pronunciation information for matching full names and abbreviations. In future research, it may be beneficial to explore the integration of knowledge graphs and the use of attribute similarities within these graphs to address cross-lingual Chinese full-abbr matching tasks. Such an approach could help mitigate the influence of language environments on Chinese full-abbr matching.