1 Introduction

Traditional Chinese Medicine (TCM) regards an individual's constitution as the set of stable internal features formed by both innate inheritance and habits acquired over the course of life, including morphology, structure, and physiological and psychological states. The constitution therefore reflects both a person's current physical state and the future trend of his or her health, and it is the basis for diagnosing, treating, and preventing diseases [2, 3]. Consequently, constitution recognition is suitable not only for patients but also for healthy people, who can understand their health status early and then prevent diseases [4]. Constitution identification can be realized through the four diagnostic methods of TCM: looking, listening, asking, and feeling the pulse [9]. However, these methods require rich clinical experience from doctors [1, 6]. Modern technology has therefore been used for auxiliary constitution identification, where the constitution is defined as nine types by the Chinese Society of Traditional Chinese Medicine [10, 52]: Qi-deficiency, Yang-deficiency, Yin-deficiency, Phlegm-dampness, Damp-heat, Blood-stasis, Qi-depression, Special-diathesis, and Gentleness [5]. Early methods are based on constitution questionnaire scales [7, 8]: the individual answers a questionnaire, a score is calculated for each constitution type, and the constitution type is determined from the scores. Because the scale contains many questions, obtaining all answers takes a long time, and methods such as decision trees have been proposed to reduce the number of questions [11]. Another issue is that the results are easily influenced by the individual's subjectivity when answering questions; furthermore, respondents may misunderstand some questions and choose wrong answers. Besides questionnaire scales, other methods exist. For example, fuzzy linguistic variables have been combined with the judgment results of TCM doctors to form new samples for classifying individuals' constitutions [12]. Physical examination indexes, such as blood routine and urine routine indexes, have been used for classifying TCM constitution types [13]. Data mining methods such as association rules have also been applied to TCM constitutions [14]. Besides, face images, voice signals, and pulse signals have been used as well [15, 20, 21].

By comparison, tongue images are more widely used for recognizing individual health and disease status [22] and for classifying tongue shapes and TCM syndromes [23, 25]. Tongue imaging is an effective non-invasive technique for assessing the health status of patients [26]. Changes in the internal organs of the body are usually reflected on the tongue, for example in its texture and color. Therefore, tongue images can help explore the physiological functions and pathological changes of the human body. However, tongue diagnosis by doctors requires face-to-face communication, depends greatly on doctors' experience, and lacks objective and quantitative judgment rules. Recently, automated tongue diagnosis methods have been proposed to solve this problem [1, 27–32]. They usually include tongue image acquisition, tongue image segmentation, and tongue image classification. Acquisition is the first step of computerized tongue diagnosis, as image quality has an important impact on labeling and analyzing each tongue image [33, 34]. Segmentation aims to filter out interference from background information and thereby improve the subsequent classification performance [28, 29]. Tongue image classification can then be treated as a common image classification problem. Early tongue constitution recognition methods used traditional image processing to extract features such as color, texture, and shape [10, 16, 30, 35]. For example, color features have been extracted in the HSV [36], Lab [41], and HSI [38] color spaces respectively, texture features have also been extracted [39], and tongue features have been combined with body features [40]. After feature extraction, these methods use traditional machine learning to perform the constitution classification [10]. However, they rely on manually designed features [37]; due to the limits of human experience and professional knowledge, the designed features may be incomplete, interrelated, and redundant, easily resulting in poor performance. In recent years, deep neural networks have been used to extract features of tongue images for constitution recognition [31, 32]. For example, a convolutional neural network, the gray level co-occurrence matrix, the minimum bounding rectangle, and edge curves have been combined to extract tongue image features, which are then classified into one of the constitution types [24]. A hybrid deep learning method has also been applied to recognize the constitution from tongue images: it uses a lightweight convolutional network for initial tongue detection and another calibration network to find the refined area, so as to better recognize the constitution type [42]. Furthermore, a method that varies with the complexity of samples has been proposed to improve the accuracy of constitution classification [43]. To overcome the problems of class imbalance and small samples, a prototype network [45] and a method based on zero-shot learning [44] have been proposed.

Although these methods have made progress, they still have difficulty extracting discriminative features. This is because they do not extract multi-level features, so the extracted features lack diversity. They also fail to adaptively fuse features from different levels, resulting in incomplete features. To solve these issues, this paper proposes a novel tongue constitution recognition method based on the reshaped wavelet attention (RWA). The main contributions are as follows:

(1) The wavelet attention is applied to obtain multi-scale features through the discrete wavelet transform, and the attention mechanism is then used to weight them.

(2) The reshaping mechanism is proposed to construct a high-dimensional space composed of features from different levels, in which association rules are mined and then used to fuse the features efficiently.

(3) The wavelet attention and reshaping mechanism are integrated into a convolutional neural network to create more accurate attributes, by which tongue constitution recognition can be performed with higher performance and better interpretability.

Section 2 introduces the related work. Section 3 presents the domain knowledge. Section 4 introduces the wavelet attention, while the reshape fusion is presented in Sect. 5. The new method is proposed in Sect. 6. Experimental results are presented in Sect. 7, and Sect. 8 presents conclusions.

2 Related Work

As our method performs constitution recognition via tongue images, related constitution recognition methods are compared first. As our method also builds on wavelet attention, wavelet attention methods are analyzed as well.

2.1 Questionnaire Methods

The questionnaire methods are the most widely used for constitution recognition [7, 8, 17]; they follow the judgment criterion of the TCM constitution [10, 52]. To reduce the number of questions in the questionnaire, decision trees have been applied [11]. In particular, the questionnaire can be formed dynamically according to the individual's health state [17]. The questionnaire scale can also be combined with the tongue image to further improve performance [46]. These methods are simple to implement and accurate if all questions are answered correctly. However, the results are easily influenced by individual subjective attitudes, and the examinee may misunderstand some questions, so the answers may be wrong.

2.2 Traditional Machine Learning Methods

Traditional machine learning methods have been applied to constitution recognition based on tongue images [10]. They differ in the features used, including color, texture, and shape features [10, 16, 35]. Some methods use both tongue features and body features [40], including color features in the HSV [36], Lab [41], and HSI [38] color spaces, while texture features are also used [39]. Fuzzy linguistic variables for tongue images have been combined with the judgment results of several TCM doctors to form a database for classifying individuals' constitutions [12]. Besides, association rules within a cloud framework have been mined to classify TCM constitutions [14]. Beyond tongue images, face images, voice signals, pulse signals, and physical examination indexes such as blood routine and urine routine indexes have also been applied for constitution recognition [13, 15, 18, 19]. After feature extraction, traditional machine learning methods perform the constitution classification. However, these methods rely on manually designed features [37], which may be incomplete, interrelated, and redundant due to the limits of human experience and professional knowledge, easily resulting in poor performance. Our method overcomes these shortcomings by learning features automatically from the tongue images.

2.3 Deep Neural Network Methods

Deep neural networks have been applied to constitution recognition [47]. For example, the Inception-v3 model has been used to classify the nine constitution types [40] with 208 tongue images for training. Another method uses a convolutional neural network to extract features of tongue images, with three categories and 483 tongue images for training [24]. A larger database has also been used for constitution recognition, where tongue detection is performed as well [47]. A better method is based on the complexity perception of tongue images [43]: it recognizes the constitution of a test tongue image by selecting the classifier with the suitable complexity, an idea validated in a later method [48]. Besides tongue images, voice signals and pulse signals have been fed to convolutional neural networks for constitution recognition [47]. Furthermore, face images have been used for constitution classification through multi-level and multi-scale feature aggregation within a convolutional neural network [49]. Because tongue images are not easy to collect, the resulting training data are often too small for deep learning methods [46], so zero-shot learning methods can be considered [50, 51, 72]. In addition, new deep neural network architectures that fuse and utilize both local and global information simultaneously can be adapted to our problem [73], and novel methods have been proposed to deal efficiently with uncertainty and concept drift [74, 75]. However, these methods are not directly suitable for constitution recognition. One method uses domain knowledge and latent attributes to recognize the constitution [44]. Unlike our method, none of these methods extracts multi-scale and multi-level features simultaneously through the discrete wavelet transform.

2.4 Reshaped Wavelet Attention

Recently, the wavelet transform has been applied to design new neural networks with improved performance. For example, an attention-based wavelet convolutional neural network has been proposed for EEG classification [55]. It first uses multi-scale wavelet analysis to decompose the input EEG into many components with different frequency bands, which are then fed into a network with an attention mechanism to extract features for classification. Another method, selective wavelet attention, learns a series of wavelet attention maps to guide the separation of rain and background information in both the spatial and frequency domains [53]. A wavelet-attention block has also been designed that applies attention in the high-frequency domain [54]. However, unlike our method, these methods are not designed for tongue constitution recognition, and they do not exploit domain knowledge together with the reshaped wavelet attention.

3 Domain Knowledge

In clinical tongue diagnosis, a doctor first observes the patient's tongue image, noting features such as tongue color, tongue shape, and tongue quality, and then determines its attributes. Finally, the doctor determines the patient's constitution according to the relationship between the constitution type and the attributes of the tongue image. According to the Chinese national standard [52], the attributes of the tongue image differ for each constitution type [44]; they are summarized in Table 1. Because the Special-diathesis constitution is too complicated to be described accurately, we can only use “others” as its semantic attribute. These attributes are grouped semantically into tongue color, tongue body, and tongue nature. In this way, each constitution type can be represented by a fifteen-dimensional semantic vector \((A_1,\ldots , A_{15})\) whose attributes are encoded in one-hot mode: 1 indicates that the constitution has the corresponding attribute and 0 indicates that it does not. To facilitate the following description, we define the relationship between the attributes of the tongue image and the corresponding constitutions as the attribute matrix \(w_{\text{attribute}}\in {\mathbb{R}}^{15\times 9},\) where each constitution type corresponds to a specific attribute vector. As prior domain knowledge, this matrix helps the neural network model predict the constitution type more accurately.

Table 1 Attributes of tongue image for each constitution type
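To make the one-hot encoding concrete, a minimal sketch of \(w_{\text{attribute}}\) as a \(15\times 9\) zero-one array follows. The individual entries set here are illustrative placeholders only; the true assignments are those of Table 1 and the national standard [52].

```python
import torch

N_ATTRIBUTES, N_TYPES = 15, 9  # attributes A_1..A_15, nine constitution types

# w_attribute[i, j] = 1 iff constitution type j has attribute i. The single
# entry set below is an illustrative placeholder, NOT a value from Table 1.
w_attribute = torch.zeros(N_ATTRIBUTES, N_TYPES)
w_attribute[0, 0] = 1.0  # placeholder: attribute A_1 present for type 0

def attribute_vector(type_index: int) -> torch.Tensor:
    """Return the 15-dimensional one-hot attribute vector of one type."""
    return w_attribute[:, type_index]
```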

4 Wavelet Attention

To mine multi-scale features, the wavelet attention framework shown in Fig. 1 is proposed. Given input features of dimension \(H\times W\times C,\) four components of dimension \(H/2\times W/2\times C\) are obtained after the two-dimensional discrete wavelet transform (DWT) is performed. Each component is used to compute a corresponding attention mask \(M_s\in {\mathbb{R}}^{\frac{H}{2}\times \frac{W}{2}}\) through spatial attention and position normalization; the components are then weighted by their masks and concatenated to produce the output features \({X}'\in {\mathbb{R}}^{\frac{H}{2}\times \frac{W}{2}\times 4C}.\)

Fig. 1 Wavelet attention framework to mine multi-scale features

Features from the two-dimensional discrete wavelet transform: The DWT decomposes the input data into components with different frequencies whose spatial dimensions are half of the original ones. It can therefore replace down-sampling operations such as max pooling and mean pooling in convolutional neural networks. The decomposition used by the proposed wavelet attention can be described as follows:

$$\begin{aligned} X {\mathop {\longrightarrow }\limits ^{{\text{DWT}}}} \{X_{LL},X_{LH},X_{HL},X_{HH}\} \in {\mathbb{R}}^{ \frac{H}{2} \times \frac{W}{2} \times C} \end{aligned}$$
(1)

where \(X \in {\mathbb{R}}^{H\times W \times C}\) denotes the input features, \(X_{LL}\) denotes the low-frequency component that retains the main information of the original features, and \(X_{HH}\) denotes the high-frequency component that often contains noise or texture information. \(X_{LH}\) and \(X_{HL}\) denote components with mixed frequencies. As geometric and texture information is needed for the tongue constitution recognition task, the high-frequency components should also be considered. Consequently, the wavelet attention module adaptively aggregates the four components to provide features that are as comprehensive as possible for tongue constitution recognition.
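The paper does not name the wavelet family, so the following PyTorch sketch implements Eq. (1) with the Haar basis as an assumption; the helper name `haar_dwt2d` is ours.

```python
import torch
import torch.nn.functional as F

def haar_dwt2d(x: torch.Tensor):
    """One-level 2-D DWT of a (N, C, H, W) tensor with the Haar basis.

    Returns the four sub-bands X_LL, X_LH, X_HL, X_HH of Eq. (1),
    each of shape (N, C, H/2, W/2); H and W must be even.
    """
    ll = x.new_tensor([[0.5, 0.5], [0.5, 0.5]])    # low-low: local average
    lh = x.new_tensor([[0.5, 0.5], [-0.5, -0.5]])  # vertical detail
    hl = x.new_tensor([[0.5, -0.5], [0.5, -0.5]])  # horizontal detail
    hh = x.new_tensor([[0.5, -0.5], [-0.5, 0.5]])  # diagonal detail
    filt = torch.stack([ll, lh, hl, hh]).unsqueeze(1)        # (4, 1, 2, 2)
    n, c, h, w = x.shape
    # Apply all four analysis filters to every channel with stride 2.
    y = F.conv2d(x.reshape(n * c, 1, h, w), filt, stride=2)  # (N*C, 4, H/2, W/2)
    y = y.reshape(n, c, 4, h // 2, w // 2)
    return y[:, :, 0], y[:, :, 1], y[:, :, 2], y[:, :, 3]
```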

Feature aggregation based on spatial attention and position normalization: To better aggregate the four kinds of features produced by the DWT, spatial attention (SA) and position normalization (PN) operations are proposed. Spatial attention focuses on important positions of the features, emphasizing important features and suppressing unnecessary ones. Because the tongue constitution recognition task is closely related to the spatial features of the tongue image, the wavelet attention adopts a spatial attention mechanism with four different attention masks. Furthermore, to account for the different contributions of different frequencies to the aggregated features, position normalization is proposed to normalize the weight of each spatial position across the attention masks.

Given \(\{X_{LL},X_{LH},X_{HL},X_{HH}\}\) as inputs, the wavelet attention module infers a spatial attention mask for each component, normalizes it over the components at each spatial position, adjusts the importance of each component accordingly, and then aggregates the reweighted features. As shown in Fig. 1, this process can be summarized as follows:

$$\begin{aligned} \{X_{LL},X_{LH},X_{HL},X_{HH}\}{\mathop {\longrightarrow }\limits ^{{\text{SA}}}} \{M_S^{LL},M_S^{LH},M_S^{HL},M_S^{HH}\} \end{aligned}$$
(2)
$$\begin{aligned} \{M_S^{LL},M_S^{LH},M_S^{HL},M_S^{HH}\}{\mathop {\longrightarrow }\limits ^{{\text{PN}}}} \{ {\bar{M}}_S^{LL},{\bar{M}}_S^{LH},{\bar{M}}_S^{HL},{\bar{M}}_S^{HH}\} \end{aligned}$$
(3)
$$\begin{aligned} {X}'_{AB} ={\bar{M}}_S^{AB} \otimes X_{AB},\quad A \in \{L,H\},\ B \in \{L,H\} \end{aligned}$$
(4)
$$\begin{aligned} {X}'=[{X}'_{LL},\ {X}'_{LH},\ {X}'_{HL},\ {X}'_{HH}] \end{aligned}$$
(5)

where \({\bar{M}}_S^{AB}\) is the normalized attention mask and \({X}'\) is the aggregated output features. As shown in the upper right corner of Fig. 1, the spatial attention mask \(M_S\in {\mathbb{R}}^{H\times W}\) is generated from the input \(F\) by:

$$\begin{aligned} M_S=f_{3\times 3}(f_{\text{AVG}}(F)) \end{aligned}$$
(6)

where \(f_{\text{AVG}}\) denotes the mean pooling operation and \(f_{3\times 3}\) denotes the convolution operation with a \(3\times 3\) kernel. Because the four attention masks are computed independently of one another, we propose a position normalization operation to learn the relationship between them, dynamically adjusting the weight of each attention mask and learning their complementarity. It is illustrated in the lower right corner of the figure. Given the input attention mask \(M_S,\) its weight at each spatial coordinate is computed as follows:

$$\begin{aligned} {\bar{M}}_S^{AB(h,w)}=\frac{{\text{e}}^{M_S^{AB(h,w)}}}{\sum _{T\in \{LL,LH,HL,HH\}} {\text{e}}^{M_S^{T(h,w)}}} \end{aligned}$$
(7)

where \(M_S^{AB(h,w)}\) denotes the weight of the input attention mask at spatial coordinates \((h,w),\) and \({\bar{M}}_S^{AB(h,w)}\) denotes the weight of the output attention mask at the same coordinates. After the attention masks are generated and position-normalized, the features of the four components are aggregated by a simple concatenation, which superimposes the input features and increases their diversity. Concatenation is used because the features obtained by the two-dimensional DWT have different frequencies and obviously do not follow the same distribution.
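Putting Eqs. (2)–(7) together, a minimal PyTorch sketch of the module follows. It reuses the `haar_dwt2d` helper above; the paper does not state whether the four masks share one \(3\times 3\) convolution, so separate convolutions are assumed here.

```python
import torch
import torch.nn as nn

class WaveletAttention(nn.Module):
    """Sketch of the wavelet attention module, Eqs. (2)-(7) and Fig. 1."""

    def __init__(self):
        super().__init__()
        # One 3x3 conv per sub-band implements f_3x3 of Eq. (6); separate
        # (unshared) convolutions are an assumption.
        self.sa = nn.ModuleList([nn.Conv2d(1, 1, 3, padding=1) for _ in range(4)])

    def forward(self, x):                          # x: (N, C, H, W)
        bands = haar_dwt2d(x)                      # four (N, C, H/2, W/2) sub-bands
        # Eq. (6): channel-wise mean pooling f_AVG, then a 3x3 convolution.
        masks = [conv(b.mean(dim=1, keepdim=True))
                 for conv, b in zip(self.sa, bands)]
        # Eq. (7), position normalization: softmax over the four masks at
        # every spatial location so their weights sum to one.
        m = torch.softmax(torch.cat(masks, dim=1), dim=1)  # (N, 4, H/2, W/2)
        # Eqs. (4)-(5): weight each sub-band by its mask and concatenate.
        out = [b * m[:, i:i + 1] for i, b in enumerate(bands)]
        return torch.cat(out, dim=1)               # (N, 4C, H/2, W/2)
```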

5 Reshape Fusion

Wavelet attention improves the ability of a convolutional neural network to extract multi-scale features from the tongue image, but it cannot extract multi-level features efficiently, because scale and level are different concepts. The scale refers to the grain size or spatial resolution, while the level refers to the degree of semantic abstraction. The wavelet attention can automatically select the optimal spatial resolution, but it cannot easily change the semantic level of the features, as it works only on features of a given semantic level. In a deep neural network, features at shallow levels tend to represent details of geometry and texture, while deeper features tend to represent more abstract semantic information. When fusing features of different levels, their different contributions should be considered. Thus an innovative reshape fusion method is proposed, which dynamically integrates multi-level features from different network layers, enhancing the importance of key features and suppressing irrelevant ones. In more detail, it first obtains one-dimensional aggregated features by concatenating features from multiple levels, rearranges them into a three-dimensional space using the reshaping operation, and learns the relationships between the reshaped features. Subsequently, it uses the inverse reshaping operation to obtain a one-dimensional relationship mask from the three-dimensional relationship mask. Finally, the contributions of the different features are weighted according to the one-dimensional relationship mask.

Fig. 2 Reshape and inverse reshape operations for feature fusion

Reshape operation: As shown in Fig. 2, the features \(F_C \in {\mathbb{R}}^{C}\) are first formed by the concatenation operation:

$$\begin{aligned} F_C=[F_1\oplus F_2\oplus F_3\oplus F_4] \in {\mathbb{R}}^{C} \end{aligned}$$
(8)
$$\begin{aligned} C=C_1+ C_2+ C_3+ C_4 \end{aligned}$$
(9)

where \(F_i \in {\mathbb{R}}^{C_i}\) denotes the input one-dimensional features, \(C_i\) is the number of features, and \(\oplus\) denotes the concatenation operation. Next, the reshape operation rearranges the features \(F_C \in {\mathbb{R}}^{C}\) along the spatial dimensions to obtain the reshaped features \(F_R \in {\mathbb{R}}^{H\times W \times K},\) where \(H, W, K\) are the height, width, and number of channels, and \(C=H\times W\times K:\)

$$\begin{aligned} F_R&=\phi _R(F_C)=\phi _R([u_1,u_2,\ldots ,u_C])=[f_1,f_2,\ldots ,f_K] \\ &=\left[ \begin{array}{c} \left[ \begin{array}{ccc} u_1&{}\ldots &{}u_W\\ \ldots &{}\ldots &{}\ldots \\ u_{(H-1)\times W+1} &{}\ldots &{} u_{H\times W} \end{array}\right] \\ \left[ \begin{array}{ccc} u_{H\times W+1}&{}\ldots &{}u_{(H+1)\times W}\\ \ldots &{}\ldots &{}\ldots \\ u_{(2H-1)\times W+1} &{}\ldots &{}u_{2H\times W} \end{array}\right] \\ \vdots \\ \left[ \begin{array}{ccc} u_{C-H\times W+1}&{}\ldots &{}u_{C-(H-1)\times W}\\ \ldots &{}\ldots &{}\ldots \\ u_{C-W+1}&{}\ldots &{} u_{C} \end{array}\right] \end{array}\right] \in {\mathbb{R}}^{H\times W\times K} \end{aligned}$$
(10)

where \(\phi _R\) denotes the reshaping function, \(u_i\) denotes each feature component, and \(f_i\) denotes a feature map. The reshape operation introduces no parameters to be learned, so it does not increase the parameter count.

Relationship interaction learning: Different features contribute differently to the task. As shown in Fig. 2, relationship interaction learning is proposed to learn the relationship weights between different features. It models the 3D relationship mask \(M_{3D}\in {\mathbb{R}}^{H\times W\times K}\) over the reshaped features \(F_R\) as follows:

$$\begin{aligned} M_{3D}&=F_R\circledast \theta _{3\times 3} =[f_1,f_2,\ldots ,f_K]\circledast \theta _{3\times 3} \\ &= \left[ \begin{array}{c} \left[ \begin{array}{ccc} v_1&{}\ldots &{}v_W\\ \ldots &{}\ldots &{}\ldots \\ v_{(H-1)\times W+1} &{}\ldots &{} v_{H\times W} \end{array}\right] \\ \left[ \begin{array}{ccc} v_{H\times W+1}&{}\ldots &{}v_{(H+1)\times W}\\ \ldots &{}\ldots &{}\ldots \\ v_{(2H-1)\times W+1} &{}\ldots &{}v_{2H\times W} \end{array}\right] \\ \vdots \\ \left[ \begin{array}{ccc} v_{C-H\times W+1}&{}\ldots &{}v_{C-(H-1)\times W}\\ \ldots &{}\ldots &{}\ldots \\ v_{C-W+1}&{}\ldots &{} v_{C} \end{array}\right] \end{array}\right] \in {\mathbb{R}}^{H\times W\times K} \end{aligned}$$
(11)

where \(\circledast\) denotes the convolution operation and \(\theta _{3\times 3}\) denotes a convolution kernel of size \(3\times 3.\) \(M_{3D}\) is the learned three-dimensional relationship mask, and \(v_i\) is the adaptive weight of the element \(u_i\) in \(F_R.\) When extracting the relationships between features, relationship interaction learning uses the convolution operation rather than a fully connected operation, which reduces the number of parameters: for input features of dimension \(C=H\times W\times K,\) the two options require \(K\times 3\times 3\times K=9K^2\) and \((HWK)\times (HWK)=H^2W^2K^2\) parameters respectively, so our method clearly uses fewer. In addition, a fully connected operation cannot encode the spatial neighborhood relations between features, whereas the convolution operation extracts relations from multiple different spatial neighborhoods. The internal relationships of multiple neighborhoods are thus considered simultaneously, yielding a finer relationship between features.
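As a concrete check with assumed sizes (the paper does not fix \(H, W, K\)): for \(H=W=8\) and \(K=15,\) so that \(C=960,\) the convolution needs \(9K^2=2025\) parameters, while a fully connected layer would need \(H^2W^2K^2=921{,}600,\) more than 450 times as many.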

Inverse reshape operation: Because the input features \(F_C\) lie in a one-dimensional space, the three-dimensional relationship mask \(M_{3D}\) cannot be multiplied with them element by element directly to adjust the contribution weight of each feature. Thus an inverse reshape (IR) operation \(\phi _{IR}\) is proposed:

$$\begin{aligned} M_{1D}=\phi _{IR}(M_{3D}) =[v_1,v_2,\ldots ,v_C]\in {\mathbb{R}}^C. \end{aligned}$$
(12)

Subsequently, the input features can be scaled by multiplying with the one-dimensional relationship mask as follows:

$$\begin{aligned} {F}'=F_C\otimes \sigma (M_{1D})\in {\mathbb{R}}^C \end{aligned}$$
(13)

where \(\otimes\) denotes the element-by-element multiplication and \(\sigma\) is the sigmoid function.
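A compact PyTorch sketch of the whole reshape fusion module, Eqs. (8)–(13), follows. The factorization \(C=H\times W\times K\) is left open by the paper, so the constructor arguments `h`, `w`, `k` are user-chosen assumptions.

```python
import torch
import torch.nn as nn

class ReshapeFusion(nn.Module):
    """Sketch of the reshape fusion of Eqs. (8)-(13) and Fig. 2."""

    def __init__(self, h: int, w: int, k: int):
        super().__init__()
        self.h, self.w, self.k = h, w, k
        # 3x3 convolution over the reshaped feature grid (Eq. (11));
        # needs 9*K^2 weights versus H^2*W^2*K^2 for a fully connected layer.
        self.rel = nn.Conv2d(k, k, kernel_size=3, padding=1)

    def forward(self, feats):             # feats: list of (N, C_i) tensors
        f_c = torch.cat(feats, dim=1)     # Eq. (8): (N, C) with C = sum C_i
        f_r = f_c.view(-1, self.k, self.h, self.w)  # reshape, Eq. (10)
        m_3d = self.rel(f_r)              # 3-D relationship mask, Eq. (11)
        m_1d = m_3d.reshape(f_c.shape)    # inverse reshape, Eq. (12)
        return f_c * torch.sigmoid(m_1d)  # Eq. (13): reweight the features
```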

6 Proposed Constitution Recognition Method

The constitution recognition method based on the reshaped wavelet attention (RWA) over tongue images is proposed; its framework is shown in Fig. 3. It is an end-to-end learning framework that takes the tongue image as input and the predicted constitution type as output. RWA integrates the wavelet attention (WA) and reshape fusion (RF) into a given convolutional neural network such as ResNet18. For the input tongue image, our method uses multi-stage convolution layers to extract features, wavelet attention to augment them, and reshape fusion to fuse them automatically. Subsequently, RWA maps the obtained tongue image features into the latent semantic attribute space to obtain the predicted attribute vector. Finally, the distance between the predicted attribute vector and the true attribute vector of each constitution is calculated, and the constitution type with the minimum distance is output. RWA simulates the diagnosis process of a doctor, which greatly improves the performance of constitution recognition and makes it stable, accurate, rapid, and interpretable. The semantics of the attributes for each constitution type are fixed and easily understood by doctors. As illustrated in Fig. 3, when our method predicts the constitution type of a tongue image, it also predicts its attributes, which provide the interpretation. On the other hand, the features extracted in different layers of the backbone network are of different semantic levels; the association knowledge among them is mined and hierarchized by the reshape fusion module illustrated in Fig. 2.

Fig. 3 Framework of the tongue constitution recognition based on the wavelet attention and reshape fusion

In this framework, RWA takes ResNet18, which has five stages, as the backbone network [59]. Except for the first, every stage is composed of residual blocks, and the stages have different output feature scales. Let \(\phi _{i}\) denote the residual block of the \(i\)th stage. Given the input tongue image \(X \in {\mathbb{R}}^{H\times W\times C},\) the features of all stages, denoted \(\{X_1,\ldots ,X_5\},\) are obtained through the successive residual blocks. These features are decomposed, weighted, and aggregated by the wavelet attention and then fused to obtain the multi-scale features \(\{F_1,\ldots ,F_4\}.\) Subsequently, these are reshaped into the final tongue image features \({F}' \in {\mathbb{R}}^{C}\) through the reshape fusion operation. The above process can be described as:

$$\begin{aligned} X_1=\phi _{1}(X) \end{aligned}$$
(14)
$$\begin{aligned} X_i=\phi _{i}(X_{i-1}),\quad i\in \{2,3,4,5\} \end{aligned}$$
(15)
$$\begin{aligned} {X}'_i=f_{1\times 1}(\phi _{\text{WA}}(X_i)),\quad i\in \{2,3,4\} \end{aligned}$$
(16)
$$\begin{aligned} F_1=f_{\text{GAP}}(X_2) \end{aligned}$$
(17)
$$\begin{aligned} F_i=f_{\text{GAP}}({X}'_i+X_{i+1}),\quad i\in \{2,3,4\} \end{aligned}$$
(18)
$$\begin{aligned} {F}'=\phi _{\text{RF}}(F_1,F_2,F_3,F_4) \end{aligned}$$
(19)

where \(f_{1 \times 1}\) represents the \(1\times 1\) convolution operation, \(\phi _{\text{WA}}\) the wavelet attention, \(f_{\text{GAP}}\) the global average pooling, and \(\phi _{\text{RF}}\) the reshape fusion operation. Subsequently, the predicted attribute vector \({\hat{F}}_{\text{Attribute}} \in {\mathbb{R}}^{1\times 15}\) is obtained by a multi-layer perceptron (MLP) with one hidden layer as follows:

$$\begin{aligned} {\hat{F}}_{\text{Attribute}} =\text{MLP}({F}')={F}'\times W_A \end{aligned}$$
(20)

where \(W_{A} \in {\mathbb{R}}^{C\times 15}\) holds the learnable parameters of the MLP. We then calculate the similarity between the predicted attribute vector and the true attribute vector of each constitution and output the predicted probabilities \({\hat{Y}} \in {\mathbb{R}}^{1\times 9}\) according to these similarities. To speed up recognition, the calculation is simplified as follows:

$$\begin{aligned} {\hat{Y}}= \text{softmax}({\hat{F}}_{\text{Attribute}}\times W_{\text{Attribute}}) \end{aligned}$$
(21)
$$\begin{aligned}= \text{softmax}([{\hat{Y}}_1,{\hat{Y}}_2,\ldots ,{\hat{Y}}_9]) \end{aligned}$$
(22)
$$\begin{aligned} {\hat{Y}}_i= \frac{{\text{e}}^{{\hat{y}}_i}}{\sum _{j=1}^9{\text{e}}^{{\hat{y}}_j}} \end{aligned}$$
(23)
$$\begin{aligned} {\hat{y}}_i= {\hat{F}}_{\text{Attribute}}\odot (W^i_{\text{Attribute}})^{\text{T}}. \end{aligned}$$
(24)
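As a concrete reading of Eqs. (21)–(24), the sketch below scores each constitution by the inner product between the predicted attribute vector and that type's true attribute vector, then applies softmax over the nine scores; the function name is ours.

```python
import torch

def predict_constitution(f_attr: torch.Tensor, w_attribute: torch.Tensor):
    """Sketch of Eqs. (21)-(24).

    f_attr      : (N, 15) predicted attribute vectors
    w_attribute : (15, 9) attribute matrix (one column per type, Table 1)
    """
    logits = f_attr @ w_attribute         # (N, 9): the \hat{y}_i of Eq. (24)
    probs = torch.softmax(logits, dim=1)  # Eq. (23)
    return probs, probs.argmax(dim=1)     # probabilities and predicted type
```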

The tongue constitution recognition task can be regarded as an image classification task, so we define the constitution classification loss as the cross-entropy loss:

$$\begin{aligned} {\mathcal {L}}_{\text{cls}}=-Y \cdot \log ({\hat{Y}})=-\sum _{i=1}^9Y_i\cdot \log ({\hat{Y}}_i) \end{aligned}$$
(25)

where \(Y\in {\mathbb{R}}^{1\times 9}\) represents the true constitution type vector of the input tongue image and \(Y_i\) the probability of constitution type \(i\): \(Y_i=1\) if \(i\) is the true constitution type of the input tongue image and \(Y_i=0\) otherwise.

To constrain the predicted attributes \({\hat{F}}_{\text{Attribute}}\in {\mathbb{R}}^{1\times 15}\) to lie closer to the true attributes of the input tongue image, we further introduce an attribute embedding loss that shortens the distance between the predicted and true attributes:

$$\begin{aligned} {\mathcal {L}}_{\text{AE}}=\sum _{j:\,Y_j=1} \Vert {\hat{F}}_{\text{Attribute}}-(W^j_{\text{Attribute}})^{\text{T}}\Vert ^2 \end{aligned}$$
(26)

where \(W^j_{\text{Attribute}}\) represents the true attribute vector of the given constitution type. The total loss is thus defined as

$$\begin{aligned} {\mathcal {L}}_{\text{total}}={\mathcal {L}}_{\text{cls}}+{\mathcal {L}}_{\text{AE}}. \end{aligned}$$
(27)
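A minimal sketch of the training objective of Eqs. (25)–(27) follows, assuming integer class labels; averaging the attribute embedding loss over the batch is an implementation choice of ours, not stated in the paper.

```python
import torch
import torch.nn.functional as F

def rwa_loss(f_attr, w_attribute, y_true):
    """Total loss of Eq. (27): cross entropy (Eq. (25)) plus the attribute
    embedding loss (Eq. (26)).

    f_attr      : (N, 15) predicted attribute vectors
    w_attribute : (15, 9) attribute matrix
    y_true      : (N,) integer constitution labels in [0, 9)
    """
    logits = f_attr @ w_attribute            # (N, 9) similarities, Eq. (21)
    l_cls = F.cross_entropy(logits, y_true)  # Eq. (25); softmax is included
    target = w_attribute.t()[y_true]         # (N, 15) true attribute vectors
    l_ae = ((f_attr - target) ** 2).sum(dim=1).mean()  # Eq. (26), batch mean
    return l_cls + l_ae
```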

7 Experimental Results

Extensive experiments are carried out on tongue images for the constitution recognition task to evaluate the proposed RWA in terms of effectiveness, efficiency, and interpretability.

7.1 Databases

As a large number of tongue images with labelled constitution types is not available, we construct a tongue image constitution database composed of tongue images and their constitution types. The images were collected with cameras from patients in hospital outpatient departments. In the collected images, the effective tongue area occupies only a small part, and the information outside the tongue is interference for tongue diagnosis. To extract the tongue region, we use an object detection model [42] to separate the effective tongue image from the background, so that only a small part of the background information remains.

Fig. 4 Distribution of tongue constitution types in the tongue image constitution database, which contains 46,753 images of nine types

As shown in Fig. 4, the constructed database contains 46,753 images of nine types, of which 80% is used for training and validation and the remainder for testing. Specifically, the training and validation sets together contain 37,398 samples, while the test set has 9355 samples.

7.2 Implementation Details

For a fair comparison, we use the deep learning framework PyTorch and the model library timm to implement all compared methods. As methods such as VIT-small [57], VIT-small-pretrained [57], VIT-base [57], VIT-base-pretrained [57], Shift-S [58], and Shift-S-pretrained [58] are difficult to converge under other hyper-parameters, they follow the optimal implementation details of the original papers [57, 58]. Our method RWA and the other compared methods adopt the same hyper-parameters to ensure fairness as much as possible. In training, stochastic gradient descent with a weight decay of \(1\text{e}{-}4,\) a momentum of 0.9, and a batch size of 64 is used to learn the parameters. The model is trained for 300 epochs in total. The learning rate starts from 0.1 and decays to \(\frac{1}{10}\) of its last value every 50 epochs. During training, only general data augmentation is used: the input image is scaled to \(224 \times 224,\) 4 pixels are padded around the image edges symmetrically, an image of size \(224 \times 224\) is cropped randomly, and the image is flipped with probability 0.5. During testing, the image is scaled to \(224 \times 224.\) In addition, the training and test images are normalized by subtracting the mean and dividing by the standard deviation.
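The schedule and augmentation above can be expressed in PyTorch roughly as follows; the normalization statistics and the `make_training_setup` helper are assumptions, since the paper does not list the dataset mean and standard deviation.

```python
import torchvision.transforms as T
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

# Placeholder statistics: the paper normalizes by the dataset mean and
# standard deviation but does not list their values.
MEAN, STD = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]

train_tf = T.Compose([
    T.Resize((224, 224)),                # scale the input image
    T.Pad(4, padding_mode="symmetric"),  # pad 4 pixels around the edges
    T.RandomCrop(224),                   # random 224x224 crop
    T.RandomHorizontalFlip(p=0.5),       # flip with probability 0.5
    T.ToTensor(),
    T.Normalize(MEAN, STD),
])
test_tf = T.Compose([T.Resize((224, 224)), T.ToTensor(), T.Normalize(MEAN, STD)])

def make_training_setup(model):
    """SGD with momentum 0.9, weight decay 1e-4; lr 0.1, /10 every 50 epochs."""
    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    scheduler = StepLR(optimizer, step_size=50, gamma=0.1)
    return optimizer, scheduler
```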

Constitution recognition is a single-label classification task in which each tongue image corresponds to one true constitution label. Thus the performance of each model in the experiments is evaluated by the accuracy rate.
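For completeness, the metric amounts to the following one-liner (a sketch; names are ours):

```python
import torch

def accuracy(pred_types: torch.Tensor, true_types: torch.Tensor) -> float:
    """Fraction of test images whose predicted type equals the true label."""
    return (pred_types == true_types).float().mean().item()
```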

7.3 Ablation Studies

As our method RWA comprises components such as the tongue attributes, wavelet attention, and reshape fusion, the effectiveness and necessity of each component should be verified by ablation experiments. Table 2 shows that the constitution recognition accuracy of the proposed method under every combination of components exceeds that of the baseline ResNet18 [59]. When only the tongue attributes are used, the accuracy is already 1.82% higher than the baseline, indicating that the introduction of tongue attributes is very effective for tongue constitution recognition; it also indicates that learning from the diagnostic reasoning of doctors can effectively improve the accuracy of constitution recognition. When the wavelet attention is further added, RWA surpasses the baseline by nearly 2.73%, which is nearly 0.91% higher than RWA with tongue attributes only, showing that the wavelet attention is effective. When the tongue attributes are combined with the reshape fusion, RWA exceeds the baseline by 2.45%, better than RWA with tongue attributes only by about 0.64%, showing that the reshape fusion can effectively fuse features from different levels. With all components combined, RWA achieves an accuracy improvement of about 3.18% over the baseline, indicating the effectiveness of the wavelet attention and reshape fusion in the constitution recognition task; the two components are complementary and yield consistent performance improvements. In summary, the combination of tongue attributes, wavelet attention, and reshape fusion accounts for the progress of the proposed method in the constitution recognition task.

Table 2 Performance of proposed method with different combinations, where \(\surd\) indicates that the component is combined and ResNet18 is the backbone network (Bold: Best Results)

7.4 Compared with Constitution Recognition Methods

Experiments are conducted to verify that the proposed method is superior to the other tongue constitution recognition methods. Table 3 shows that the proposed method surpasses all compared constitution recognition methods, obtaining the best result with an accuracy of 53.95%. At the same time, it has the smallest parameter size, only 11.88M, showing its efficiency.

By comparison, VGG-Tongue, GoogleNet-Tongue, and ResNet-Tongue achieve poor performance [42], as they do not adjust the common network structure to the constitution recognition task. GAZ-mResNet18-EA and GAZ-mResNet18-EADLF introduce zero-shot learning for constitution recognition [44] and therefore obtain better performance, as they can learn the mapping between tongue image attributes and the constitution type. However, because they contain neither the wavelet attention nor the reshape fusion, they cannot obtain the best performance on complex constitution recognition tasks, and their parameter sizes are clearly larger than that of our method.

Table 3 Performance of proposed method and recent tongue constitution recognition methods, where ResNet18 is the backbone network (Bold: Best Results)

7.5 Compared with Recent Attention Models

As attention models have shown excellent performance in computer vision, they are compared with our method on the constitution recognition task. Different attention mechanisms are considered, namely channel attention, spatial attention, and mixed attention. Self-attention has recently shown powerful performance and great potential in image classification tasks [61], so it is also compared. All compared methods are implemented, and their experimental results are shown in Table 4, where the suffix “pretrained” indicates that the model used additional large-scale training data for pre-training.

Table 4 Performance of proposed method and recent attention methods on the tongue constitution recognition, where ResNet18 is the backbone network (Bold: Best Results)

Table 4 shows that RWA outperforms all compared attention models by a large margin. Among the channel attention methods SE-ResNet18, ECA-ResNet18, InI-ResNet50, and SK-ResNet18, the best accuracy is 52.51%, still 1.44% below our method. Channel attention emphasizes only the important features in the channel dimension and ignores feature relationships in the spatial dimension, so it fails to obtain complete features of the tongue image. The spatial attention method CBAM-ResNet18 achieves 50.91%, 3.04% worse than our method; in contrast to channel attention, spatial attention emphasizes only the importance of spatial positions and does not consider feature relationships in the channel dimension, so it cannot capture the relationships between channels well. The hybrid attention method CSRA-ResNet18 combines channel and spatial attention, but it cannot make full use of global spatial features, and it needs a large amount of computation to generate a three-dimensional attention mask. For the self-attention methods VIT-small, VIT-small-pretrained, VIT-base, VIT-base-pretrained, Shift-S, and Shift-S-pretrained, the results without pretraining are very poor, indicating that self-attention needs larger training data. Among them, Shift-S-pretrained is the best but still fails to surpass our method, because it is difficult for these models to encode multi-scale features. By contrast, the wavelet attention and reshape fusion obtain a more appropriate feature representation of the tongue images and thereby perform the constitution recognition task better.

7.6 Compared with Recent Hybrid Deep Learning Methods

Besides attention models, researchers have also improved the performance of deep neural networks from other perspectives, including depth, width, cardinality, and scale. Depth refers to the number of layers in the network; Table 5 shows that VGG, ResNet, and DenseNet improve their feature extraction capability by adding layers. Width refers to the number of channels in the feature maps; for example, Wide-ResNet achieves better performance by increasing the width. Cardinality refers to the number of convolution groups, and scale refers to the number of feature groups in the network. We compare RWA with these improved models experimentally; RWA still achieves the best performance with a small parameter size and low computational cost. On the other hand, as the above factors grow, the over-fitting risk of the models also increases; for example, the accuracy of ResNet101 with 101 layers is lower than that of ResNet18 with only 18 layers. Among the compared models, VGG16 and Res2Net50-14w-8s perform best. The former has a simple structure that reduces over-fitting to a certain extent, leading to better performance, while the latter adds multi-scale feature representation, which is better suited to extracting global and local features of tongue images. This suggests that multi-scale features are crucial to constitution recognition, which is consistent with the idea behind our method. Although DenseNet has the smallest parameter size and its performance improves with depth, its best accuracy is only 51.55%, still 2.40% lower than that of our method. It is noteworthy that our method uses fewer parameters than VGG16 yet achieves 1.37% higher accuracy. All these results further show the superiority of our method.

Table 5 Performance of proposed method and recent hybrid methods for the tongue constitution recognition, where ResNet18 is the backbone network (Bold: Best Results)

7.7 Visualization Analysis for Our Method

Parameter size and performance: To assess the proposed RWA in terms of both parameter size and performance, several methods from Tables 3, 4, and 5 are compared. A smaller parameter size is generally preferable, because it reduces the consumption of hardware resources such as computing power and storage, and because larger models over-fit more easily. Figure 5 shows that RWA uses its parameters more effectively than any other method: it achieves the highest accuracy with a smaller or similar parameter size, indicating that RWA not only achieves better performance but also needs fewer parameters.

Fig. 5 Relationship between parameter size and performance for different models, where each point represents the result of a single model and each broken line represents variants of a model with different complexity

Explicability of the proposed method: Because each constitution type can be described by attributes of the tongue image and each attribute has semantics, a constitution type can be explained by its attributes. Our method predicts the attributes and then determines the constitution type by checking whether they are close to the true attributes marked by doctors, so the predicted constitution type can be explained by the predicted attributes to a certain extent. The proposed method is therefore interpretable and can provide evidence for the diagnosis results to both doctors and patients. To illustrate this, one sample of each constitution type is selected from the test tongue images, and its predicted and true attribute vectors are presented together. Figure 6 shows that, for all samples of the different constitution types, the predicted attribute vectors are very similar to the true attribute vectors, which not only shows that our method can accurately identify constitution types but also provides attributes as evidence for the recognized types. Our method alleviates the uncertainty caused by the black-box nature of deep neural networks, making it feasible to assist doctors in diagnosis.

Fig. 6 Visualization of the predicted and true attributes of test samples of the nine constitution types, where “1” indicates that the corresponding attribute is selected and it is not selected otherwise

8 Conclusion

A new interpretable tongue constitution recognition method based on a reshaped wavelet attention network is proposed. Its advantages lie in that the wavelet attention improves the ability of the deep neural network to extract multi-scale features, while the reshape mechanism enables the network to extract multi-level features. It also makes full use of domain knowledge, which not only reduces the negative impact of small samples and thereby improves performance, but also makes the predicted constitution type interpretable, reducing the risk posed by the black-box nature of deep learning methods. These advantages allow doctors and patients to understand the diagnostic results easily and improve the acceptability of tongue constitution recognition methods. Meanwhile, the proposed method has a small parameter size and fast speed, so it can support many lightweight applications. However, the accuracy of tongue diagnosis alone is not sufficient; it should be combined with the other diagnostic methods, such as listening, asking, and feeling the pulse, to further improve performance. This is our future work.