1 Introduction

Traditional Chinese Medicine (TCM) regards an individual's constitution as the set of stable internal features formed by both innate inheritance and habits acquired over the course of life, including morphology, structure, and physiological and psychological states. The constitution therefore reflects both a person's current physical state and the future trend of his or her health, and it is the basis for diagnosing, treating, and preventing diseases [2, 3]. Consequently, constitution recognition is suitable not only for patients but also for healthy people, who can understand their health status early and then prevent diseases [4]. Constitution identification can be realized through the four diagnostic methods of TCM: looking, listening, asking, and feeling the pulse [9]. However, these methods require rich clinical experience from doctors [1, 6]. Modern technology has therefore been used for auxiliary constitution identification, where the constitution is defined as nine types by the Chinese Society of Traditional Chinese Medicine [10, 52]: Qi-deficiency, Yang-deficiency, Yin-deficiency, Phlegm-dampness, Damp-heat, Blood-stasis, Qi-depression, Special-diathesis, and Gentleness [5]. Early methods are based on constitution questionnaire scales [7, 8]: the individual answers a questionnaire, a score is calculated for each constitution type, and the constitution type is determined from the scores. Because the scale contains many questions, obtaining all answers takes a long time, and methods such as decision trees have been proposed to reduce the number of questions [11]. Another issue is that the results are easily influenced by the individual's subjectivity when answering questions; furthermore, respondents may misunderstand some questions and choose wrong answers. Besides questionnaire scales, other methods exist. For example, fuzzy linguistic variables have been combined with the judgment results of TCM doctors to form new samples for classifying individuals' constitutions [12]. Physical examination indexes, such as blood routine and urine routine indexes, have been used for classifying TCM constitution types [13]. Data mining methods such as association rules have also been applied to TCM constitutions [14]. Besides, face images, voice signals, and pulse signals have been used as well [15, 20, 21].

By comparison, tongue images are more widely used for recognizing individual health and disease status [22] and for classifying tongue shapes and TCM syndromes [23, 25]. Tongue imaging is an effective non-invasive technique for assessing the health status of patients [26]. Changes in the internal organs of the body are usually reflected on the tongue, for example in its texture and color. Therefore, tongue images can help explore the physiological functions and pathological changes of the human body. However, tongue diagnosis by doctors requires face-to-face communication, depends greatly on doctors' experience, and lacks objective and quantitative judgment rules. Recently, automated tongue diagnosis methods have been proposed to solve this problem [1, 27–32]. They usually include tongue image acquisition, tongue image segmentation, and tongue image classification. Acquisition is the first step of computerized tongue diagnosis, as image quality has an important impact on labeling and analyzing each tongue image [33, 34]. Segmentation aims to filter out interference from background information and thereby improve the subsequent classification performance [28, 29]. Tongue image classification can then be treated as a common image classification problem. Early tongue constitution recognition methods used traditional image processing to extract features such as color, texture, and shape [10, 16, 30, 35]. For example, color features have been extracted in the HSV [36], Lab [41], and HSI [38] color spaces respectively, texture features have also been extracted [39], and tongue features have been combined with body features [40]. After feature extraction, these methods use traditional machine learning to perform the constitution classification [10]. However, they rely on manually designed features [37]; due to the limits of human experience and professional knowledge, the designed features may be incomplete, interrelated, and redundant, easily resulting in poor performance. In recent years, deep neural networks have been used to extract features of tongue images for constitution recognition [31, 32]. For example, a convolutional neural network, the gray level co-occurrence matrix, the minimum bounding rectangle, and edge curves have been combined to extract tongue image features, which are then classified into one of the constitution types [24]. A hybrid deep learning method has also been applied to recognize the constitution from tongue images: it uses a lightweight convolutional network for initial tongue detection and another calibration network to find the refined area, so as to better recognize the constitution type [42]. Furthermore, a method that varies with the complexity of samples has been proposed to improve the accuracy of constitution classification [43]. To overcome the problems of class imbalance and small samples, a prototype network [45] and a method based on zero-shot learning [44] have been proposed.

Although these methods have made progress, they still have difficulty extracting discriminative features. This is because they do not extract multi-level features, so the extracted features lack diversity. They also fail to adaptively fuse features from different levels, resulting in incomplete features. To solve these issues, this paper proposes a novel tongue constitution recognition method based on the reshaped wavelet attention (RWA). The main contributions are as follows:

(1) The wavelet attention is applied to obtain multi-scale features through the discrete wavelet transform, and the attention mechanism is then used to weight them.

(2) The reshaping mechanism is proposed to construct a high-dimensional space composed of features from different levels, in which association rules are mined and then used to fuse the features efficiently.

(3) The wavelet attention and reshaping mechanism are integrated into a convolutional neural network to create more accurate attributes, by which tongue constitution recognition can be performed with higher performance and better interpretability.

Section 2 introduces the related work. Section 3 presents the domain knowledge. Section 4 introduces the wavelet attention, while the reshape fusion is presented in Sect. 5. The new method is proposed in Sect. 6. Experimental results are presented in Sect. 7, and Sect. 8 presents conclusions.

2 Related Work

As our method performs constitution recognition via tongue images, related constitution recognition methods are compared first. As our method also builds on wavelet attention, wavelet attention methods are analyzed as well.

2.1 Questionnaire Methods

The questionnaire methods are the most widely used for constitution recognition [7, 8, 17]; they follow the judgment criterion of the TCM constitution [10, 52]. To reduce the number of questions in the questionnaire, decision trees have been applied [11]. In particular, the questionnaire can be formed dynamically according to the individual's health state [17]. The questionnaire scale can also be combined with the tongue image to further improve performance [46]. These methods are simple to implement and accurate if all questions are answered correctly. However, the results are easily influenced by individual subjective attitudes, and the examinee may misunderstand some questions, so the answers may be wrong.

2.2 Traditional Machine Learning Methods

Traditional machine learning methods have been applied to constitution recognition based on tongue images [10]. They differ in the features used, including color, texture, and shape features [10, 16, 35]. Some methods use both tongue features and body features [40], including color features in the HSV [36], Lab [41], and HSI [38] color spaces, while texture features are also used [39]. Fuzzy linguistic variables for tongue images have been combined with the judgment results of several TCM doctors to form a database for classifying individuals' constitutions [12]. Besides, association rules within a cloud framework have been mined to classify TCM constitutions [14]. Beyond tongue images, face images, voice signals, pulse signals, and physical examination indexes such as blood routine and urine routine indexes have also been applied for constitution recognition [13, 15, 18, 19]. After feature extraction, traditional machine learning methods perform the constitution classification. However, these methods rely on manually designed features [37], which may be incomplete, interrelated, and redundant due to the limits of human experience and professional knowledge, easily resulting in poor performance. Our method overcomes these shortcomings by learning features automatically from the tongue images.

2.3 Deep Neural Network Methods

Deep neural networks have been applied to constitution recognition [47]. For example, the Inception-v3 model has been used to classify the nine constitution types [40] with 208 tongue images for training. Another method uses a convolutional neural network to extract features of tongue images, with three categories and 483 tongue images for training [24]. A larger database has also been used for constitution recognition, where tongue detection is performed as well [47]. A better method is based on the complexity perception of tongue images [43]: it recognizes the constitution of a test tongue image by selecting the classifier with the suitable complexity, an idea validated in a later method [48]. Besides tongue images, voice signals and pulse signals have been fed to convolutional neural networks for constitution recognition [47]. Furthermore, face images have been used for constitution classification through multi-level and multi-scale feature aggregation within a convolutional neural network [49]. Because tongue images are not easy to collect, the resulting training data are often too small for deep learning methods [46], so zero-shot learning methods can be considered [50, 51, 72]. In addition, new deep neural network architectures that fuse and utilize both local and global information simultaneously can be adapted to our problem [73], and novel methods have been proposed to deal efficiently with uncertainty and concept drift [74, 75]. However, these methods are not directly suitable for constitution recognition. One method uses domain knowledge and latent attributes to recognize the constitution [44]. Unlike our method, none of these methods extracts multi-scale and multi-level features simultaneously through the discrete wavelet transform.

2.4 Reshaped Wavelet Attention

Recently, the wavelet transform has been applied to design new neural networks with improved performance. For example, an attention-based wavelet convolutional neural network has been proposed for EEG classification [55]. It first uses multi-scale wavelet analysis to decompose the input EEG into many components with different frequency bands, which are then fed into a network with an attention mechanism to extract features for classification. Another method, selective wavelet attention, learns a series of wavelet attention maps to guide the separation of rain and background information in both the spatial and frequency domains [53]. A wavelet-attention block has also been designed that applies attention in the high-frequency domain [54]. However, unlike our method, these methods are not designed for tongue constitution recognition, and they do not exploit domain knowledge together with the reshaped wavelet attention.

3 Domain Knowledge

In clinical tongue diagnosis, a doctor first observes the patient's tongue image, noting features such as tongue color, tongue shape, and tongue quality, and then determines its attributes. Finally, the doctor determines the patient's constitution according to the relationship between the constitution type and the attributes of the tongue image. According to the Chinese national standard [52], the attributes of the tongue image differ for each constitution type [44]; they are summarized in Table 1. Because the Special-diathesis constitution is too complicated to be described accurately, we can only use “others” as its semantic attribute. These attributes are grouped semantically into tongue color, tongue body, and tongue nature. In this way, each constitution type can be represented by a fifteen-dimensional semantic vector \((A_1,\ldots , A_{15})\) whose attributes are encoded in one-hot mode: 1 indicates that the constitution has the corresponding attribute and 0 indicates that it does not. To facilitate the following description, we define the relationship between the attributes of the tongue image and the corresponding constitutions as the attribute matrix \(w_{\text{attribute}}\in {\mathbb{R}}^{15\times 9},\) where each constitution type corresponds to a specific attribute vector. As prior domain knowledge, this matrix helps the neural network model predict the constitution type more accurately.

Table 1 Attributes of tongue image for each constitution type
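To make the one-hot encoding concrete, a minimal sketch of \(w_{\text{attribute}}\) as a \(15\times 9\) zero-one array follows. The individual entries set here are illustrative placeholders only; the true assignments are those of Table 1 and the national standard [52].

```python
import torch

N_ATTRIBUTES, N_TYPES = 15, 9  # attributes A_1..A_15, nine constitution types

# w_attribute[i, j] = 1 iff constitution type j has attribute i. The single
# entry set below is an illustrative placeholder, NOT a value from Table 1.
w_attribute = torch.zeros(N_ATTRIBUTES, N_TYPES)
w_attribute[0, 0] = 1.0  # placeholder: attribute A_1 present for type 0

def attribute_vector(type_index: int) -> torch.Tensor:
    """Return the 15-dimensional one-hot attribute vector of one type."""
    return w_attribute[:, type_index]
```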

4 Wavelet Attention

To mine multi-scale features, the wavelet attention framework shown in Fig. 1 is proposed. Given input features of dimension \(H\times W\times C,\) four components of dimension \(H/2\times W/2\times C\) are obtained after the two-dimensional discrete wavelet transform (DWT) is performed. Each component is used to compute a corresponding attention mask \(M_s\in {\mathbb{R}}^{\frac{H}{2}\times \frac{W}{2}}\) through spatial attention and position normalization; the components are then weighted by their masks and concatenated to produce the output features \({X}'\in {\mathbb{R}}^{\frac{H}{2}\times \frac{W}{2}\times 4C}.\)

Fig. 1 Wavelet attention framework to mine multi-scale features

Features from the two-dimensional discrete wavelet transform: The DWT decomposes the input data into components with different frequencies whose spatial dimensions are half of the original ones. It can therefore replace down-sampling operations such as max pooling and mean pooling in convolutional neural networks. The decomposition used by the proposed wavelet attention can be described as follows:

$$\begin{aligned} X {\mathop {\longrightarrow }\limits ^{{\text{DWT}}}} \{X_{LL},X_{LH},X_{HL},X_{HH}\} \in {\mathbb{R}}^{ \frac{H}{2} \times \frac{W}{2} \times C} \end{aligned}$$
(1)

where \(X \in {\mathbb{R}}^{H\times W \times C}\) denotes the input features, \(X_{LL}\) denotes the low-frequency component that retains the main information of the original features, and \(X_{HH}\) denotes the high-frequency component that often contains noise or texture information. \(X_{LH}\) and \(X_{HL}\) denote components with mixed frequencies. As geometric and texture information is needed for the tongue constitution recognition task, the high-frequency components should also be considered. Consequently, the wavelet attention module adaptively aggregates the four components to provide features that are as comprehensive as possible for tongue constitution recognition.
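The paper does not name the wavelet family, so the following PyTorch sketch implements Eq. (1) with the Haar basis as an assumption; the helper name `haar_dwt2d` is ours.

```python
import torch
import torch.nn.functional as F

def haar_dwt2d(x: torch.Tensor):
    """One-level 2-D DWT of a (N, C, H, W) tensor with the Haar basis.

    Returns the four sub-bands X_LL, X_LH, X_HL, X_HH of Eq. (1),
    each of shape (N, C, H/2, W/2); H and W must be even.
    """
    ll = x.new_tensor([[0.5, 0.5], [0.5, 0.5]])    # low-low: local average
    lh = x.new_tensor([[0.5, 0.5], [-0.5, -0.5]])  # vertical detail
    hl = x.new_tensor([[0.5, -0.5], [0.5, -0.5]])  # horizontal detail
    hh = x.new_tensor([[0.5, -0.5], [-0.5, 0.5]])  # diagonal detail
    filt = torch.stack([ll, lh, hl, hh]).unsqueeze(1)        # (4, 1, 2, 2)
    n, c, h, w = x.shape
    # Apply all four analysis filters to every channel with stride 2.
    y = F.conv2d(x.reshape(n * c, 1, h, w), filt, stride=2)  # (N*C, 4, H/2, W/2)
    y = y.reshape(n, c, 4, h // 2, w // 2)
    return y[:, :, 0], y[:, :, 1], y[:, :, 2], y[:, :, 3]
```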

Feature aggregation based on spatial attention and position normalization: To better aggregate the four kinds of features produced by the DWT, spatial attention (SA) and position normalization (PN) operations are proposed. Spatial attention focuses on important positions of the features, emphasizing important features and suppressing unnecessary ones. Because the tongue constitution recognition task is closely related to the spatial features of the tongue image, the wavelet attention adopts a spatial attention mechanism with four different attention masks. Furthermore, to account for the different contributions of different frequencies to the aggregated features, position normalization is proposed to normalize the weight of each spatial position across the attention masks.

Given \(\{X_{LL},X_{LH},X_{HL},X_{HH}\}\) as inputs, the wavelet attention module infers a spatial attention mask for each component, normalizes it over the components at each spatial position, adjusts the importance of each component accordingly, and then aggregates the reweighted features. As shown in Fig. 1, this process can be summarized as follows:

$$\begin{aligned} \{X_{LL},X_{LH},X_{HL},X_{HH}\}{\mathop {\longrightarrow }\limits ^{{\text{SA}}}} \{M_S^{LL},M_S^{LH},M_S^{HL},M_S^{HH}\} \end{aligned}$$
(2)
$$\begin{aligned} \{M_S^{LL},M_S^{LH},M_S^{HL},M_S^{HH}\}{\mathop {\longrightarrow }\limits ^{{\text{PN}}}} \{ {\bar{M}}_S^{LL},{\bar{M}}_S^{LH},{\bar{M}}_S^{HL},{\bar{M}}_S^{HH}\} \end{aligned}$$
(3)
$$\begin{aligned} {X}'_{AB} ={\bar{M}}_S^{AB} \otimes X_{AB},\quad A \in \{L,H\},\ B \in \{L,H\} \end{aligned}$$
(4)
$$\begin{aligned} {X}'=[{X}'_{LL},\ {X}'_{LH},\ {X}'_{HL},\ {X}'_{HH}] \end{aligned}$$
(5)

where \({\bar{M}}_S^{AB}\) is the normalized attention mask and \({X}'\) is the aggregated output features. As shown in the upper right corner of Fig. 1, the spatial attention mask \(M_S\in {\mathbb{R}}^{H\times W}\) is generated from the input \(F\) by:

$$\begin{aligned} M_S=f_{3\times 3}(f_{\text{AVG}}(F)) \end{aligned}$$
(6)

where \(f_{\text{AVG}}\) denotes the mean pooling operation and \(f_{3\times 3}\) denotes the convolution operation with a \(3\times 3\) kernel. Because the four attention masks are computed independently of one another, we propose a position normalization operation to learn the relationship between them, dynamically adjusting the weight of each attention mask and learning their complementarity. It is illustrated in the lower right corner of the figure. Given the input attention mask \(M_S,\) its weight at each spatial coordinate is computed as follows:

$$\begin{aligned} {\bar{M}}_S^{AB(h,w)}=\frac{{\text{e}}^{M_S^{AB(h,w)}}}{\sum _{T\in \{LL,LH,HL,HH\}} {\text{e}}^{M_S^{T(h,w)}}} \end{aligned}$$
(7)

where \(M_S^{AB(h,w)}\) denotes the weight of the input attention mask at spatial coordinates \((h,w),\) and \({\bar{M}}_S^{AB(h,w)}\) denotes the weight of the output attention mask at the same coordinates. After the attention masks are generated and position-normalized, the features of the four components are aggregated by a simple concatenation, which superimposes the input features and increases their diversity. Concatenation is used because the features obtained by the two-dimensional DWT have different frequencies and obviously do not follow the same distribution.
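Putting Eqs. (2)–(7) together, a minimal PyTorch sketch of the module follows. It reuses the `haar_dwt2d` helper above; the paper does not state whether the four masks share one \(3\times 3\) convolution, so separate convolutions are assumed here.

```python
import torch
import torch.nn as nn

class WaveletAttention(nn.Module):
    """Sketch of the wavelet attention module, Eqs. (2)-(7) and Fig. 1."""

    def __init__(self):
        super().__init__()
        # One 3x3 conv per sub-band implements f_3x3 of Eq. (6); separate
        # (unshared) convolutions are an assumption.
        self.sa = nn.ModuleList([nn.Conv2d(1, 1, 3, padding=1) for _ in range(4)])

    def forward(self, x):                          # x: (N, C, H, W)
        bands = haar_dwt2d(x)                      # four (N, C, H/2, W/2) sub-bands
        # Eq. (6): channel-wise mean pooling f_AVG, then a 3x3 convolution.
        masks = [conv(b.mean(dim=1, keepdim=True))
                 for conv, b in zip(self.sa, bands)]
        # Eq. (7), position normalization: softmax over the four masks at
        # every spatial location so their weights sum to one.
        m = torch.softmax(torch.cat(masks, dim=1), dim=1)  # (N, 4, H/2, W/2)
        # Eqs. (4)-(5): weight each sub-band by its mask and concatenate.
        out = [b * m[:, i:i + 1] for i, b in enumerate(bands)]
        return torch.cat(out, dim=1)               # (N, 4C, H/2, W/2)
```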

5 Reshape Fusion

Wavelet attention improves the ability of a convolutional neural network to extract multi-scale features from the tongue image, but it cannot extract multi-level features efficiently, because scale and level are different concepts. The scale refers to the grain size or spatial resolution, while the level refers to the degree of semantic abstraction. The wavelet attention can automatically select the optimal spatial resolution, but it cannot easily change the semantic level of the features, as it works only on features of a given semantic level. In a deep neural network, features at shallow levels tend to represent details of geometry and texture, while deeper features tend to represent more abstract semantic information. When fusing features of different levels, their different contributions should be considered. Thus an innovative reshape fusion method is proposed, which dynamically integrates multi-level features from different network layers, enhancing the importance of key features and suppressing irrelevant ones. In more detail, it first obtains one-dimensional aggregated features by concatenating features from multiple levels, rearranges them into a three-dimensional space using the reshaping operation, and learns the relationships between the reshaped features. Subsequently, it uses the inverse reshaping operation to obtain a one-dimensional relationship mask from the three-dimensional relationship mask. Finally, the contributions of the different features are weighted according to the one-dimensional relationship mask.

Fig. 2 Reshape and inverse reshape operations for feature fusion

Reshape operation: As shown in Fig. 2, the features \(F_C \in {\mathbb{R}}^{C}\) are first formed by the concatenation operation:

$$\begin{aligned} F_C=[F_1\oplus F_2\oplus F_3\oplus F_4] \in {\mathbb{R}}^{C} \end{aligned}$$
(8)
$$\begin{aligned} C=C_1+ C_2+ C_3+ C_4 \end{aligned}$$
(9)

where \(F_i \in {\mathbb{R}}^{C_i}\) denotes the input one-dimensional features, \(C_i\) is the number of features, and \(\oplus\) denotes the concatenation operation. Next, the reshape operation rearranges the features \(F_C \in {\mathbb{R}}^{C}\) along the spatial dimensions to obtain the reshaped features \(F_R \in {\mathbb{R}}^{H\times W \times K},\) where \(H, W, K\) are the height, width, and number of channels, and \(C=H\times W\times K:\)

$$\begin{aligned} F_R&=\phi _R(F_C)=\phi _R([u_1,u_2,\ldots ,u_C])=[f_1,f_2,\ldots ,f_K] \\ &=\left[ \begin{array}{c} \left[ \begin{array}{ccc} u_1&{}\ldots &{}u_W\\ \ldots &{}\ldots &{}\ldots \\ u_{(H-1)\times W+1} &{}\ldots &{} u_{H\times W} \end{array}\right] \\ \left[ \begin{array}{ccc} u_{H\times W+1}&{}\ldots &{}u_{(H+1)\times W}\\ \ldots &{}\ldots &{}\ldots \\ u_{(2H-1)\times W+1} &{}\ldots &{}u_{2H\times W} \end{array}\right] \\ \vdots \\ \left[ \begin{array}{ccc} u_{C-H\times W+1}&{}\ldots &{}u_{C-(H-1)\times W}\\ \ldots &{}\ldots &{}\ldots \\ u_{C-W+1}&{}\ldots &{} u_{C} \end{array}\right] \end{array}\right] \in {\mathbb{R}}^{H\times W\times K} \end{aligned}$$
(10)

where \(\phi _R\) denotes the reshaping function, \(u_i\) denotes each feature component, and \(f_i\) denotes a feature map. The reshape operation introduces no parameters to be learned, so it does not increase the parameter count.

Relationship interaction learning: Different features contribute differently to the task. As shown in Fig. 2, relationship interaction learning is proposed to learn the relationship weights between different features. It models the 3D relationship mask \(M_{3D}\in {\mathbb{R}}^{H\times W\times K}\) over the reshaped features \(F_R\) as follows:

$$\begin{aligned} M_{3D}&=F_R\circledast \theta _{3\times 3} =[f_1,f_2,\ldots ,f_K]\circledast \theta _{3\times 3} \\ &= \left[ \begin{array}{c} \left[ \begin{array}{ccc} v_1&{}\ldots &{}v_W\\ \ldots &{}\ldots &{}\ldots \\ v_{(H-1)\times W+1} &{}\ldots &{} v_{H\times W} \end{array}\right] \\ \left[ \begin{array}{ccc} v_{H\times W+1}&{}\ldots &{}v_{(H+1)\times W}\\ \ldots &{}\ldots &{}\ldots \\ v_{(2H-1)\times W+1} &{}\ldots &{}v_{2H\times W} \end{array}\right] \\ \vdots \\ \left[ \begin{array}{ccc} v_{C-H\times W+1}&{}\ldots &{}v_{C-(H-1)\times W}\\ \ldots &{}\ldots &{}\ldots \\ v_{C-W+1}&{}\ldots &{} v_{C} \end{array}\right] \end{array}\right] \in {\mathbb{R}}^{H\times W\times K} \end{aligned}$$
(11)

where \(\circledast\) denotes the convolution operation and \(\theta _{3\times 3}\) denotes a convolution kernel of size \(3\times 3.\) \(M_{3D}\) is the learned three-dimensional relationship mask, and \(v_i\) is the adaptive weight of the element \(u_i\) in \(F_R.\) When extracting the relationships between features, relationship interaction learning uses the convolution operation rather than a fully connected operation, which reduces the number of parameters: for input features of dimension \(C=H\times W\times K,\) the two options require \(K\times 3\times 3\times K=9K^2\) and \((HWK)\times (HWK)=H^2W^2K^2\) parameters respectively, so our method clearly uses fewer. In addition, a fully connected operation cannot encode the spatial neighborhood relations between features, whereas the convolution operation extracts relations from multiple different spatial neighborhoods. The internal relationships of multiple neighborhoods are thus considered simultaneously, yielding a finer relationship between features.
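As a concrete check with assumed sizes (the paper does not fix \(H, W, K\)): for \(H=W=8\) and \(K=15,\) so that \(C=960,\) the convolution needs \(9K^2=2025\) parameters, while a fully connected layer would need \(H^2W^2K^2=921{,}600,\) more than 450 times as many.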

Inverse reshape operation: Because the input features \(F_C\) lie in a one-dimensional space, the three-dimensional relationship mask \(M_{3D}\) cannot be multiplied with them element by element directly to adjust the contribution weight of each feature. Thus an inverse reshape (IR) operation \(\phi _{IR}\) is proposed:

$$\begin{aligned} M_{1D}=\phi _{IR}(M_{3D}) =[v_1,v_2,\ldots ,v_C]\in {\mathbb{R}}^C. \end{aligned}$$
(12)

Subsequently, the input features can be scaled by multiplying with the one-dimensional relationship mask as follows:

$$\begin{aligned} {F}'=F_C\otimes \sigma (M_{1D})\in {\mathbb{R}}^C \end{aligned}$$
(13)

where \(\otimes\) denotes the element-by-element multiplication and \(\sigma\) is the sigmoid function.
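A compact PyTorch sketch of the whole reshape fusion module, Eqs. (8)–(13), follows. The factorization \(C=H\times W\times K\) is left open by the paper, so the constructor arguments `h`, `w`, `k` are user-chosen assumptions.

```python
import torch
import torch.nn as nn

class ReshapeFusion(nn.Module):
    """Sketch of the reshape fusion of Eqs. (8)-(13) and Fig. 2."""

    def __init__(self, h: int, w: int, k: int):
        super().__init__()
        self.h, self.w, self.k = h, w, k
        # 3x3 convolution over the reshaped feature grid (Eq. (11));
        # needs 9*K^2 weights versus H^2*W^2*K^2 for a fully connected layer.
        self.rel = nn.Conv2d(k, k, kernel_size=3, padding=1)

    def forward(self, feats):             # feats: list of (N, C_i) tensors
        f_c = torch.cat(feats, dim=1)     # Eq. (8): (N, C) with C = sum C_i
        f_r = f_c.view(-1, self.k, self.h, self.w)  # reshape, Eq. (10)
        m_3d = self.rel(f_r)              # 3-D relationship mask, Eq. (11)
        m_1d = m_3d.reshape(f_c.shape)    # inverse reshape, Eq. (12)
        return f_c * torch.sigmoid(m_1d)  # Eq. (13): reweight the features
```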

6 Proposed Constitution Recognition Method

The constitution recognition method based on the reshaped wavelet attention (RWA) over tongue images is proposed; its framework is shown in Fig. 3. It is an end-to-end learning framework that takes the tongue image as input and the predicted constitution type as output. RWA integrates the wavelet attention (WA) and reshape fusion (RF) into a given convolutional neural network such as ResNet18. For the input tongue image, our method uses multi-stage convolution layers to extract features, wavelet attention to augment them, and reshape fusion to fuse them automatically. Subsequently, RWA maps the obtained tongue image features into the latent semantic attribute space to obtain the predicted attribute vector. Finally, the distance between the predicted attribute vector and the true attribute vector of each constitution is calculated, and the constitution type with the minimum distance is output. RWA simulates the diagnosis process of a doctor, which greatly improves the performance of constitution recognition and makes it stable, accurate, rapid, and interpretable. The semantics of the attributes for each constitution type are fixed and easily understood by doctors. As illustrated in Fig. 3, when our method predicts the constitution type of a tongue image, it also predicts its attributes, which provide the interpretation. On the other hand, the features extracted in different layers of the backbone network are of different semantic levels; the association knowledge among them is mined and hierarchized by the reshape fusion module illustrated in Fig. 2.

Fig. 3 Framework of the tongue constitution recognition based on the wavelet attention and reshape fusion

In this framework, RWA takes ResNet18, which has five stages, as the backbone network [59]. Except for the first, every stage is composed of residual blocks, and the stages have different output feature scales. Let \(\phi _{i}\) denote the residual block of the \(i\)th stage. Given the input tongue image \(X \in {\mathbb{R}}^{H\times W\times C},\) the features of all stages, denoted \(\{X_1,\ldots ,X_5\},\) are obtained through the successive residual blocks. These features are decomposed, weighted, and aggregated by the wavelet attention and then fused to obtain the multi-scale features \(\{F_1,\ldots ,F_4\}.\) Subsequently, these are reshaped into the final tongue image features \({F}' \in {\mathbb{R}}^{C}\) through the reshape fusion operation. The above process can be described as:

$$\begin{aligned} X_1=\phi _{1}(X) \end{aligned}$$
(14)
$$\begin{aligned} X_i=\phi _{i}(X_{i-1}),\quad i\in \{2,3,4,5\} \end{aligned}$$
(15)
$$\begin{aligned} {X}'_i=f_{1\times 1}(\phi _{\text{WA}}(X_i)),\quad i\in \{2,3,4\} \end{aligned}$$
(16)
$$\begin{aligned} F_1=f_{\text{GAP}}(X_2) \end{aligned}$$
(17)
$$\begin{aligned} F_i=f_{\text{GAP}}({X}'_i+X_{i+1}),\quad i\in \{2,3,4\} \end{aligned}$$
(18)
$$\begin{aligned} {F}'=\phi _{\text{RF}}(F_1,F_2,F_3,F_4) \end{aligned}$$
(19)

where \(f_{1 \times 1}\) represents the \(1\times 1\) convolution operation, \(\phi _{\text{WA}}\) the wavelet attention, \(f_{\text{GAP}}\) the global average pooling, and \(\phi _{\text{RF}}\) the reshape fusion operation. Subsequently, the predicted attribute vector \({\hat{F}}_{\text{Attribute}} \in {\mathbb{R}}^{1\times 15}\) is obtained by a multi-layer perceptron (MLP) with one hidden layer as follows:

$$\begin{aligned} {\hat{F}}_{\text{Attribute}} =\text{MLP}({F}')={F}'\times W_A \end{aligned}$$
(20)

where \(W_{A} \in {\mathbb{R}}^{C\times 15}\) holds the learnable parameters of the MLP. We then calculate the similarity between the predicted attribute vector and the true attribute vector of each constitution and output the predicted probabilities \({\hat{Y}} \in {\mathbb{R}}^{1\times 9}\) according to these similarities. To speed up recognition, the calculation is simplified as follows:

$$\begin{aligned} {\hat{Y}}= \text{softmax}({\hat{F}}_{\text{Attribute}}\times W_{\text{Attribute}}) \end{aligned}$$
(21)
$$\begin{aligned}= \text{softmax}([{\hat{Y}}_1,{\hat{Y}}_2,\ldots ,{\hat{Y}}_9]) \end{aligned}$$
(22)
$$\begin{aligned} {\hat{Y}}_i= \frac{{\text{e}}^{{\hat{y}}_i}}{\sum _{j=1}^9{\text{e}}^{{\hat{y}}_j}} \end{aligned}$$
(23)
$$\begin{aligned} {\hat{y}}_i= {\hat{F}}_{\text{Attribute}}\odot (W^i_{\text{Attribute}})^{\text{T}}. \end{aligned}$$
(24)
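As a concrete reading of Eqs. (21)–(24), the sketch below scores each constitution by the inner product between the predicted attribute vector and that type's true attribute vector, then applies softmax over the nine scores; the function name is ours.

```python
import torch

def predict_constitution(f_attr: torch.Tensor, w_attribute: torch.Tensor):
    """Sketch of Eqs. (21)-(24).

    f_attr      : (N, 15) predicted attribute vectors
    w_attribute : (15, 9) attribute matrix (one column per type, Table 1)
    """
    logits = f_attr @ w_attribute         # (N, 9): the \hat{y}_i of Eq. (24)
    probs = torch.softmax(logits, dim=1)  # Eq. (23)
    return probs, probs.argmax(dim=1)     # probabilities and predicted type
```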

The tongue constitution recognition task can be regarded as an image classification task, so we define the constitution classification loss as the cross-entropy loss:

$$\begin{aligned} {\mathcal {L}}_{\text{cls}}=-Y \cdot \log ({\hat{Y}})=-\sum _{i=1}^9Y_i\cdot \log ({\hat{Y}}_i) \end{aligned}$$
(25)

where \(Y\in {\mathbb{R}}^{1\times 9}\) represents the true constitution type vector of the input tongue image and \(Y_i\) the probability of constitution type \(i\): \(Y_i=1\) if \(i\) is the true constitution type of the input tongue image and \(Y_i=0\) otherwise.

To constrain the predicted attributes \({\hat{F}}_{\text{Attribute}}\in {\mathbb{R}}^{1\times 15}\) to lie closer to the true attributes of the input tongue image, we further introduce an attribute embedding loss that shortens the distance between the predicted and true attributes:

$$\begin{aligned} {\mathcal {L}}_{\text{AE}}=\sum _{j:\,Y_j=1} \Vert {\hat{F}}_{\text{Attribute}}-(W^j_{\text{Attribute}})^{\text{T}}\Vert ^2 \end{aligned}$$
(26)

where \(W^j_{\text{Attribute}}\) represents the true attribute vector of the given constitution type. The total loss is thus defined as

$$\begin{aligned} {\mathcal {L}}_{\text{total}}={\mathcal {L}}_{\text{cls}}+{\mathcal {L}}_{\text{AE}}. \end{aligned}$$
(27)
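A minimal sketch of the training objective of Eqs. (25)–(27) follows, assuming integer class labels; averaging the attribute embedding loss over the batch is an implementation choice of ours, not stated in the paper.

```python
import torch
import torch.nn.functional as F

def rwa_loss(f_attr, w_attribute, y_true):
    """Total loss of Eq. (27): cross entropy (Eq. (25)) plus the attribute
    embedding loss (Eq. (26)).

    f_attr      : (N, 15) predicted attribute vectors
    w_attribute : (15, 9) attribute matrix
    y_true      : (N,) integer constitution labels in [0, 9)
    """
    logits = f_attr @ w_attribute            # (N, 9) similarities, Eq. (21)
    l_cls = F.cross_entropy(logits, y_true)  # Eq. (25); softmax is included
    target = w_attribute.t()[y_true]         # (N, 15) true attribute vectors
    l_ae = ((f_attr - target) ** 2).sum(dim=1).mean()  # Eq. (26), batch mean
    return l_cls + l_ae
```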

7 Experimental Results

Extensive experiments are carried out on tongue images for the constitution recognition task to evaluate the proposed RWA in terms of effectiveness, efficiency, and interpretability.

7.1 Databases

As a large number of tongue images with labelled constitution types is not available, we construct a tongue image constitution database composed of tongue images and their constitution types. The images were collected with cameras from patients in hospital outpatient departments. In the collected images, the effective tongue area occupies only a small part, and the information outside the tongue is interference for tongue diagnosis. To extract the tongue region, we use an object detection model [42] to separate the effective tongue image from the background, so that only a small part of the background information remains.

Fig. 4 Distribution of tongue constitution types in the tongue image constitution database, which contains 46,753 images of nine types

As shown in Fig. 4, the constructed database contains 46,753 images of nine types, of which 80% is used for training and validation and the remainder for testing. Specifically, the training and validation sets together contain 37,398 samples, while the test set has 9355 samples.

7.2 Implementation Details

For a fair comparison, we use the deep learning framework PyTorch and the model library timm to implement all compared methods. As methods such as VIT-small [57], VIT-small-pretrained [57], VIT-base [57], VIT-base-pretrained [57], Shift-S [58], and Shift-S-pretrained [58] are difficult to converge under other hyper-parameters, they follow the optimal implementation details of the original papers [57, 58]. Our method RWA and the other compared methods adopt the same hyper-parameters to ensure fairness as much as possible. In training, stochastic gradient descent with a weight decay of \(1\text{e}{-}4,\) a momentum of 0.9, and a batch size of 64 is used to learn the parameters. The model is trained for 300 epochs in total. The learning rate starts from 0.1 and decays to \(\frac{1}{10}\) of its last value every 50 epochs. During training, only general data augmentation is used: the input image is scaled to \(224 \times 224,\) 4 pixels are padded around the image edges symmetrically, an image of size \(224 \times 224\) is cropped randomly, and the image is flipped with probability 0.5. During testing, the image is scaled to \(224 \times 224.\) In addition, the training and test images are normalized by subtracting the mean and dividing by the standard deviation.
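The schedule and augmentation above can be expressed in PyTorch roughly as follows; the normalization statistics and the `make_training_setup` helper are assumptions, since the paper does not list the dataset mean and standard deviation.

```python
import torchvision.transforms as T
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

# Placeholder statistics: the paper normalizes by the dataset mean and
# standard deviation but does not list their values.
MEAN, STD = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]

train_tf = T.Compose([
    T.Resize((224, 224)),                # scale the input image
    T.Pad(4, padding_mode="symmetric"),  # pad 4 pixels around the edges
    T.RandomCrop(224),                   # random 224x224 crop
    T.RandomHorizontalFlip(p=0.5),       # flip with probability 0.5
    T.ToTensor(),
    T.Normalize(MEAN, STD),
])
test_tf = T.Compose([T.Resize((224, 224)), T.ToTensor(), T.Normalize(MEAN, STD)])

def make_training_setup(model):
    """SGD with momentum 0.9, weight decay 1e-4; lr 0.1, /10 every 50 epochs."""
    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    scheduler = StepLR(optimizer, step_size=50, gamma=0.1)
    return optimizer, scheduler
```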

Constitution recognition is a single-label classification task in which each tongue image corresponds to one true constitution label. Thus the performance of each model in the experiments is evaluated by the accuracy rate.
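For completeness, the metric amounts to the following one-liner (a sketch; names are ours):

```python
import torch

def accuracy(pred_types: torch.Tensor, true_types: torch.Tensor) -> float:
    """Fraction of test images whose predicted type equals the true label."""
    return (pred_types == true_types).float().mean().item()
```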

7.3 Ablation Studies

As our method RWA comprises components such as the tongue attributes, wavelet attention, and reshape fusion, the effectiveness and necessity of each component should be verified by ablation experiments. Table 2 shows that the constitution recognition accuracy of the proposed method under every combination of components exceeds that of the baseline ResNet18 [59]. When only the tongue attributes are used, the accuracy is already 1.82% higher than the baseline, indicating that the introduction of tongue attributes is very effective for tongue constitution recognition; it also indicates that learning from the diagnostic reasoning of doctors can effectively improve the accuracy of constitution recognition. When the wavelet attention is further added, RWA surpasses the baseline by nearly 2.73%, which is nearly 0.91% higher than RWA with tongue attributes only, showing that the wavelet attention is effective. When the tongue attributes are combined with the reshape fusion, RWA exceeds the baseline by 2.45%, better than RWA with tongue attributes only by about 0.64%, showing that the reshape fusion can effectively fuse features from different levels. With all components combined, RWA achieves an accuracy improvement of about 3.18% over the baseline, indicating the effectiveness of the wavelet attention and reshape fusion in the constitution recognition task; the two components are complementary and yield consistent performance improvements. In summary, the combination of tongue attributes, wavelet attention, and reshape fusion accounts for the progress of the proposed method in the constitution recognition task.

Table 2 Performance of proposed method with different combinations, where \(\surd\) indicates that the component is combined and ResNet18 is the backbone network (Bold: Best Results)

7.4 Compared with Constitution Recognition Methods

Experiments are conducted to verify that the proposed method is superior to the other tongue constitution recognition methods. Table 3 shows that the proposed method surpasses all compared constitution recognition methods, obtaining the best result with an accuracy of 53.95%. At the same time, it has the smallest parameter size, only 11.88M, showing its efficiency.

By comparison, VGG-Tongue, GoogleNet-Tongue, and ResNet-Tongue achieve poor performance [42], as they do not adjust the common network structure to the constitution recognition task. GAZ-mResNet18-EA and GAZ-mResNet18-EADLF introduce zero-shot learning for constitution recognition [44] and therefore obtain better performance, as they can learn the mapping between tongue image attributes and the constitution type. However, because they contain neither the wavelet attention nor the reshape fusion, they cannot obtain the best performance on complex constitution recognition tasks, and their parameter sizes are clearly larger than that of our method.

Table 3 Performance of proposed method and recent tongue constitution recognition methods, where ResNet18 is the backbone network (Bold: Best Results)

7.5 Compared with Recent Attention Models

As attention models have shown excellent performance in computer vision, they are compared with our method on the constitution recognition task. Different attention mechanisms are considered, namely channel attention, spatial attention, and mixed attention. Self-attention has recently shown powerful performance and great potential in image classification tasks [61], so it is also compared. All compared methods are implemented, and their experimental results are shown in Table 4, where the suffix “pretrained” indicates that the model used additional large-scale training data for pre-training.

Table 4 Performance of proposed method and recent attention methods on the tongue constitution recognition, where ResNet18 is the backbone network (Bold: Best Results)

Table 4 shows that RWA outperforms all compared attention models by a large margin. Among the channel attention methods SE-ResNet18, ECA-ResNet18, InI-ResNet50, and SK-ResNet18, the best accuracy is 52.51%, still 1.44% below our method. Channel attention emphasizes only the important features in the channel dimension and ignores feature relationships in the spatial dimension, so it fails to obtain complete features of the tongue image. The spatial attention method CBAM-ResNet18 achieves 50.91%, 3.04% worse than our method; in contrast to channel attention, spatial attention emphasizes only the importance of spatial positions and does not consider feature relationships in the channel dimension, so it cannot capture the relationships between channels well. The hybrid attention method CSRA-ResNet18 combines channel and spatial attention, but it cannot make full use of global spatial features, and it needs a large amount of computation to generate a three-dimensional attention mask. For the self-attention methods VIT-small, VIT-small-pretrained, VIT-base, VIT-base-pretrained, Shift-S, and Shift-S-pretrained, the results without pretraining are very poor, indicating that self-attention needs larger training data. Among them, Shift-S-pretrained is the best but still fails to surpass our method, because it is difficult for these models to encode multi-scale features. By contrast, the wavelet attention and reshape fusion obtain a more appropriate feature representation of the tongue images and thereby perform the constitution recognition task better.

7.6 Compared with Recent Hybrid Deep Learning Methods

Besides attention models, researchers have also improved the performance of deep neural networks from other perspectives, including depth, width, cardinality, and scale. Depth refers to the number of layers in the network; Table 5 shows that VGG, ResNet, and DenseNet improve their feature extraction capability by adding layers. Width refers to the number of channels in the feature maps; for example, Wide-ResNet achieves better performance by increasing the width. Cardinality refers to the number of convolution groups, and scale refers to the number of feature groups in the network. We compare RWA with these improved models experimentally; RWA still achieves the best performance with a small parameter size and low computational cost. On the other hand, as the above factors grow, the over-fitting risk of the models also increases; for example, the accuracy of ResNet101 with 101 layers is lower than that of ResNet18 with only 18 layers. Among the compared models, VGG16 and Res2Net50-14w-8s perform best. The former has a simple structure that reduces over-fitting to a certain extent, leading to better performance, while the latter adds multi-scale feature representation, which is better suited to extracting global and local features of tongue images. This suggests that multi-scale features are crucial to constitution recognition, which is consistent with the idea behind our method. Although DenseNet has the smallest parameter size and its performance improves with depth, its best accuracy is only 51.55%, still 2.40% lower than that of our method. It is noteworthy that our method uses fewer parameters than VGG16 yet achieves 1.37% higher accuracy. All these results further show the superiority of our method.

Table 5 Performance of proposed method and recent hybrid methods for the tongue constitution recognition, where ResNet18 is the backbone network (Bold: Best Results)

7.7 Visualization Analysis for Our Method

Parameter size and performance: To assess the proposed RWA in terms of both parameter size and performance, several methods from Tables 3, 4, and 5 are compared. A smaller parameter size is generally preferable, because it reduces the consumption of hardware resources such as computing power and storage, and because larger models over-fit more easily. Figure 5 shows that RWA uses its parameters more effectively than any other method: it achieves the highest accuracy with a smaller or similar parameter size, indicating that RWA not only achieves better performance but also needs fewer parameters.

Fig. 5 Relationship between parameter size and performance for different models, where each point represents the result of a single model and each broken line represents variants of a model with different complexity

Explicability of the proposed method: Because each constitution type can be described by attributes of the tongue image and each attribute has semantics, a constitution type can be explained by its attributes. Our method predicts the attributes and then determines the constitution type by checking whether they are close to the true attributes marked by doctors, so the predicted constitution type can be explained by the predicted attributes to a certain extent. The proposed method is therefore interpretable and can provide evidence for the diagnosis results to both doctors and patients. To illustrate this, one sample of each constitution type is selected from the test tongue images, and its predicted and true attribute vectors are presented together. Figure 6 shows that, for all samples of the different constitution types, the predicted attribute vectors are very similar to the true attribute vectors, which not only shows that our method can accurately identify constitution types but also provides attributes as evidence for the recognized types. Our method alleviates the uncertainty caused by the black-box nature of deep neural networks, making it feasible to assist doctors in diagnosis.

Fig. 6 Visualization of the predicted and true attributes of test samples of the nine constitution types, where “1” indicates that the corresponding attribute is selected and it is not selected otherwise

8 Conclusion

A new interpretable tongue constitution recognition method based on a reshaped wavelet attention network is proposed. Its advantages lie in that the wavelet attention improves the ability of the deep neural network to extract multi-scale features, while the reshape mechanism enables the network to extract multi-level features. It also makes full use of domain knowledge, which not only reduces the negative impact of small samples and thereby improves performance, but also makes the predicted constitution type interpretable, reducing the risk posed by the black-box nature of deep learning methods. These advantages allow doctors and patients to understand the diagnostic results easily and improve the acceptability of tongue constitution recognition methods. Meanwhile, the proposed method has a small parameter size and fast speed, so it can support many lightweight applications. However, the accuracy of tongue diagnosis alone is not sufficient; it should be combined with the other diagnostic methods, such as listening, asking, and feeling the pulse, to further improve performance. This is our future work.