1 Introduction

In recent years, online medical services have grown rapidly under the continuing impact of the COVID-19 epidemic [1]. A growing number of people seek medical services through online health platforms, such as Haodf and Dxy in China and Teladoc, DoctorOnDemand and Amwell in the United States. The emergence of these platforms has given patients access to more medical resources than traditional offline visits. Generally, these platforms list many doctors across most specialties, together with a large amount of descriptive information about each doctor, such as certification, professional title, total number of visits and patients’ historical reviews. They allow patients to select and interact with a doctor through communication technology, such as video, phone or messaging, without limitations of time and place [2].

Many scholars have studied the online medical industry. Some studies investigated the factors that influence satisfaction with telemedicine services from both the doctor and the patient perspective [3]. Other studies raised concerns about the quality of telemedicine [4] and compared telemedicine with traditional offline services through randomized or quasi-randomized controlled trials. By contrast, research on how patients select online doctors is limited. Given the massive and heterogeneous information provided by health platforms, how to help patients find a high-quality and appropriate doctor according to their personal preferences remains an understudied question. A good method for helping patients select appropriate doctors can increase overall patient satisfaction and build harmonious doctor-patient relationships. Applying recommendation techniques to the online health field faces a data sparsity problem, since patients usually have few interactions with doctors on the platform over a long period [5]. Guo et al. [6] deployed an unsupervised aggregation approach that integrates various ranking features to identify doctors who have the potential to be key opinion leaders on a range of diseases. However, the attribute weights in their ranking method are derived only objectively from the data, i.e., they ignore the patient’s personal preference, and the method does not consider the interactions among different service features.

People who search for online medical services may have different expectations. Some patients use online health services simply for convenience, because their conditions are not serious. Other patients seek a “second opinion”: for example, because of a complicated health status, they consult several professionals to find an optimal treatment plan. As Lu and Wu [7] indicated, disease risk can significantly influence a patient’s choice. Patients who suffer from a high-risk disease are more worried and hope to find a doctor with higher technical quality, rather than a doctor who only shows highly caring behavior. Conversely, patients affected by relatively low-risk diseases care more about the functional quality of doctors, and caring behavior makes them more satisfied. In some situations, patients also place high demands on response time. Therefore, it is important to consider the patient’s personal preference in the selection process. Besides, information asymmetry between the doctor and the patient is a well-known problem in the medical industry [8], and it also exists in the online environment. It may lead some doctors to recommend more or less treatment than a fully informed patient would choose. The historical patient reviews displayed on health platforms can alleviate this kind of asymmetry and serve as an essential reference for the quality of a doctor’s services. Previous studies have also shown that the information in online reviews influences patients’ decisions [9].

Based on the above analysis, this study proposes a multi-attribute decision-making (MADM) approach to assist the patient in selecting a high-quality and appropriate doctor. The main contributions of this study are as follows:

  1) To extract patient opinions from review texts, a two-stage classification model based on BERT is devised. The extracted features are merged with other attributes that could influence patients’ decisions to build the complete decision matrix. The Choquet integral with respect to a fuzzy measure is used to model the interactions among attributes and to aggregate the attribute information.

  2) To provide rankings for patients with different requirements, the attribute weight allocation takes both the public and the personalized preferences into account. One optimization model is constructed to identify the public preference based on doctors’ consultation volume. Another optimization model is set up to combine the public and the personalized preferences.

The rest of this paper is organized as follows. Section 2 briefly reviews related work on health service quality and evaluation and on sentiment analysis in the health context. Section 3 presents definitions and theorems for the Choquet integral and the fuzzy measure. Section 4 introduces the framework of the proposed method, which consists of four parts: sentiment analysis of patient reviews, public preference identification, personal preference elicitation and preference integration. Section 5 is a case study that uses data from Dxy.com to illustrate the procedure of the proposed method. Section 6 compares and discusses the proposed correlated-attributes-based method and some conventional independent-attributes-based MADM methods. The conclusion and future work are presented in the last part.

2 Related works

2.1 Healthcare service quality and evaluation

The majority of studies used relatively generic quality models to evaluate health or medical services. One of the most representative is the SERVQUAL model and its variants, which divide service quality into five dimensions (tangibility, reliability, responsiveness, assurance and empathy) and focus mainly on the perception of the service process [10]. For example, Singh and Prasher [11] combined SERVQUAL and the fuzzy analytic hierarchy process (AHP) to evaluate the services of the Indian healthcare sector. Grönroos [12] divided service quality perception into technical quality (outcome quality) and functional quality (process quality): technical quality focuses on skills, the precision of practices and procedures, diagnosis accuracy and treatment outcome, while functional quality refers to the way in which the services are provided to patients. Lu and Wu [7] used empirical data to examine the effect of technical quality and functional quality on patients’ online booking choices. The two factors are rated directly by patients who have reviewed a doctor. They used ordinary least squares to test their hypotheses and also found that the two factors have an interaction effect. Lu et al. [13] used the SERVQUAL theory to measure customer perception of functional quality and combined it with technical quality to build a relatively comprehensive evaluation index for telemedicine. Sheng and Teo [14] proposed that every experience perceived by customers is a complex feeling, and that such feelings are difficult to distinguish because they may overlap and are clearly interrelated. The Shapley index is an effective tool for determining the weights of experts with interactive relationships, and it can be represented by fuzzy measures [15]. However, the fuzzy measures in existing studies are given subjectively by experts or obtained from experiments designed by researchers. For non-expert decision-making problems based on online reviews, obtaining the fuzzy measures not only from subjective opinion but also from real historical online decision data makes the decision-making process more objective. To aggregate the information of interactive attributes, researchers use the Choquet integral as an effective tool [16].

Meanwhile, some studies constructed more specific evaluation dimensions from the patient’s perspective. Reader et al. [17] proposed a patient complaint taxonomy that summarizes complaints with three concepts: clinical, management and relationships. Later research [18] supported its rationality. López et al. [19] categorized patient reviews into two evaluation types, specific descriptions and global remarks, and identified three topics in the specific descriptions: interpersonal manner, technical competence and system issues. Xu et al. [20] used text mining to identify seven aspects in online patient reviews, namely bedside manner, diagnosis accuracy, waiting time, service time, insurance process, doctor knowledge and office environment. They also explored how the different dimensions of doctor reviews affect patients’ booking behavior. Table 1 lists some topics identified by previous works. In addition, some studies indicated that, since most patients lack the knowledge and information required to evaluate the technical (clinical) quality of services, their assessment of service quality is shaped by the non-clinical or functional aspects [21]. Hong et al. [22] summarized the studies that compared online patient reviews with traditional patient satisfaction surveys. They reported a considerable overlap between the narrative comments of patient reviews and the traditional surveys, but also identified several domains of patient experience that the traditional surveys did not cover.

Table 1 The identified topics in patients’ reviews

Online medical services and traditional medical services differ in some ways. As a result, the evaluation indexes for online and offline medical services also differ slightly. Many studies evaluated online medical services at an aggregate level, focusing on a range of specialties or diseases and testing feasibility or reliability, rather than at the level of a single doctor. Few studies proposed an evaluation index system specifically for online medical services. Nouri et al. [23] summarized and categorized the evaluation criteria for m-health apps. Lu et al. [24] drew on mobile service quality evaluation and traditional medical quality evaluation to propose service quality evaluation indexes for online medical services from the perspective of regional doctors.

2.2 Sentiment analysis with online reviews

Sentiment analysis (SA) is a widely used technique for analyzing consumers’ subjective text reviews of products or services; it transforms text reviews about shopping experience and satisfaction into a quantifiable sentiment orientation [31]. Consumers and business managers can use sentiment analysis to obtain objective opinions that support consumption or other decisions [32]. Since the selection of online medical services is a non-expert decision-making problem, the objective opinions extracted from reviews provide a decision basis for consumers. Applications in health care include measuring health care quality [33] and monitoring the outcomes and effectiveness of medications or medical treatments [34].

Existing sentiment analysis methods can be categorized into two types: 1) lexicon-based methods, which are suitable for formal sentences with simple patterns; and 2) machine-learning-based methods, which perform better on relatively flexible and complicated sentences [35]. Some works combined the two to obtain a more complete and accurate result [36]. Fine-grained sentiment analysis requires the additional task of aspect identification. Zunic et al. [37] reviewed sentiment analysis in the health care context. They summarized the commonly used methods and found that few studies applied deep learning to analyze sentiment orientation in health care, although deep learning has been shown to perform better in the sentiment analysis of product or film reviews. Denecke and Deng [38] proposed different facets of sentiment in health-related texts, such as the accuracy and certainty of a diagnosis, the effectiveness of a treatment outcome, changes in health status and experiences with a kind of drug. They argued that sentiment in the medical context is sometimes difficult to capture in a few words. Yadav et al. [39] used a convolutional neural network (CNN) structure to build two classification algorithms for the sentiment analysis of two major medical aspects, medical condition and treatment. They also tested the SentiWordNet (SWN) lexicon and found that it did not work well in the classification task.

In this study, the BERT neural network [40] is used to identify public opinion. BERT is a pretrained language model proposed by researchers at Google, which at its release achieved state-of-the-art results on 11 natural language processing tasks, including those of the GLUE (General Language Understanding Evaluation) benchmark. Its stacked self-attention layers are well suited to modeling the complex interactions among the characters of an input sequence. Compared with traditional machine learning methods used in the Chinese context, such as decision trees, support vector machines and Naive Bayes, BERT has two advantages. First, the minimum unit of input is the character rather than the word, so word segmentation is unnecessary before the text is sent to the model. This reduces the impact of word segmentation errors to some extent, which makes BERT well suited to non-standard Chinese online comment texts. Second, the BERT model does not need feature engineering: preprocessing steps such as stop-word removal and n-gram feature construction are not required for the pretrained BERT model.

3 Choquet Integral and 2-additive fuzzy measure

The Choquet integral with respect to a fuzzy measure is an effective and widely used method for dealing with interactions between attributes in MADM problems [15, 41]. This section introduces the definitions and notation related to the Choquet integral and the fuzzy measure that are used in the rest of this paper.

Definition 1

[42] Let \(X=\{x_{1},x_{2},\ldots,x_{n}\}\) be any finite set and let P(X) be the power set of X. A discrete fuzzy measure on X is a mapping μ : P(X) → [0,1] satisfying:

$$ \mu (\emptyset ) = 0,\mu (X) = 1 $$
(1)
$$ A \subset B \subset X\ \text{implies}\ \mu (A) \le \mu (B) $$
(2)

Definition 2

[43] Let μ be a fuzzy measure on X. The Möbius transform φ of μ is defined by:

$$ \varphi (A) = \sum\limits_{K \subset A} (-1)^{\mid A \setminus K \mid} \mu (K),\ \forall A \subset X $$
(3)

Under the Möbius transform, the two constraints in Definition 1 are converted to the following form:

$$ \varphi (\emptyset ) = 0,\sum\limits_{A \subset X} {\varphi (A) = 1} $$
(4)
$$ \sum\limits_{i \in B \subset A} \varphi (B) \ge 0,\ \text{for all}\ A \subset X,\ \text{for all}\ i \in A $$
(5)
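
As a minimal sketch of Definition 2, the following Python snippet computes the Möbius transform of a small fuzzy measure by direct enumeration and checks conditions (4) and (5). The two-attribute measure `mu` and the helper names are hypothetical, chosen only for illustration, and subsets are read inclusively (a set counts as a subset of itself).

```python
from itertools import combinations

def subsets(items):
    """Yield every subset of `items` (including the empty set) as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def mobius(mu, X):
    """Eq. (3): phi(A) is the signed sum of mu(K) over all subsets K of A."""
    return {A: sum((-1) ** len(A - K) * mu[K] for K in subsets(A))
            for A in subsets(X)}

def is_monotone(phi, X, tol=1e-12):
    """Eq. (5): for every A and every i in A, sum of phi(B) with i in B, B in A, is >= 0."""
    return all(
        sum(phi[B] for B in subsets(A) if i in B) >= -tol
        for A in subsets(X) for i in A
    )

# Hypothetical measure on two attributes (illustrative values only).
X = ["x1", "x2"]
mu = {frozenset(): 0.0, frozenset({"x1"}): 0.3,
      frozenset({"x2"}): 0.4, frozenset(X): 1.0}
phi = mobius(mu, X)
print(phi[frozenset(X)])        # coefficient of the pair {x1, x2}
print(sum(phi.values()))        # should be (numerically) 1, as in Eq. (4)
print(is_monotone(phi, X))
```

Because the enumeration visits all subsets, this sketch is only practical for small n, which is the cost that motivates the k-additive restriction.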

The original fuzzy measure is a monotone set function with \(2^{n}\) coefficients to be determined. As n grows, the computational complexity increases significantly. To deal with this problem, the k-additive fuzzy measure [44], based on the Möbius transform, was proposed as a good trade-off between computational complexity and representation capability.

Definition 3

[44] A fuzzy measure μ is said to be k-additive if its Möbius transform satisfies φ(A) = 0 for all A ⊂ X with |A| > k and there exists at least one subset B such that |B| = k and φ(B) ≠ 0.

The 2-additive fuzzy measure is a widely used form of the k-additive fuzzy measure. It satisfies φ(A) = 0 for all A ⊂ X with |A| > 2, and there exists at least one subset B such that |B| = 2 and φ(B) ≠ 0. It considers only the interactions between pairs of elements and ignores higher-order interactions. Since there are only n(n + 1)/2 coefficients to be determined, it offers a good compromise between low complexity and rich representation capability. In the proposed online doctor selection method, the 2-additive fuzzy measure is adopted to represent the attribute weights.

Definition 4

[45] Let φ be the Möbius transform of μ, the Shapley importance index which reveals the average contribution of i to all the relevant subsets can be calculated as:

$$ I_{i} = \sum\limits_{A \subset X\backslash \{ x_{i} \} } \frac{1}{\vert A\vert + 1}\varphi (A \cup \{ x_{i} \} ),\quad \sum\limits_{i = 1}^{n} I_{i} = 1 $$
(6)

In order to intuitively approach the concept of interaction, the Shapley interaction index can be given by:

$$ I_{ij} = \sum\limits_{A \subset X\backslash \{ x_{i} , x_{j} \} } \frac{1}{\vert A\vert + 1}\varphi (A \cup \{ x_{i} , x_{j} \} ),\quad I_{ij} \in [ - 1,1] $$
(7)

In the case of 2-additive fuzzy measure, the Shapley importance index and interaction index will be:

$$ I_{i} = \varphi (\{ x_{i} \} ) + \frac{1}{2}\sum\limits_{x_{j} \in X\backslash \{ x_{i} \} } \varphi (\{x_{i} ,x_{j} \} ) $$
(8)
$$ I_{ij} = \varphi (\{x_{i} ,x_{j} \} ) $$
(9)

The interaction index \(I_{ij}\) measures the average interaction between two elements i and j. It ranges from redundancy (negative) to complementarity (positive): the value 1 represents maximum complementarity between elements i and j (positive interaction), and the value -1 represents maximum substitutivity (negative interaction). Therefore, the Shapley interaction index can be used to measure the interactions among correlated attributes in the online doctor selection method, and the Shapley importance index is used to reflect the average importance of each of a doctor’s service attributes.
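
The 2-additive formulas (8) and (9) can be sketched in a few lines of Python. The attribute names and Möbius coefficients below are hypothetical, and pairs missing from the dictionary are assumed to have a zero coefficient (no interaction).

```python
from itertools import combinations

def shapley_2additive(phi, X):
    """Return (importance, interaction) for a 2-additive Moebius representation.

    `phi` maps frozensets of size 1 or 2 to Moebius coefficients.
    """
    # Eq. (8): singleton coefficient plus half of every pair coefficient involving i.
    importance = {
        i: phi.get(frozenset({i}), 0.0)
           + 0.5 * sum(phi.get(frozenset({i, j}), 0.0) for j in X if j != i)
        for i in X
    }
    # Eq. (9): the interaction index is the pair coefficient itself.
    interaction = {
        (i, j): phi.get(frozenset({i, j}), 0.0) for i, j in combinations(X, 2)
    }
    return importance, interaction

# Hypothetical attributes and coefficients (they sum to 1 and satisfy Eq. (5)).
X = ["price", "title", "reply_speed"]
phi = {
    frozenset({"price"}): 0.2,
    frozenset({"title"}): 0.3,
    frozenset({"reply_speed"}): 0.3,
    frozenset({"title", "reply_speed"}): 0.4,   # complementary pair
    frozenset({"price", "title"}): -0.2,        # redundant pair
}
importance, interaction = shapley_2additive(phi, X)
print(importance)
print(interaction)
```

In this toy example the negative pair coefficient lowers the importance of `price` below its singleton coefficient, while the positive one raises the importance of `reply_speed`.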

Definition 5

[42] The Choquet integral of a measurable function \(f: X \rightarrow \mathbb{R}^{+}\) with respect to μ is defined by:

$$ C_{\mu} (f) = \sum\limits_{i = 1}^{n} \left( f(x_{(i)} ) - f(x_{(i - 1)} ) \right) \mu (\{ x_{(i)} , \ldots , x_{(n)} \} ) $$
(10)

where (·) denotes a permutation of the indices such that \(f(x_{(1)}) \le f(x_{(2)}) \le \cdots \le f(x_{(n)})\), and \(f(x_{(0)}) = 0\). In terms of the Möbius transform φ, the Choquet integral can be written as [45]:

$$ C_{\varphi} (f) = \sum\limits_{A \subset X} {\varphi (A)\mathop {\min}\limits_{i \in A} \{ f(x_{i})\} } \ $$
(11)

Based on the identified fuzzy measure φ, the Choquet integral can be used to aggregate the attribute information to obtain the overall evaluation score and the ranking of the alternative online doctors.
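
To illustrate that (10) and (11) give the same score, the sketch below evaluates both forms on a toy example. The measure `mu`, its Möbius coefficients `phi` and the attribute values `f` are hypothetical; `phi` was obtained from `mu` via (3).

```python
def choquet_sorted(f, mu):
    """Eq. (10): weight each increment of the sorted values by mu of the upper set."""
    order = sorted(f, key=f.get)            # x_(1), ..., x_(n), ascending in f
    total, previous = 0.0, 0.0              # previous plays the role of f(x_(0)) = 0
    for pos, x in enumerate(order):
        upper = frozenset(order[pos:])      # {x_(i), ..., x_(n)}
        total += (f[x] - previous) * mu[upper]
        previous = f[x]
    return total

def choquet_mobius(f, phi):
    """Eq. (11): sum of phi(A) times the smallest attribute value in A."""
    return sum(w * min(f[x] for x in A) for A, w in phi.items() if A)

# Hypothetical two-attribute example.
mu = {frozenset({"x1"}): 0.3, frozenset({"x2"}): 0.4,
      frozenset({"x1", "x2"}): 1.0}
phi = {frozenset({"x1"}): 0.3, frozenset({"x2"}): 0.4,
       frozenset({"x1", "x2"}): 0.3}
f = {"x1": 0.8, "x2": 0.5}

score_sorted = choquet_sorted(f, mu)
score_mobius = choquet_mobius(f, phi)
print(score_sorted, score_mobius)
```

Both calls evaluate to 0.59 up to floating-point rounding: 0.5 · 1.0 + 0.3 · 0.3 in the sorted form, and 0.3 · 0.8 + 0.4 · 0.5 + 0.3 · 0.5 in the Möbius form.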

4 Research framework

4.1 Problem description

An online medical platform that provides online consultation services usually shows doctors’ various attributes and patients’ historical reviews on its web pages. Users need to decide which doctor to choose as the provider of medical consulting services. This study aims to assist a patient in selecting an online doctor from several candidates based on this information. Because patients have a limited understanding of online doctors, online text reviews play an important role in obtaining decision-making attributes. At the same time, users usually have their own personal preferences, which also need to be taken into account. An intelligent doctor ranking system can adapt to patients with different personal requirements and help them select the most desirable and satisfactory doctor. The proposed selection method is composed of four parts: 1) sentiment analysis of the patient reviews; 2) public preference learning with a 2-additive fuzzy measure; 3) personalized preference representation using the fuzzy best-worst method (Fuzzy-BWM); and 4) integration of the two preferences and ranking of the doctors. Figure 1 shows the procedure of the proposed method.

Fig. 1 The procedure of the proposed method

4.2 Data acquisition

A web crawler written in Python 3.7 is used to obtain all the doctors’ service attributes (such as consultation price, professional title and average waiting time) and the patients’ historical review data from the health platform website. Let \(D = \{d_{1}, d_{2}, \ldots, d_{l}\}\) be the set of doctors, \(C = \{c_{1}, c_{2}, \ldots, c_{m}\}\) be the set of the doctors’ service attributes, and \(R = \{R_{1}, R_{2}, \ldots, R_{l}\}\) be the collection of patient review sets, where \(R_{k}\) is the set of text reviews of doctor \(d_{k}\).

4.3 Sentiment analysis based on BERT

For the online medical service industry, users’ historical reviews are an important resource that reflects a doctor’s service quality. To obtain the sentiment orientation of the reviews, a two-stage sentiment analysis framework is devised. Users usually describe their consultation experience from several aspects. Thus, the first stage predicts which aspects each review talks about, and the second stage predicts the sentiment orientation toward each mentioned aspect. One classification model is constructed for each stage. The deep neural network BERT is adopted as the encoder of both classification models [40]. As a pretrained language model, it transforms a string of text into context-aware, high-dimensional dense tensors. For the small number of reviews whose length exceeds the maximum of 512, the excess text is discarded. The BERT layer of both classification models is initialized with the parameters of the pretrained “hfl/chinese-bert-wwm-ext” model [46], and the parameters of the other parts are randomly initialized. The sentiment analysis program is implemented in Python with the transformers package. Figure 2 shows the framework of the two-stage sentiment analysis. The detailed process is described as follows.

Fig. 2 The framework of the two stage sentiment analysis

4.3.1 Aspect classification

In the first stage, suppose there are n predefined aspects across all the reviews. To identify which aspects each review in the review sets R talks about, a multi-label classification model, denoted Model1, is constructed. Multi-label means that each review may involve more than one aspect. The model contains a basic BERT layer with the default configuration and a dense layer with a sigmoid activation function. The input of the BERT layer is the raw text of each review. The input of the dense layer is the output of the BERT layer at the [CLS] token. Each output dimension of the dense layer corresponds to one aspect. The calculation process is given in (12). A set of labeled samples is then prepared to train this model. After training, Model1 is used to predict the aspects mentioned in all the unlabeled reviews in R. This yields the collection of aspect vector sets \(T = \{T_{1}, T_{2}, \ldots, T_{l}\}\), where \(T_{i}\) is composed of the aspect vectors \(t_{j}\ (j = 1, 2, \ldots, |R_{i}|)\) of the reviews in \(R_{i}\). The value of element \(t_{jk}\ (k = 1, 2, \ldots, n)\) of the aspect vector \(t_{j}\) is assigned as in (13).

$$ \begin{array}{c} text\_vector=Character\_dictionary(review\ text)\\ aspect\_vector=Dense\_layer(Bert\_layer(text\_vector)) \end{array} $$
(12)
$$ t_{jk}= \begin{cases} 1,\ &if\ review\ r_{j}\ is\ predicted\ to\ contain\ the\ \\&aspect\ k.\\ 0,\ &if\ review\ r_{j}\ is\ not\ predicted\ to\ contain\ \\&the\ aspect\ k.\\ \end{cases} $$
(13)
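
The mapping from the dense-layer outputs to the 0/1 aspect vector in (13) can be sketched as follows. The source does not state the decision cutoff, so the threshold of 0.5 on the sigmoid output is an assumption, and the logits are made-up values rather than real model outputs.

```python
import math

def aspect_vector(logits, threshold=0.5):
    """Map one review's dense-layer logits to the 0/1 aspect vector of Eq. (13).

    Assumption: an aspect counts as mentioned when its sigmoid
    probability reaches `threshold`.
    """
    probabilities = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical logits for one review and n = 4 aspects.
logits = [2.1, -1.3, 0.4, -3.0]
t_j = aspect_vector(logits)
print(t_j)  # [1, 0, 1, 0]
```

Because each aspect is thresholded independently, a single review can be assigned several aspects, which is what distinguishes the multi-label setting from ordinary multi-class classification.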

4.3.2 Sentiment orientation classification

In the second stage, to predict the sentiment orientation toward the identified aspects of each review, a second classification model, denoted Model2, is constructed. It is composed of a basic BERT layer with the default configuration, a dense layer and a softmax layer. The input format of the BERT layer follows [47]: for each aspect, several words are selected as the aspect word, and the aspect word is repeated three times to form the second sentence. The input has the form “[CLS] raw review text [SEP] aspect word aspect word aspect word”. For example, if Model1 predicts that the sentence “The replies were very detailed and easy to understand” talks about the quality of question and answer (Q&A) and the string “Q&A quality” is selected as the aspect word, the input is “[CLS] The replies were very detailed and easy to understand [SEP] Q&A quality Q&A quality Q&A quality”. The sentiment orientation takes one of three values: positive, negative or neutral. The calculation process is given in (14). Another set of labeled samples is prepared to train Model2. In the prediction phase, Model2 classifies the sentiment orientation toward each aspect contained in each unlabeled review in R. The output is the collection of sentiment vector sets \(S = \{S_{1}, S_{2}, \ldots, S_{l}\}\), where \(S_{i}\) is composed of the sentiment vectors \(s_{j}\ (j = 1, 2, \ldots, |R_{i}|)\) of the reviews in \(R_{i}\). The value of element \(s_{jk}\ (k = 1, 2, \ldots, n)\) of the sentiment vector \(s_{j}\) is assigned as in (15).

$$ \begin{array}{c} text\_vector=Character\_dictionary(review\ text, aspect\ word)\\ sentiment\_vector=Softmax\_layer(Dense\_layer(Bert\_layer(text\_vector))) \end{array} $$
(14)
$$ s_{jk}= \begin{cases} 1,\ &if\ review\ r_{j}\ is\ predicted\ to\ be\ positive\ toward \\ &aspect\ k.\\ -1,\ &if\ review\ r_{j}\ is\ predicted\ to\ be\ negative\ toward \\ &aspect\ k.\\ 0,\ &if\ review\ r_{j}\ is\ predicted\ to\ be\ neutral\ toward \\ &aspect\ k\ or\ predicted\ not\ to\ contain\ the\ aspect\ k. \end{cases} $$
(15)
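
A small sketch of the Model2 input format and of the assignment rule (15) follows. The function names and the `predicted_labels` dictionary are hypothetical stand-ins for the model output; in practice a BERT tokenizer would normally insert the [CLS] and [SEP] tokens itself when given a sentence pair, so the string built here only mirrors the format described in the text.

```python
LABEL_TO_SCORE = {"positive": 1, "negative": -1, "neutral": 0}

def build_input(review, aspect_word, repeats=3):
    """Format a review/aspect pair as '[CLS] review [SEP] aspect aspect aspect'."""
    return f"[CLS] {review} [SEP] " + " ".join([aspect_word] * repeats)

def sentiment_vector(aspect_vec, predicted_labels):
    """Eq. (15): 0 for aspects that are absent, otherwise the mapped label.

    `predicted_labels` maps the index k of each mentioned aspect to the
    label predicted for it (a hypothetical stand-in for Model2's output).
    """
    return [LABEL_TO_SCORE[predicted_labels[k]] if present else 0
            for k, present in enumerate(aspect_vec)]

text = build_input("The replies were very detailed and easy to understand",
                   "Q&A quality")
print(text)

# One review that mentions aspects 0 and 2 out of n = 3 aspects.
s_j = sentiment_vector([1, 0, 1], {0: "positive", 2: "negative"})
print(s_j)  # [1, 0, -1]
```

Note that a value of 0 in the sentiment vector is ambiguous on its own (neutral or not mentioned); the aspect vector from the first stage is what tells the two cases apart.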

In the case study below, labelling the training samples showed that the distribution of categories is extremely unbalanced: most of the reviews in the training samples are positive. To enhance the model’s ability to distinguish negative reviews, the loss of the negative class is doubled. That is, the loss for misclassifying a negative review is twice the loss for misclassifying a positive review. Some artificial negative samples are also constructed as a supplement by imitating the original negative samples. The pseudocode of the above sentiment analysis process is shown in Algorithm 1.
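
The doubled loss for the negative class can be illustrated with a class-weighted cross-entropy for a single sample. This is a sketch rather than the training code of the study: the weight of 1.0 for the neutral class is an assumption (the text only compares negative with positive), and the probabilities are made up.

```python
import math

# Weight 2.0 for the negative class as described in the text;
# the neutral weight of 1.0 is an assumption.
CLASS_WEIGHTS = {"positive": 1.0, "neutral": 1.0, "negative": 2.0}

def weighted_cross_entropy(probabilities, true_label):
    """Cross-entropy of one sample, scaled by the weight of its true class."""
    return -CLASS_WEIGHTS[true_label] * math.log(probabilities[true_label])

# Two hypothetical predictions that are equally wrong about their true class.
predicted_for_negative = {"positive": 0.6, "neutral": 0.2, "negative": 0.2}
predicted_for_positive = {"positive": 0.2, "neutral": 0.2, "negative": 0.6}

loss_negative = weighted_cross_entropy(predicted_for_negative, "negative")
loss_positive = weighted_cross_entropy(predicted_for_positive, "positive")
print(loss_negative / loss_positive)  # ratio of the two losses
```

With equal predicted probability for the true class, the misclassified negative review contributes twice the loss of the misclassified positive one, which pushes the optimizer to pay more attention to the minority class.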

Algorithm 1 Sentiment analysis algorithm.

4.3.3 Defining service features from texts

After obtaining the aspect vector set T and the sentiment vector set S, two types of features are calculated to reflect the differences among doctors’ reviews.

For doctor \(d_{i}\), the first type of feature \(b_{k}\ (k = 1, \ldots, n)\) is the proportion of the doctor’s reviews that mention aspect k. The second type of feature \(b_{k+n}\ (k = 1, \ldots, n)\) is the proportion of positive reviews among the reviews that mention aspect k.

$$ b_{k}=\frac{{\sum}_{j=1}^{\mid R_{i}\mid } t_{jk}}{\mid R_{i}\mid } $$
(16)
$$ b_{k+n}=\frac{{\sum}_{1 \le j \le \mid R_{i}\mid,\ s_{jk}=1} s_{jk}}{{\sum}_{j=1}^{\mid R_{i} \mid } t_{jk}} $$
(17)

where \(\lvert R_{i}\rvert \) represents the total number of reviews about doctor \(d_{i}\). Let the set \(B = \{b_{1}, b_{2}, \ldots, b_{2n}\}\) represent the features calculated from the review texts. Combining B with the attribute set C gives the complete attribute set X = {B,C}. Let the matrix \(E_{(m+2n)\times l}\) represent the attribute values of all the doctors.
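
The two feature types in (16) and (17) can be computed directly from the aspect and sentiment vectors of one doctor. The sketch below uses made-up vectors for four reviews and two aspects; returning 0.0 for an aspect that no review mentions is an assumption, since (17) is undefined in that case.

```python
def review_features(T_i, S_i):
    """Return [b_1, ..., b_n, b_{n+1}, ..., b_{2n}] for one doctor.

    T_i and S_i are lists of per-review aspect and sentiment vectors.
    """
    n_reviews, n_aspects = len(T_i), len(T_i[0])
    mentions = [sum(t[k] for t in T_i) for k in range(n_aspects)]
    positives = [sum(1 for s in S_i if s[k] == 1) for k in range(n_aspects)]
    # Eq. (16): share of reviews that mention aspect k.
    share = [mentions[k] / n_reviews for k in range(n_aspects)]
    # Eq. (17): share of positive reviews among those that mention aspect k.
    # Assumption: an aspect that is never mentioned gets 0.0.
    positive_share = [positives[k] / mentions[k] if mentions[k] else 0.0
                      for k in range(n_aspects)]
    return share + positive_share

# Hypothetical vectors for a doctor with four reviews and two aspects.
T_i = [[1, 0], [1, 1], [0, 1], [1, 0]]
S_i = [[1, 0], [-1, 1], [0, 1], [1, 0]]
B_i = review_features(T_i, S_i)
print(B_i)
```

Here aspect 1 is mentioned in three of four reviews, two of them positive, and aspect 2 is mentioned in two reviews, both positive.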

4.4 Public preference learning based on 2-additive fuzzy measure

To help a user select an appropriate doctor on an online health platform, especially a new user who has little knowledge about what kind of doctor is appropriate for her or him, it is helpful to refer to the selection experience of others. Generally, the preferences of the individuals in a group differ slightly from each other. In the group decision-making field, different consensus reaching processes (CRPs) are used to aggregate individual preferences into a preference that represents the whole public [48,49,50]. On an online health platform, the aggregated public preference reflects how the public values the different service attributes. It can be inferred from each doctor’s consultation volume in a fixed time period. A high consultation volume means that the doctor’s service attributes satisfy the needs of more users. When a user chooses a specific doctor, this doctor is the user’s best alternative after a limited comparison. To deal with potentially complex relationships among service attributes, the 2-additive fuzzy measure is adopted to represent the public group preference. It can model pairwise attribute interactions, including redundancy and complementarity. If no interaction exists among the attributes, it reduces to the weighted arithmetic mean. Considering attribute interactions can increase the accuracy of the learned public preference. The attributes in the set X are the decision attributes, and the Choquet integral aggregates the interactive attribute information. An optimization model based on [43] is constructed to identify the parameters of the 2-additive fuzzy measure. Each doctor’s consultation volume in a fixed time period is taken as the doctor’s overall evaluation score. The idea is to use the Choquet integral to fit each doctor’s consultation volume. Therefore, the objective function of the optimization model minimizes the sum, over all doctors, of the squared difference between the doctor’s Choquet integral, calculated from the attribute values in the matrix E, and his or her overall evaluation score.

$$ Z = \sum\limits_{k = 1}^{l} \left( C_{a} (f_{1}^{k} , \ldots , f_{m + 2n}^{k} ) - s_{k} \right)^{2} $$
(18)

where \(s_{k}\) is the overall evaluation score of doctor k and \( f_{1}^{k} , \ldots ,f_{m + 2n}^{k} \) are the normalized values of the attributes of doctor k. The constraints are the boundary condition (4) and the monotonicity constraint (5) of the fuzzy measure. Before the values in the matrix \(E_{(m+2n)\times l}\) are fed into the optimization model, min-max normalization is applied to each attribute across all doctors. After transformation, this optimization problem reduces to a quadratic programming problem with \(2^{m+2n} - 1\) variables and \((m + 2n) \cdot 2^{(m+2n)-1} + 1\) constraints, written as follows [43]:

$$ \begin{array}{c} \text{minimize}\ Z = \frac{1}{2} a^{T} F_{a} a + {\Gamma}_{a}^{T} a\\ s.t.\left\{ \begin{array}{l} f^{T} a = 1\\ G a + b\ge 0 \end{array} \right. \end{array} $$
(19)

where a is the variable vector containing the values \(\varphi_{b}(A)\) for all A ⊂ X, \(F_{a}\) and G are parameter matrices, and \({\Gamma}_{a}\) and f are parameter vectors. Solving the quadratic programming problem yields the 2-additive fuzzy measure \(\varphi_{b}\) that represents the aggregated public preference.
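
The following sketch does not solve the quadratic program; it only shows how the objective (18) is evaluated for one candidate 2-additive measure after min-max normalization. All data are hypothetical, and normalizing the consultation volumes to [0,1] so that they are comparable with Choquet scores is an assumption of this sketch.

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1]."""
    low, high = min(values), max(values)
    # Assumption: a constant attribute is mapped to 0.0 to avoid division by zero.
    return [(v - low) / (high - low) if high > low else 0.0 for v in values]

def choquet_mobius(f, phi):
    """Eq. (11): Choquet integral from Moebius coefficients."""
    return sum(w * min(f[x] for x in A) for A, w in phi.items())

def objective(phi, doctors, scores):
    """Eq. (18): sum of squared gaps between Choquet scores and target scores."""
    return sum((choquet_mobius(f, phi) - s) ** 2 for f, s in zip(doctors, scores))

# Hypothetical raw data for three doctors and two attributes.
raw = {"positive_share": [0.6, 0.9, 0.8], "title_level": [1, 3, 2]}
volumes = [120, 480, 300]

normalized = {name: min_max_normalize(column) for name, column in raw.items()}
doctors = [{name: normalized[name][k] for name in raw} for k in range(len(volumes))]
scores = min_max_normalize(volumes)

# Candidate Moebius coefficients (sum to 1, satisfy monotonicity).
phi = {frozenset({"positive_share"}): 0.3,
       frozenset({"title_level"}): 0.4,
       frozenset({"positive_share", "title_level"}): 0.3}
Z = objective(phi, doctors, scores)
print(Z)
```

A quadratic programming solver would search over the coefficients in `phi`, subject to (4) and (5), to make this value as small as possible.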

4.5 Personal preference representation through Fuzzy-BWM

Besides referring to the preferences of others, it is also necessary to consider the user’s personal preference in the final decision stage. In this step, the Fuzzy-BWM method is adopted to help users express their own preferences. The BWM model was introduced in [51]. Later, Guo and Zhao [52] extended it to the fuzzy setting and proposed the Fuzzy-BWM method. Compared with earlier similar methods, BWM offers both computational efficiency and the ability to check consistency. The main steps of the Fuzzy-BWM method are as follows:

Step 1

The user determines the most important and least important attribute in the attribute set X.

Step 2

Compare the other attributes with the most important and the least important attribute, using the linguistic variables listed in the first column of Table 2 to express the comparison results. The best-to-others vector is \( {\tilde A}_{B} = ({\tilde a}_{B1} ,{\tilde a}_{B2} , \ldots ,{\tilde a}_{Bn} )\), and the others-to-worst vector is \( {\tilde A}_{W} = ({\tilde a}_{1W} ,{\tilde a}_{2W} , \ldots ,{\tilde a}_{nW} )\). Here \( {\tilde a}_{Bj} \) represents the fuzzy preference of the best attribute \(X_{B}\) over the attribute \(X_{j}\), and \( {\tilde a}_{jW} \) represents the fuzzy preference of the attribute \(X_{j}\) over the worst attribute \(X_{W}\). In particular, \({\tilde a}_{BB} = (1,1,1)\) and \({\tilde a}_{WW} = (1,1,1)\).

Table 2 The linguistic term set

Step 3

Transform the linguistic variables to fuzzy ratings represented by triangular fuzzy number (TFN), and the rules of transformation are listed in Table 2.

Step 4

Use the optimization model to obtain the optimal weights. Consider \({\tilde w}_{j}\), \({\tilde w}_{W}\) and \({\tilde w}_{B}\) as TFNs, and let \({\tilde w}_{j} = (l_{j}^{w} ,m_{j}^{w} ,u_{j}^{w} )\) denote the fuzzy weight of attribute j. The optimization problem (20) can then be constructed.

$$ \begin{array}{c} \min \underset{j}{\max} \left\{ \left| \frac{{\tilde w}_{B}}{{\tilde w}_{j}} - {\tilde a}_{Bj} \right| ,\ \left| \frac{{\tilde w}_{j}}{{\tilde w}_{W}} - {\tilde a}_{jW} \right| \right\} \\ s.t.\left\{ \begin{array}{l} \sum\limits_{j = 1}^{n} R({\tilde w}_{j} )= 1\\ l_{j}^{w} \le m_{j}^{w} \le u_{j}^{w} \\ l_{j}^{w} \ge 0\\ j = 1,2,...,n \end{array} \right. \end{array}\ $$
(20)

It can be transformed into the equivalent form (21).

$$ \begin{array}{c} \min \tilde \xi \\ s.t.\left\{ \begin{array}{l} \left| \frac{{\tilde w}_{B}}{{\tilde w}_{j}} - {\tilde a}_{Bj} \right| \le \tilde \xi \\ \left| \frac{{\tilde w}_{j}}{{\tilde w}_{W}} - {\tilde a}_{jW} \right| \le \tilde \xi \\ \sum\limits_{j = 1}^{n} R({\tilde w}_{j} ) = 1\\ l_{j}^{w} \le m_{j}^{w} \le u_{j}^{w} \\ l_{j}^{w} \ge 0\\ j = 1,2,...,n \end{array} \right. \end{array}\ $$
(21)
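As a reading aid, the fuzzy constraints in (21) can be written out component-wise. The sketch below assumes, in line with the form of model (27) in Section 5.2, that the ratio of two TFNs is taken as \((l_{B}^{w}/u_{j}^{w},\ m_{B}^{w}/m_{j}^{w},\ u_{B}^{w}/l_{j}^{w})\), that \(\tilde \xi = (k,k,k)\), and that all weight components are strictly positive. The notation \({\tilde a}_{Bj} = (l_{Bj}, m_{Bj}, u_{Bj})\) is introduced here for illustration only.

$$ \begin{array}{l} \left| \frac{l_{B}^{w}}{u_{j}^{w}} - l_{Bj} \right| \le k \iff -k\, u_{j}^{w} \le l_{B}^{w} - l_{Bj}\, u_{j}^{w} \le k\, u_{j}^{w} \\ \left| \frac{m_{B}^{w}}{m_{j}^{w}} - m_{Bj} \right| \le k \iff -k\, m_{j}^{w} \le m_{B}^{w} - m_{Bj}\, m_{j}^{w} \le k\, m_{j}^{w} \\ \left| \frac{u_{B}^{w}}{l_{j}^{w}} - u_{Bj} \right| \le k \iff -k\, l_{j}^{w} \le u_{B}^{w} - u_{Bj}\, l_{j}^{w} \le k\, l_{j}^{w} \end{array} $$

The constraints for \({\tilde a}_{jW}\) follow the same pattern with \(({\tilde w}_{j}, {\tilde w}_{W})\) in place of \(({\tilde w}_{B}, {\tilde w}_{j})\). For any fixed k, all of these inequalities are linear in the weight components.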

Solving the above optimization problem yields the optimal fuzzy weights \(({\tilde w}_{1}^ * ,{\tilde w}_{2}^ * ,\ ...\ ,{\tilde w}_{n}^ * )\). Next, the graded mean integration representation (GMIR) (22) is used to transform the fuzzy weight of each attribute into a crisp weight. Let Wp = {w1, w2, ..., wm+2n} denote the resulting personal attribute weights of the user.

$$ R({\tilde a}_{i} ) = \frac{l_{i} + 4 m_{i} + u_{i}}{6}\ $$
(22)
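The GMIR step in (22) can be sketched in a few lines of Python. The function names and the example fuzzy weights below are hypothetical and serve only to illustrate the formula; they are not taken from the case study.

```python
from typing import List, Sequence, Tuple

TFN = Tuple[float, float, float]  # (lower, modal, upper)


def gmir(tfn: TFN) -> float:
    """Graded mean integration representation of a TFN, as in Eq. (22)."""
    l, m, u = tfn
    if not l <= m <= u:
        raise ValueError("a TFN must satisfy l <= m <= u")
    return (l + 4.0 * m + u) / 6.0


def crisp_weights(fuzzy_weights: Sequence[TFN]) -> List[float]:
    """Apply GMIR to every fuzzy attribute weight."""
    return [gmir(w) for w in fuzzy_weights]


# Hypothetical fuzzy weights for three attributes (illustration only).
example = [(0.10, 0.15, 0.20), (0.25, 0.30, 0.35), (0.50, 0.55, 0.60)]
print(crisp_weights(example))
```

Because the modal value carries four of the six parts of the weight, a symmetric TFN is mapped to its modal value.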

Step 5

Calculate the consistency ratio (CR) to check the consistency of the pairwise comparisons. According to [52], the following equation is used to calculate the consistency index.

$$ \xi^{2} - (1 + 2 u_{BW}) \xi + (u_{BW}^{2} - u_{BW}) = 0\ $$
(23)

where uBW is the upper bound of \({\tilde a}_{BW} \). Solving (23) for its larger root gives the consistency index ξ. The CR can then be calculated as:

$$ \text{consistency ratio} = \frac{\xi^{*}}{\text{consistency index}}\ $$
(24)

where CR ∈ [0,1]. The lower the CR, the more consistent the user’s comparisons are.
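The consistency check in (23) and (24) can be sketched as follows. The sketch assumes that every u term in (23) is the upper bound uBW and that the consistency index is the larger root of the quadratic; with uBW = 4.5 this choice reproduces the value 8.04 reported in Section 5.2. The function names are hypothetical.

```python
import math


def consistency_index(u_bw: float) -> float:
    """Larger root of xi^2 - (1 + 2*u_bw)*xi + (u_bw^2 - u_bw) = 0, Eq. (23)."""
    b = 1.0 + 2.0 * u_bw
    c = u_bw * u_bw - u_bw
    discriminant = b * b - 4.0 * c  # simplifies to 1 + 8 * u_bw, always positive
    return (b + math.sqrt(discriminant)) / 2.0


def consistency_ratio(xi_star: float, u_bw: float) -> float:
    """Eq. (24): optimal objective value divided by the consistency index."""
    return xi_star / consistency_index(u_bw)


# Values from the case study: optimal k = 0.814 and u_BW = 4.5.
ci = consistency_index(4.5)
cr = consistency_ratio(0.814, 4.5)
print(f"CI = {ci:.2f}, CR = {cr:.3f}")  # prints: CI = 8.04, CR = 0.101
```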

4.6 Integrating public preference and personal preference

In this part, the public preference φb acquired in Section 4.4 is integrated with the personal preference Wp acquired in Section 4.5. The public preference is represented by a 2-additive fuzzy measure, whereas the personal preference is represented by an ordinary weight vector. To fuse these two preference structures, an optimization model is built to transform the weight vector Wp into a 2-additive fuzzy measure φp. The Shapley importance index reflects the overall importance of an attribute and has the same form as an ordinary weight vector. Thus, the first constraint is that the Shapley importance index of φp should equal the weight in Wp for each attribute. The second constraint is the monotonicity constraint of the 2-additive fuzzy measure. Because the mapping between Shapley importance indices and 2-additive fuzzy measures is not one-to-one, the new fuzzy measure φp is required to be as close as possible to φb so that a single solution is obtained. For attribute sets with two elements, φp is expected to differ from φb only on the sets whose weight is nonzero in φb. Hence, the objective function is divided into two terms. The first term minimizes the squared distance between φp and φb over the attribute sets that have one element or whose value in φb is nonzero. The second term minimizes the squared weight that φp assigns to the two-element attribute sets whose value in φb is zero. The optimization problem can be represented as follows:

$$ \begin{array}{c} \min \sum\limits_{A \in Set1} {(\varphi_{b}(A) - \varphi_{p}(A))}^{2} + 2 \sum\limits_{A \in Set2} {\varphi_{p}(A)}^{2} \\ s.t.\left\{ \begin{array}{l} \sum\limits_{A \subseteq X \backslash \{x_{i}\} } {\frac{1}{\vert A\vert + 1}}\varphi_{p}(A \cup \{ x_{i} \} ) = w_{i} \quad \text{for all } x_{i} \in X\\ \sum\limits_{A \subseteq X} {\varphi_{p}(A)} = 1\\ \sum\limits_{B \subseteq A,\ x_{i} \in B} \varphi_{p}(B) \ge 0 \quad \text{for all } A \subseteq X,\ \text{for all } x_{i} \in A \end{array} \right. \end{array} $$
(25)

where Set1 = {A | A ⊆ X and (|A| = 1 or |φb(A)| > 0)} and Set2 = {A | A ⊆ X, |A| = 2 and φb(A) = 0}. The fuzzy measure φp contains the set weights that need to be determined. Solving problem (25) yields the user’s personal preference in 2-additive fuzzy measure form. Then (26) is used to combine it with the public preference φb.

$$ \varphi(A) = (1 - \gamma ) \varphi_{b}(A) + \gamma \varphi_{p}(A) \quad \forall A \subseteq X\ $$
(26)

Here γ (0 ≤ γ ≤ 1) is a parameter specified by the user to set the relative importance of the two kinds of weights. Patients with chronic diseases or long-term conditions may be familiar with which kind of doctor suits them, since they have long experience of interacting with doctors. According to (26), they can set γ to a high value and rank doctors based more on their own preference. People who seldom get sick may not have a clear, firm idea about which kind of doctor is appropriate for them and their condition, or about which doctor attributes are more important and which are less important. They can set γ to a low value and rank doctors based more on others’ selection experience. Because the monotonicity constraints are linear in the measure and φ is a convex combination of φb and φp, φ still satisfies the monotonicity constraint. Based on the final fuzzy measure φ, the Choquet integral can be calculated using (11) as the overall evaluation score. The user then obtains a ranking of the doctor alternatives by sorting them by this score.
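The combination in (26) and the subsequent scoring can be sketched as below. The sketch assumes that both measures are stored as Möbius coefficients on singletons and pairs, and that the score is the standard 2-additive Choquet integral in Möbius form, which is taken here to be what (11) evaluates. All names, coefficients and doctor scores are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Sequence

Mobius = Dict[FrozenSet[int], float]  # coefficients on singletons and pairs only


def combine(phi_b: Mobius, phi_p: Mobius, gamma: float) -> Mobius:
    """Convex combination of public and personal measures, as in Eq. (26)."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    keys = set(phi_b) | set(phi_p)
    return {a: (1.0 - gamma) * phi_b.get(a, 0.0) + gamma * phi_p.get(a, 0.0)
            for a in keys}


def is_monotone(phi: Mobius, n: int, tol: float = 1e-9) -> bool:
    """Check m(i) + sum_{j in A} m(i, j) >= 0 for every subset A containing i.

    The binding subset for attribute i collects exactly the negative pair terms.
    """
    for i in range(n):
        worst = phi.get(frozenset({i}), 0.0)
        for j in range(n):
            if j != i:
                worst += min(0.0, phi.get(frozenset({i, j}), 0.0))
        if worst < -tol:
            return False
    return True


def choquet(phi: Mobius, x: Sequence[float]) -> float:
    """2-additive Choquet integral: sum of m(A) * min of x over A."""
    total = sum(phi.get(frozenset({i}), 0.0) * xi for i, xi in enumerate(x))
    for i, j in combinations(range(len(x)), 2):
        total += phi.get(frozenset({i, j}), 0.0) * min(x[i], x[j])
    return total


# Hypothetical measures over three attributes; each sums to one.
phi_b = {frozenset({0}): 0.3, frozenset({1}): 0.3, frozenset({2}): 0.2,
         frozenset({0, 1}): 0.2}
phi_p = {frozenset({0}): 0.5, frozenset({1}): 0.1, frozenset({2}): 0.3,
         frozenset({0, 1}): 0.2, frozenset({0, 2}): -0.1}
doctors = {"d1": [0.9, 0.4, 0.6], "d2": [0.5, 0.8, 0.7]}

for gamma in (0.0, 0.5, 1.0):
    phi = combine(phi_b, phi_p, gamma)
    scores = {name: choquet(phi, x) for name, x in doctors.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(gamma, is_monotone(phi, 3), ranking)
```

With γ = 0 the score uses only the public measure and with γ = 1 only the personal one, so sweeping γ shows how a ranking responds to the user's confidence in his/her own preference.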

5 Case study: doctor selection in Dxy.com

5.1 Case description

This section presents a numerical case study to illustrate the proposed method. Dxy.com is one of the relatively large platforms in China providing online medical services to the public. Its services cover most of the departments found in a large offline hospital. Its doctors mainly come from offline tertiary hospitals, and a few are employed by the platform itself. Figure 3 shows two representation pages of a doctor in the Dxy app. The left one includes the doctor’s name, title, hospital, price, skills, etc., while the right one is the patient review page.

Fig. 3 Screenshots of the doctor representation pages in the Dxy app

According to [53], dermatology, ophthalmology-otorhinolaryngology and gynecology obstetrics pediatrics are the departments most likely to be consulted across all platforms. Based on the information provided by the website, four departments (pediatrics, gynaecology, neurology and gastroenterology) are selected for the experiment. A web crawler is used to collect the information of all doctors in these departments, including their objective attributes and patient reviews. In preliminary screening, doctors whose monthly reply count is lower than 8 are removed, since these doctors are busy with offline work and have little time to respond to online service requests. Finally, there are 271, 87, 123 and 226 doctors in pediatrics, neurology, gynaecology and gastroenterology, respectively.

5.2 Decision process

Step 1

Sentiment analysis

According to the method described in Section 4.3, the topics in the reviews first need to be determined. Based on the relevant literature [29] and validated during the annotation of the training dataset, the evaluation information is categorized into four topics. Each topic is briefly introduced below. Table 3 shows some review examples for each topic.

  1. (1)

    The first topic is about the doctor’s medical ability. It mainly involves the patients’ descriptions of diagnostic accuracy and of their health status after following the doctor’s advice or prescription.

  2. (2)

    The second topic is about the quality of question and answer (Q&A) between patients and doctors online. Since the main forms of interaction on this platform are non-instant text and voice messages, the quality of information encoding and decoding is very important. Patients often comment on whether the suggestions given by doctors are detailed, clear and easy to understand.

  3. (3)

    The third topic is about the doctor’s service attitude. A typical negative example is using templates to answer users’ questions. In some cases, users complain that doctors answer their questions selectively and are unwilling to answer all of them. When patients communicate with doctors by voice, they can feel the doctor’s patience and kindness. As one reviewer put it, a good doctor should be understanding, know what patients are worried about and reduce their anxiety.

  4. (4)

    The last topic is the comprehensive evaluation of the service. When manually annotating review samples for training the classification model, we also find, similar to [19], that some users tend to evaluate the service as a whole, while others tend to assess the doctor on one or two specific aspects. Typical expressions are “very satisfied”, “very nice” and “very recommended”. In the attribute calculation process, a review assigned this label is treated as mentioning all three of the above topics.

Table 3 Expression examples in four topics

To train the two classification models, 2000 reviews are manually annotated as the training dataset and 271 reviews as the validation set. The evaluation metrics of the two-stage classification are shown in Table 4. We choose the F1-score and accuracy to reflect the performance of the models in the first stage and the second stage, respectively. The first column in Table 4 gives the number of examples of each topic in the training dataset. Figure 4 shows the average topic distribution in the four departments, calculated from the output of Model1. Across these four departments, the quality of Q&A is the most frequently mentioned topic. This is most likely because, from the patient’s perspective, the quality of Q&A is the most easily perceived aspect of the interaction with doctors. Medical ability is mentioned slightly more often in neurology than in the other three departments. This is consistent with everyday medical experience: most diseases in this department, such as insomnia, convulsions and epilepsy, are chronic and hard to cure, so patients seeking professional help tend to focus more on the doctor’s medical ability. We also find that in most cases a doctor is rated as good in all three aspects, and few doctors are rated as performing well in only one or two aspects.

Table 4 Evaluation of two stage classification result
Fig. 4 Topic proportion in four departments

Step 2

Identification of public preference with 2-additive fuzzy measure

In this step, a decision matrix is built to represent the patients’ decision situation, and the 2-additive fuzzy measure is identified from it. The following eight indexes are selected as the attributes, together with a ninth index that serves as the result variable:

  • Review number, the total number of patient reviews of the doctor.

  • Consultation price, the price the user has to pay to consult the doctor online.

  • Weekly praise rate. On the doctor’s presentation page, users can see the labels the website has assigned to the doctor. Each doctor has several labels, such as “National top 100 hospitals”, “Professional excellence”, “Doctor Degree of medicine”, “Weekly praise rate top 10” and “charity ambassador”. Among these labels, the weekly praise rate (top 10, top 3, top 2, top 1) is found to be highly correlated with the monthly consultation volume, so a 0-1 variable is built to indicate whether the doctor has this label.

  • Average (Avg) reply time, the average waiting time from the user’s first question to the doctor’s first response.

  • Job title, including chief doctors, associate chief doctors, attending doctors and resident doctors. In the data preprocessing stage, these four titles are mapped to the numerical scale 1, 0.75, 0.5 and 0.25, respectively.

  • Medical ability, obtained by adding two parts together. The first part comes from the labels on the doctor’s home page: each of the labels “Professional excellence” and “experts in this field” that the doctor has been given adds 0.25 to this index. The second part is the topic ratio of medical ability in the reviews.

  • Service attitude, corresponding to one of the topics in the sentiment analysis. It mainly involves the working attitude, attentiveness, etc. perceived by patients.

  • Quality of Q&A, also obtained by summing two parts. As with medical ability, the first part comes from the labels on the doctor’s home page: the label “patient and careful” adds 0.25 to this index. The second part is the topic ratio of quality of Q&A in the reviews.

  • Monthly consultation volume, the result variable. It is the aggregated result of different users’ decisions and is taken as the evaluation score of each doctor.
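The attribute encoding rules in the list above can be sketched as follows. The record layout, the dictionary keys and the function name are hypothetical, and taking service attitude directly as its topic ratio is an assumption, since the text only states that it corresponds to a review topic.

```python
from typing import Dict, List

# Scale given in the text for the four job titles.
JOB_TITLE_SCALE = {
    "chief doctor": 1.0,
    "associate chief doctor": 0.75,
    "attending doctor": 0.5,
    "resident doctor": 0.25,
}
MEDICAL_LABELS = ("Professional excellence", "experts in this field")
QA_LABELS = ("patient and careful",)
LABEL_BONUS = 0.25  # score added per matching home-page label


def encode_doctor(title: str, labels: List[str],
                  topic_ratio: Dict[str, float]) -> Dict[str, float]:
    """Turn raw page information into numerical attribute values."""
    has_praise = any(lab.startswith("Weekly praise rate top") for lab in labels)
    medical_bonus = LABEL_BONUS * sum(lab in labels for lab in MEDICAL_LABELS)
    qa_bonus = LABEL_BONUS * sum(lab in labels for lab in QA_LABELS)
    return {
        "job title": JOB_TITLE_SCALE[title],
        "weekly praise rate": 1.0 if has_praise else 0.0,
        "medical ability": medical_bonus + topic_ratio["medical ability"],
        # Assumption: service attitude is the topic ratio itself.
        "service attitude": topic_ratio["service attitude"],
        "quality of Q&A": qa_bonus + topic_ratio["quality of Q&A"],
    }


# Hypothetical doctor record (illustration only).
sample = encode_doctor(
    "attending doctor",
    ["Professional excellence", "Weekly praise rate top 3"],
    {"medical ability": 0.30, "service attitude": 0.25, "quality of Q&A": 0.45},
)
print(sample)
```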

Medical ability, service attitude and quality of Q&A need to be calculated through the sentiment analysis. The other six indicators can be obtained directly from the representation page. The previously mentioned proportion of positive reviews in each topic, obtained from the sentiment results, is excluded, because in each department the proportion of negative reviews is lower than 1% for more than 90% of the doctors. The influence of such a small proportion of negative reviews on users’ decisions is negligible. In practice, users browsing a doctor’s review page will rarely find a negative review.

Table 5 shows descriptive statistics of the nine variables in pediatrics. In all four departments, the average reply time, monthly reply and review number have long-tailed, J-shaped distributions, while the consultation price, medical ability, service attitude and quality of Q&A are approximately normally distributed. To improve the model’s fitting ability, a log transformation is applied to the monthly reply and the average reply time. Min-max normalization is then applied to each variable. The price and the average reply time are cost variables, meaning that a smaller value is better than a larger one, so these two variables are normalized in the opposite direction. After this preprocessing, a complete decision matrix is constructed. Table 6 shows part of the normalized data in pediatrics.

Table 5 The statistical description of the nine variables in pediatrics before normalization
Table 6 The normalized performance of ten doctors in pediatrics
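The preprocessing described above can be sketched as follows. The sketch assumes a natural logarithm for the log transformation and maps a constant column to zero; neither detail is specified in the text. The function names and sample values are hypothetical.

```python
import math
from typing import List, Sequence


def log_transform(values: Sequence[float]) -> List[float]:
    """Natural log of strictly positive values (assumed form of the transform)."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform requires strictly positive values")
    return [math.log(v) for v in values]


def min_max(values: Sequence[float], cost: bool = False) -> List[float]:
    """Min-max normalization to [0, 1]; cost variables are reversed."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: constant column maps to 0
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if cost else scaled


# Hypothetical raw columns for three doctors (illustration only).
price = [10.0, 20.0, 30.0]        # cost variable: lower is better
reply_hours = [1.0, 10.0, 100.0]  # long-tailed cost variable

print(min_max(price, cost=True))
print(min_max(log_transform(reply_hours), cost=True))
```

After the reversal, the cheapest and the fastest doctor both receive the value 1, so every column of the decision matrix can be read as "larger is better".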

Next, according to Section 4.4, an optimization model is constructed to determine the public preference. One optimization model is set up for each department, and the Gurobi solver is used to solve it. Table 7 shows the resulting fuzzy measure in Möbius transform form. A 2-additive fuzzy measure with 8 attributes has 36 variables in total (8 singletons plus 28 pairs). To make the result easier to interpret, the Möbius form is transformed into the Shapley index, shown in Table 8. The Shapley index gives the overall importance of each attribute after the interactions between attributes are taken into account. The last two rows in Table 8 show how well the Choquet integral fits the overall evaluation scores assigned to the doctors. The second-to-last row is the sum of the evaluation scores of all doctors in each department, and the last row is the sum of squared prediction errors in each department.

Table 7 The identified public preference in four departments
Table 8 The Shapley index for eight attributes in four departments
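The variable count and the Möbius-to-Shapley conversion mentioned above can be sketched as follows. For a 2-additive measure the Shapley value of an attribute is its singleton coefficient plus half of the pair coefficients that involve it. The function names and the three-attribute measure are hypothetical and are not the values in Tables 7 and 8.

```python
from math import comb
from typing import Dict, FrozenSet, List

Mobius = Dict[FrozenSet[int], float]  # coefficients on singletons and pairs only


def n_variables(n: int) -> int:
    """Number of Möbius coefficients of a 2-additive measure on n attributes."""
    return n + comb(n, 2)


def shapley(phi: Mobius, n: int) -> List[float]:
    """Shapley index: m(i) + 0.5 * sum over j != i of m(i, j)."""
    values = []
    for i in range(n):
        pair_sum = sum(phi.get(frozenset({i, j}), 0.0)
                       for j in range(n) if j != i)
        values.append(phi.get(frozenset({i}), 0.0) + 0.5 * pair_sum)
    return values


# Hypothetical Möbius coefficients that sum to one (illustration only).
phi = {frozenset({0}): 0.3, frozenset({1}): 0.3, frozenset({2}): 0.2,
       frozenset({0, 1}): 0.4, frozenset({1, 2}): -0.2}

print(n_variables(8))  # prints: 36
print(shapley(phi, 3))
```

Since each pair coefficient is split equally between its two attributes, the Shapley values sum to one whenever the Möbius coefficients do.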

Step 3

Obtaining the individual preference

In this step, the patient uses the linguistic terms listed in Table 2 to express his/her personal preference over the attributes. Suppose a patient wants to select a doctor from Table 6 and gives his/her two pairwise comparison vectors as in Table 9.

Table 9 User’s two comparison vectors over eight attributes

According to (21), the optimization problem (27) can be constructed. Solving (27) with the commercial solver Gurobi yields the attribute weights represented by triangular fuzzy numbers. The minimum value of k is 0.814. Next, (22) is used to transform the triangular fuzzy numbers into crisp numbers; the result is shown in Table 10. Finally, (23) and (24) are used to calculate the consistency ratio. The result is 0.814/8.04 = 0.101, which indicates high consistency.

Table 10 Personal preference represented in triangular fuzzy number and crisp number form
$$ \begin{array}{c}\min k \\ s.t. \left\{ \begin{array}{c} {l_{5}-3.5*u_{1}\le k*u_{1},\ l_{5}-3.5*u_{1}\ge -k*u_{1}}\\[0.5ex] {m_{5}-4*m_{1}\le k*m_{1},\ m_{5}-4*m_{1}\ge -k*m_{1}}\\[0.5ex] {u_{5}-4.5*l_{1}\le k*l_{1},\ u_{5}-4.5*l_{1}\ge -k*l_{1}}\\[0.5ex] {l_{5}-1*u_{2}\le k*u_{2},\ l_{5}-1*u_{2}\ge -k*u_{2}}\\[0.5ex] {m_{5}-1*m_{2}\le k*m_{2},\ m_{5}-1*m_{2}\ge -k*m_{2}}\\[0.5ex] {u_{5}-1*l_{2}\le k*l_{2},\ u_{5}-1*l_{2}\ge -k*l_{2}}\\[0.5ex] {l_{5}-2/3*u_{3}\le k*u_{3},\ l_{5}-2/3*u_{3}\ge -k*u_{3}}\\[0.5ex] {m_{5}-1*m_{3}\le k*m_{3},\ m_{5}-1*m_{3}\ge -k*m_{3}}\\[0.5ex] {u_{5}-1.5*l_{3}\le k*l_{3},\ u_{5}-1.5*l_{3}\ge -k*l_{3}}\\[0.5ex] {l_{5}-2/3*u_{4}\le k*u_{4},\ l_{5}-2/3*u_{4}\ge -k*u_{4}}\\[0.5ex] {m_{5}-1*m_{4}\le k*m_{4},\ m_{5}-1*m_{4}\ge -k*m_{4}}\\[0.5ex] {u_{5}-1.5*l_{4}\le k*l_{4},\ u_{5}-1.5*l_{4}\ge -k*l_{4}}\\[0.5ex] {l_{5}-1*u_{6}\le k*u_{6},\ l_{5}-1*u_{6}\ge -k*u_{6}}\\[0.5ex] {m_{5}-1*m_{6}\le k*m_{6},\ m_{5}-1*m_{6}\ge -k*m_{6}}\\[0.5ex] {u_{5}-1*l_{6}\le k*l_{6},\ u_{5}-1*l_{6}\ge -k*l_{6}}\\[0.5ex] {l_{5}-2/3*u_{7}\le k*u_{7},\ l_{5}-2/3*u_{7}\ge -k*u_{7}}\\[0.5ex] {m_{5}-1*m_{7}\le k*m_{7},\ m_{5}-1*m_{7}\ge -k*m_{7}}\\[0.5ex] {u_{5}-1.5*l_{7}\le k*l_{7},\ u_{5}-1.5*l_{7}\ge -k*l_{7}}\\[0.5ex] {l_{5}-2.5*u_{8}\le k*u_{8},\ l_{5}-2.5*u_{8}\ge -k*u_{8}}\\[0.5ex] {m_{5}-3*m_{8}\le k*m_{8},\ m_{5}-3*m_{8}\ge -k*m_{8}}\\[0.5ex] {u_{5}-3.5*l_{8}\le k*l_{8},\ u_{5}-3.5*l_{8}\ge -k*l_{8}}\\[0.5ex] {l_{2}-3.5*u_{1}\le k*u_{1},\ l_{2}-3.5*u_{1}\ge -k*u_{1}}\\[0.5ex] {m_{2}-4*m_{1}\le k*m_{1},\ m_{2}-4*m_{1}\ge -k*m_{1}}\\[0.5ex] {u_{2}-4.5*l_{1}\le k*l_{1},\ u_{2}-4.5*l_{1}\ge -k*l_{1}}\\[0.5ex] {l_{3}-1.5*u_{1}\le k*u_{1},\ l_{3}-1.5*u_{1}\ge -k*u_{1}}\\[0.5ex] {m_{3}-2*m_{1}\le k*m_{1},\ m_{3}-2*m_{1}\ge -k*m_{1}}\\[0.5ex] {u_{3}-2.5*l_{1}\le k*l_{1},\ u_{3}-2.5*l_{1}\ge -k*l_{1}}\\[0.5ex] {l_{4}-1.5*u_{1}\le k*u_{1},\ l_{4}-1.5*u_{1}\ge -k*u_{1}}\\[0.5ex] {m_{4}-2*m_{1}\le k*m_{1},\ m_{4}-2*m_{1}\ge -k*m_{1}}\\[0.5ex] {u_{4}-2.5*l_{1}\le 
k*l_{1},\ u_{4}-2.5*l_{1}\ge -k*l_{1}}\\[0.5ex] {l_{6}-2/3*u_{1}\le k*u_{1},\ l_{6}-2/3*u_{1}\ge -k*u_{1}}\\[0.5ex] {m_{6}-1*m_{1}\le k*m_{1},\ m_{6}-1*m_{1}\ge -k*m_{1}}\\[0.5ex] {u_{6}-1.5*l_{1}\le k*l_{1},\ u_{6}-1.5*l_{1}\ge -k*l_{1}}\\[0.5ex] {l_{7}-2.5*u_{1}\le k*u_{1},\ l_{7}-2.5*u_{1}\ge -k*u_{1}}\\[0.5ex] {m_{7}-3*m_{1}\le k*m_{1},\ m_{7}-3*m_{1}\ge -k*m_{1}}\\[0.5ex] {u_{7}-3.5*l_{1}\le k*l_{1},\ u_{7}-3.5*l_{1}\ge -k*l_{1}}\\[0.5ex] {l_{8}-2/3*u_{1}\le k*u_{1},\ l_{8}-2/3*u_{1}\ge -k*u_{1}}\\[0.5ex] {m_{8}-1*m_{1}\le k*m_{1},\ m_{8}-1*m_{1}\ge -k*m_{1}}\\[0.5ex] {u_{8}-1.5*l_{1}\le k*l_{1},\ u_{8}-1.5*l_{1}\ge -k*l_{1}}\\[0.5ex] 1/6*l_{1}+4/6*m_{1}+1/6\ *u_{1}+1/6*l_{2}+4/6*m_{2}+\\[0.5ex] 1/6*u_{2}+1/6*l_{3}+ 4/6\ *m_{3}+1/6*u_{3}+1/6*l_{4}+\\[0.5ex] 4/6*m_{4}+1/6*u_{4}+1/6\ *l_{5}+4/6*m_{5}+ 1/6*u_{5}+\\[0.5ex] 1/6*l_{6}+4/6*m_{6}+1/6\ *u_{6}+1/6*l_{7}+4/6*m_{7}\\[0.5ex] +1/6*u_{7}+ 1/6*l_{8}+\ 4/6*m_{8}+1/6*u_{8} = 1\\[0.5ex] {l_{1}\le m_{1}\le u_{1}}\\[0.5ex] {l_{2}\le m_{2}\le u_{2}}\\[0.5ex] {l_{3}\le m_{3}\le u_{3}}\\[0.5ex] {l_{4}\le m_{4}\le u_{4}}\\[0.5ex] {l_{5}\le m_{5}\le u_{5}}\\[0.5ex] {l_{6}\le m_{6}\le u_{6}}\\[0.5ex] {l_{7}\le m_{7}\le u_{7}}\\[0.5ex] {l_{8}\le m_{8}\le u_{8}}\\[0.5ex] {l_{1}\ge 0,l_{2}\ge 0,l_{3}\ge 0,l_{4}\ge 0,l_{5}\ge 0,l_{6}\ge 0,l_{7}\ge 0,l_{8}\ge 0} \end{array} \right. \end{array} $$
(27)

Step 4

Integrating preference and ranking

In this step, the attribute weights obtained in Step 2 and Step 3 are integrated. The user has to specify the parameter γ representing the relative importance of these two parts. For comparison, γ is set to four values: 0.25, 0.5, 0.75 and 1. Solving the optimization model (25) with Gurobi yields the transformed fuzzy measure. The combined fuzzy measure is then calculated using (26). The result is shown in Table 11.

Table 11 Integrated attribute preference with different γ

The comprehensive score of each doctor in Table 6 is calculated through the Choquet integral (11) with the 2-additive fuzzy measures in Table 11. The result is shown in Table 12. Figure 5 shows how these doctors’ ranking positions vary as the attribute weights gradually change from depending only on the public preference to depending only on the personal preference.

Table 12 The final evaluation score of ten doctors in Table 6
Fig. 5 Doctor ranking with different γ

5.3 Result analysis

Table 7 shows that, based on the decision matrix, the optimization model does identify some nonzero attribute interaction index values. A typical positive pair is price and weekly praise rate: only when these two attributes take high values simultaneously do they contribute to a high overall evaluation score. A typical negative pair is review number and weekly praise rate. Negative interaction means that high performance on both attributes increases the overall evaluation score little beyond what high performance on only one of them achieves. This suggests that in the practical selection process, patients are likely to take redundant and complementary relationships between attributes into account. In Table 8, the Shapley values of these attributes reflect their importance in a more intuitive way. Across the four departments, weekly praise rate, review number and average reply time are always the three most important attributes. The weights of the attributes calculated from the reviews are relatively small. A possible reason is that, from the user’s perspective, it is hard to see a remarkable difference between doctors’ online review texts. Owing to the careful management and refined operation of the platform, doctors with obviously negative or abnormal reviews are required to make adjustments. Over time, the sentiment distributions of patient reviews become similar across doctors.

In Fig. 5, the ranking order of the doctors changes slightly with the value of parameter γ, but in general the rankings are stable without drastic changes. This suggests that the parameter γ can adjust the rankings of doctors according to the patient’s confidence in his/her own preference.

All the information used in the decision process is publicly accessible and raises no privacy concerns, which makes the method practical.

5.4 Computational complexity

  1. (1)

    The proposed two-stage BERT-based opinion extraction method is relatively robust and highly accurate. The BERT encoder serves as a strong baseline for many tasks in natural language processing. After training, inference is efficient and does not require much time or memory.

  2. (2)

    Moreover, for n attributes, BWM needs only about 2n pairwise comparisons (2n − 3, to be exact), which is fewer than the roughly n² comparisons of AHP (n(n − 1)/2, to be exact). In other words, compared with the AHP method, the Fuzzy-BWM used in the proposed method requires fewer comparisons. Hence, it is more convenient for patients to express the importance they assign to the attributes.

  3. (3)

    The three optimization problems involved in the decision process are linear or quadratic programming problems. With an existing commercial solver, these problems converge quickly to their optimal solutions.
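The difference in elicitation effort between BWM and AHP can be made concrete with a short sketch. It uses the exact counts commonly cited for the two methods, 2n − 3 for BWM and n(n − 1)/2 for AHP, of which the 2n and n² figures are upper bounds. The function names are hypothetical.

```python
def bwm_comparisons(n: int) -> int:
    """Best-to-others plus others-to-worst, counting best-vs-worst once."""
    if n < 2:
        raise ValueError("at least two attributes are required")
    return 2 * n - 3


def ahp_comparisons(n: int) -> int:
    """One comparison for every unordered pair of attributes."""
    if n < 2:
        raise ValueError("at least two attributes are required")
    return n * (n - 1) // 2


for n in (4, 8, 12):
    print(n, bwm_comparisons(n), ahp_comparisons(n))
```

For the eight attributes of the case study this gives 13 comparisons for BWM against 28 for AHP, which matches the 13 groups of comparison constraints in model (27).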

6 Comparison analysis

To examine the influence of considering attribute interactions, a comparison analysis is conducted. The first comparison method replaces the Choquet integral with linear weighted aggregation, in which the attribute weights representing the public preference are obtained by linear least squares under the constraint that the weights sum to one. The personal weights are the same as in the case study, and parameter γ is set to 0.5. In previous studies, with privacy protection in mind, Hu et al. [54] combined intuitionistic fuzzy sets (IFSs) and VIsekriterijumsko KOmpromisno Rangiranje (VIKOR) to build a doctor ranking method. An IFS is a kind of set for representing fuzzy and uncertain evaluation values, and VIKOR is a MADM method characterized by providing a set of compromise solutions under conflicting criteria. Sun et al. [55] proposed two doctor selection methods based on single-valued neutrosophic sets (SVNSs) and prospect theory (PT). The extended TODIM (an acronym in Portuguese for Interactive and MADM) and the extended ELECTRE (elimination and choice translating reality) III, two further MADM methods, were used as the ranking functions of the two methods, respectively. To make these three methods comparable with the method in this paper, their fuzzy evaluation sets are replaced with crisp values, and their attribute weight allocation is replaced with the linear one used in the first comparison method. They are denoted as linear+VIKOR, linear+TODIM and linear+ELECTRE III, respectively, in Table 13. The additional parameters of these three methods are identical to the settings in the original papers. On this basis, using the data from the pediatrics department, the proposed method is compared with these four methods to verify the influence of attribute correlation and to illustrate the advantage of the proposed method. Table 13 shows the ranking results of these methods.

Table 13 Ranking results of different methods

As for the ability to fit the public preference, the final value of objective Z, calculated by (18) and model (19), is 8.299 for the proposed method, while that of the linear least squares-based method is 8.449. The lower value shows that the 2-additive fuzzy measure captures the public’s attribute preference better. Table 13 shows that, compared with the proposed method, the orders of d1 and d5 and of d4 and d10 are exchanged in both the linear and the linear+TODIM methods. This demonstrates that attribute interactions can affect the final rankings, so it is necessary to consider the interaction relationships in the decision process. The result of the linear+ELECTRE III method is quite different from the others, which can mainly be ascribed to its sensitivity to its three threshold parameters. The result of linear+VIKOR is the most similar to that of the proposed method, but the parameter setting in that method focuses on modeling behavioral preference under risk and uncertainty rather than on complicated attribute preference relationships.
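A small sketch can illustrate how an attribute interaction alone is enough to exchange two positions in a ranking. It compares a weighted sum that uses the Shapley values as weights with the 2-additive Choquet integral built from the same measure. The two attributes, the Möbius coefficients and the doctor scores are hypothetical and are unrelated to the data in Table 13.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical 2-additive measure on two attributes with positive interaction.
M1, M2, M12 = 0.2, 0.2, 0.6
# Shapley values of the same measure, used as weights for the linear model.
SHAPLEY = (M1 + M12 / 2.0, M2 + M12 / 2.0)


def weighted_sum(x: Sequence[float]) -> float:
    """Linear aggregation that ignores the interaction term."""
    return SHAPLEY[0] * x[0] + SHAPLEY[1] * x[1]


def choquet(x: Sequence[float]) -> float:
    """2-additive Choquet integral in Möbius form for two attributes."""
    return M1 * x[0] + M2 * x[1] + M12 * min(x[0], x[1])


def rank(doctors: Dict[str, Tuple[float, float]], score) -> List[str]:
    """Sort doctor names by descending score."""
    return sorted(doctors, key=lambda name: score(doctors[name]), reverse=True)


# Doctor A is unbalanced, doctor B is balanced (illustration only).
doctors = {"A": (0.9, 0.2), "B": (0.5, 0.5)}

linear_rank = rank(doctors, weighted_sum)
choquet_rank = rank(doctors, choquet)
print(linear_rank, choquet_rank)
```

Both models assign each attribute the same overall importance, yet the positive interaction rewards the balanced profile, so the Choquet integral places B ahead of A while the weighted sum does the opposite.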

In addition to the above, the proposed method has the following two advantages:

  1. (1)

    Firstly, the original weight allocation methods in similar studies [54, 55] require decision makers to give the importance of the attributes in advance or to provide pairwise weight comparisons, which is unrealistic in some situations. The proposed method obtains the attribute weights by learning from the public’s historical decision information, so it can handle problems in which decision makers have no information about attribute importance.

  2. (2)

    Secondly, our method can integrate the public weight and the personal weight. To do so, an optimization model (25) is built to transform the personal weight into the same form as the public weight. A parameter γ is used to balance these two kinds of weights, which enables the ranking method to provide different ranking results for different kinds of users. A higher γ means that the user has high demands and is confident in his/her own preference. A lower γ means that the user has little selection experience and the ranking needs to be based more on the public weight. Figure 5 shows that our ranking order is more flexible than the results of the previous methods in Table 13.

7 Conclusion

With the continuous development and improvement of online medical platforms and relevant standards, a growing number of people will turn to online medical and health information services. This study proposes a doctor ranking method based on the information available on online health platforms. The proposed method can help patients select appropriate doctors and increase the overall level of patient satisfaction. It can also help healthcare providers and platforms better understand patients’ needs and preferences. With proper adjustment, the method can also be extended to other related fields [56, 57].

The main contributions of this study are as follows:

  1. (1)

    This study proposes an online doctor selection method based on online text reviews that considers correlated attributes. Existing online review-based MADM methods assume that attributes are independent of each other, which hardly matches the reality of online doctor selection. On the other hand, traditional MADM methods that consider the correlation of attributes measure the correlation subjectively rather than from actual data. This paper addresses a MADM method that derives attribute correlations from real data, and it can therefore be regarded as a new type of decision-making problem.

  2. (2)

    It proposes a MADM method based on both public preference and personal preference. Existing studies ignore personal preference, which is as important as the public preference derived from online text reviews. To resolve this problem, the public preference is identified through an optimization model using the doctors’ consultation volume information. Then, a novel optimization model is proposed to combine the public preference and the personal preference. Thus, the two kinds of preference, the patient’s personal preference and the public’s preference, jointly determine the attribute weights, helping patients with different degrees of experience.

There are several directions in which this work can be improved. For example, the utility function of each single attribute is assumed above to be monotonically increasing or decreasing. As Kwok and Lau [58] indicated, price can be a typical non-monotonic attribute: some users just want an intermediate price level and want to spend neither too much nor too little. Future work can also decompose patients’ final choices at a finer granularity, segment the decision makers automatically and provide more fine-grained ranking recommendations.