1 Introduction

Health care is gradually shifting from a clinician-centered to a patient-centered model. The patient's role has changed from being a passive recipient of health information to an active information seeker and even an information provider. Patient decision-making is a critical component of patient-centered health care. The World Health Organization has stated that patient involvement in care is not only desirable but also a social, economic, and technical necessity [42]. The most important type of support for patients is obtaining suggestions and information when they make decisions during their treatment [23]. However, clinicians tend to focus on the clinical impact of disease and may overlook patients' emotional well-being and daily life.

In today's digital world, it is increasingly common for patients to join online health communities (OHCs) to share information, acquire knowledge, and support one another. OHCs contain multiple message boards or forums used by groups of users to share concerns about health problems and needs [14]. The goal of such communities is to support patients with chronic conditions and provide them with the opportunity to understand their medical conditions [28]. Users aim to interact with patients who have similar problems, to share their health conditions, and to learn how others overcame similar situations. OHCs contain enormous amounts of patient-generated content in the form of threads on various health-related topics. A thread is started by a person with a question, and responses are posted to that thread. Many patients find help in decision-making by asking questions as their illness progresses. If patients can locate relevant information that has already been discussed, they can experience a quick and effective information search, increase their level of satisfaction, and thereby make a quick and informed decision. Nevertheless, owing to information overload, locating relevant information from experienced persons within OHCs is a significant challenge for patients. An answer recommendation system can overcome this challenge.

In general, a recommendation system recommends items or products to users based on their interests or preferences. Conventional recommendation systems rely on two techniques: content-based filtering and collaborative filtering [2]. Whereas content-based techniques recommend items based on item descriptions and user preferences, collaborative filtering recommends items based on the preferences of similar users. For example, in traditional question-answering forums such as Stack Overflow or Quora, the best answers to a user query are recommended based on the votes of community members [39]. Rating/voting is an essential feature used by commercial websites and forums to recommend items [45]. However, traditional recommendation techniques cannot be applied to OHCs owing to the absence of rating/voting information and the lack of a "best answer"; only experiences and opinions are shared instead. On these grounds, the automatic answer recommendation system in the current study provides valuable information and suggestions drawn from the posts of experienced patients responding to individual queries.

For the OHC, the current study considered a large breast cancer forum, Breastcancer.org, for recommending answers to patient queries. Although early detection and timely, efficient care for breast cancer are increasingly available, it is also essential to learn how to manage the condition and preserve the quality of daily life. The factors associated with unmet needs in post-treatment cancer survivors were identified in [26]; the largest was 'being informed about the things one can do to help oneself get well'. Hence, seeking and providing support among similar patients is a critical requirement for people living with cancer. A recent work [3] presented a machine learning approach for discovering health care services created by multiple stakeholders in a social media group, examining Arabic-language Twitter data on cancer. Another study [7] looked at the sentiment dynamics of cancer patients in a social media forum by analyzing the patients' posts. A large number of patients have joined such forums and have discussed various concerns. However, a user who has a problem similar to one discussed in a thread might not want to read all of the posts. Threads often contain many answer posts, and much of the text consists of sympathy and personal stories. Locating relevant information within this content is a significant challenge. Hence, users may prefer a brief summary of the discussion, and they prefer information relevant to their specific requirements. Summarizing these posts is therefore a necessity.

The current study consists of three phases to recommend an answer to a patient query. The first phase is to find a similar query in the archived threads, the second phase is to summarize the answer posts, and the third phase is to recommend the answer. A deep learning architecture, a Siamese network, is used to find the closest query. If no similar query is found, the patient can initiate a new thread or the moderator of the forum is requested to respond. The summarization technique used for the answer recommendation is extractive summarization.

The main objective of the present study is to recommend an optimal answer to a user query if a similar concern has already been raised and discussed in the forum. To recommend the answer, the following tasks are executed:

  i) Build an effective model to find a similar query from the repository by effectively capturing the medical knowledge contained in the query.

  ii) Generate a summary from the answer posts in the thread by optimally selecting the most informative sentences without losing the relevant information.

The rest of the paper is organized as follows: Section 2 describes previous studies on summarization and query similarity tasks. Section 3 explains the overall architecture of the system, the query similarity model, and the summary generation technique. Section 4 discusses the experiments and results analysis. Finally, Section 5 presents the discussion of the study, followed by the conclusion and future directions.

2 Related studies

Generally, the application of recommender systems in health care targets two types of end users: patients and health care professionals [43]. Whereas health professionals benefit from clinical guidelines or research articles for treatment and diagnosis, patients receive suggestions such as diet, exercise, and lifestyle recommendations. For instance, using personalized content, Farrell et al. [15] recommended lifestyle changes and Roitman et al. [33] addressed patient safety. The authors of [13, 34] targeted diabetic patients to improve their eating, exercise, and sleeping habits. The authors of [36] highlighted the importance of Health Recommender Systems (HRS) and proposed a system for assisting in the decision-making processes of both patients and physicians.

To the best of our knowledge, only a few studies on answer recommendation in the health care field have been reported. In [41], the authors described a medical community question-answering system in which the best answer post from a collection of answer posts to a similar question is recommended by considering the quality of the answers, rather than by summarizing the answer posts. The authors of [9] recommended the best posts as a summary using an extractive summarization technique based on textual features of the posts. By contrast, the current study generates a summary by selecting relevant sentences from all the answer posts.

In general, text summarization techniques are either extractive or abstractive. In extractive summarization, key sentences are selected from the original document and kept intact, whereas in abstractive summarization new sentences are constructed from the original text by understanding its content [6, 8]; abstractive summarization is therefore more complex. Summarization can also be categorized as either generic or query-dependent [32]. A generic summary captures the major content of a document, whereas query-based summarization generates the content most relevant to a given query. Because discussion threads start with a query from a user, the approach to summarization in the current study is query-based, and the technique used is extractive summarization.

It is important to determine the semantic relatedness between the current query and archived queries, since the recommendation is the most relevant answer to the user query. Several techniques have been used in the literature to measure textual semantic similarity. Although earlier researchers relied on linguistic features [16], the focus has shifted to neural network techniques. The introduction of word embedding models was a breakthrough for semantic similarity tasks, including cosine similarity over sent2vec [30], Word2vec, Doc2vec [21], and GloVe vectors [4]. The latest additions are contextual word embeddings such as BERT and ELMo. Devlin et al. [12] reported that contextual embeddings are more effective than the above word embeddings in textual entailment tasks. This is because, whereas word2vec or doc2vec generate the same vector for a word regardless of context, BERT generates distinct vectors for the same word in different contexts. BERT's key benefit is its bidirectional learning of word context: it reads the text both left to right and right to left at the same time. Following recent trends, several deep learning architectures such as GRUs, RNNs [20], and Tree-LSTMs [38] have been employed to solve the semantic similarity task. Another study [27] used two identical LSTM networks, called a Siamese network, to measure the semantic relatedness between sentences and reported more promising results than other neural networks. However, these studies relied on large amounts of training data, that is, large numbers of semantically related text pairs. For instance, the studies in [10, 11] address this type of question-question matching task on online user forums. Such approaches do not perform well in online health community forums owing to the presence of many medical terms, and no large medical query-query pair datasets are available. This challenge was recently handled in [25] by applying a transfer learning technique in which a double fine-tuning approach was implemented: the model was first trained with a large medical question-answer dataset and then fine-tuned with a small question-question dataset. Hence, to capture the medical knowledge contained in the query, the current study adopted the concept of transfer learning with a double fine-tuning approach.

To generate the most concise summary related to a query, several optimization techniques have been considered in the literature. Relevant sentence extraction, redundancy between sentences, and content coverage are the critical issues in forming a summary. Genetic algorithms (GA) [17, 46], differential evolution (DE) [19], and particle swarm optimization [5, 31] are some of the optimization techniques proposed. Rautray and Balabantaray [32] proposed a cuckoo search algorithm for multi-document summarization. However, these are all metaheuristic approaches that require extensive parameter tuning and high computational effort. This motivated us to develop an optimal sentence selection method with minimal parameter tuning. In [40], the authors proposed an optimal combination of sentence scoring methods to rank the sentences to be included in the final summary.

Based on the findings of the above studies, we determined that providing the best possible response to a user question is critical. To accomplish this, we must first develop an effective model for locating a similar query in the repository; it is also crucial to capture the medical knowledge conveyed in the query. In the repository, we need to store an optimal answer for each query. To fulfil this objective, the answer must be derived from the reply posts in the respective thread. As a result, it is necessary to choose the most informative sentences without losing essential information.

3 Methodology

This section describes the system for recommending an answer to a particular query related to breast cancer. The recommended answer is a summary of a discussion in the forum by experienced patients, forum moderators, and survivors. These answers contain experiences, suggestions, solutions, viewpoints and, above all, how a particular situation was managed. A Siamese network with a transfer learning technique is implemented to find the similar query. For each similar query, an answer is generated using an extractive summarization technique with an optimization algorithm and stored in the repository.

3.1 Dataset

The dataset used for the study was taken from a large OHC, Breastcancer.org [18], a platform formed mainly by breast cancer patients, survivors, and caregivers. The site is organized into different forums, each consisting of hundreds of threads. The forums considered in the study dealt with “stage I,” “stage II,” “stage III,” “stage IV,” “chemotherapy,” “radiation,” “surgery,” “reconstruction,” “DCIS,” “ILC,” “triple negative,” “HER2+,” “employment and insurance,” etc. Although some of the forums had fewer threads, we took all of the threads from these forums into account. Each thread starts with a user query followed by answers from survivors and caregivers in the form of posts, and a thread can contain hundreds of posts. Some queries/threads received only one reply; these were considered according to their significance. On average, a query had 5 answer posts, with an average of five sentences per post. The medical history of users is private by default, according to the site’s policy. Permission to collect these posts was granted by the site administrators, and the posts were scraped with Python and stored in a database. Nearly 500 queries and their answers were collected from the threads. Two graduate students in health informatics with clinical backgrounds, trained in summarization strategies, manually created the summaries. They read each answer post in a thread and extracted the relevant sentences as the summary, choosing sentences that differed in meaning from one another and were most similar to the query. The summaries were later validated by the clinical experts on the team. The manually created set is referred to here as the Breast Cancer Query-Answer pair (BC-QA) dataset. A sample query, a few of its answers, and the manually created summary are shown in Table 1.

Table 1 A sample query and its manual summary

For the query similarity check, the dataset comprised the same 500 queries used for the summarization task. For each query, a similar and a dissimilar query pair were created under the following conditions: if the thread from which the query was taken contained a similar query, that query was taken as the similar one, because during the course of a discussion in the same thread users often ask similar questions in different ways. If the thread did not contain such a query, the original query was rewritten in a different way, changing the sentence structure as much as possible without losing the meaning. For the dissimilar pair, queries from different forums were considered. In this way, a dataset of 500 queries with similar and dissimilar pairs was constructed, referred to herein as the Breast Cancer Query-Query Pair (BC-QQP) dataset. A sample is shown in Table 2.

Table 2 A sample query and its similar and dissimilar pairs

For the transfer learning technique in the query similarity model, the following datasets were used:

  • The MedQuAD dataset [1] consists of 47,457 question-answer pairs constructed from 12 trusted resources such as NCI (National Cancer Institute), MedlinePlus Health Topics, and MedlinePlus Drugs.

  • The WebMD dataset [29]: WebMD is an online publisher of medical information including articles, videos, and frequently asked questions (FAQ). Using a publicly available crawl of the WebMD FAQ, 46,872 question-answer pairs were extracted.

3.2 Architecture of the proposed system

The architecture of the proposed answer recommendation system is shown in Fig. 1. The first phase of the system is to find, for each new user query, a similar archived query from the archived query-answer pairs. A query similarity model, a Siamese network that is very efficient at finding the similarity between texts, is used during this phase. During the second phase, a summary is generated by the summarization process and archived in a query-answer pair repository. The summarization process applies an extractive summarization technique with optimization to the threaded posts. Finally, in the third phase, the corresponding answer is fetched and recommended to the user. These phases are elaborated upon in Sections 3.3, 3.4, and 3.4.4, respectively.

Fig. 1
figure 1

Proposed Architecture of Recommender System

The overall problem can be stated as follows: Let a new query from a user be qnew and let the summary repository contain a set R = {(Q, S)} = {(q1, s1), (q2, s2), …, (qn, sn)} of n query-summary pairs, where Q = q1, q2, …, qn are the n queries and S = s1, s2, …, sn are the n corresponding summaries. If qnew is similar to any query in the repository, qnew ≈ qj where qj ∈ Q, then the corresponding summary sj ∈ S is recommended to the user.

3.3 Query similarity model using Siamese network

The query similarity model is used to determine the semantic relatedness between two medical queries. As stated in the Related Studies section, the concept of transfer learning enabled us to accomplish this task. A double fine-tuning approach from the transfer learning method was used during this phase [44]. The model was first fine-tuned on a large medical question-answering corpus and then fine-tuned on our small set of labeled question-question pairs. The objective is to integrate medical information into the model so that it can interpret the semantics of each question. The Siamese architecture is described first, followed by the transfer learning technique.

3.3.1 Siamese architecture

A Siamese network classifies a query pair as similar or dissimilar. Figure 2 shows the architecture of the Siamese network, which consists of two identical sub-networks. The absolute difference between the two representations is calculated, and a similarity score is generated by the final sigmoid layer.

Fig. 2
figure 2

Query similarity model - Siamese architecture with LSTM

The inputs to the sub-networks are two queries consisting of sequences of words represented by q1 = (\({x}_1^1,{x}_2^1\), ….\({x}_n^1\)) and q2 = (\({x}_1^2,{x}_2^2\), ….\({x}_m^2\)). Here, q1 is the current query and q2 is an archived query from the query-answer pair repository; \({x}_i^1\) and \({x}_i^2\) are the word sequences in the first and second query, respectively, and n and m are the numbers of words. The first layer in the model is the embedding layer, which uses the BERT embedding technique. BERT is a bidirectional language understanding model trained on large corpora of English text. The queries are fed directly to the embedding layer, where they are embedded as 768-dimensional vectors using the BERT-Base model, and are then passed to the LSTM layer above. To investigate the quality of the embedding technique, Bio-BERT was also used in this experiment. Bio-BERT is the general BERT model further pre-trained on PubMed abstracts and PMC full-text articles for various biomedical text mining tasks [22]. In the second layer, an LSTM is used to represent the queries from the BERT vectors. An LSTM can naturally handle word order and word sequences. Each LSTM unit comprises gated cells known as memory cells, which can either remember or forget information. For each word xi, the cell value hi is computed as a linear combination of the current input and the previous state as follows:

$${h}_i=f\left({w}_h{h}_{i-1}+{w}_x{x}_i\right)$$
(1)

where f is the tanh activation function, wh is the weight of the hidden layer, hi-1 is the previous state, wx is the weight of the current input, and xi is the current input word vector. This calculation is repeated for all hidden states. The final representations of the queries are obtained from the last LSTM units of the sub-networks as vectors \({h}_n^1\) and \({h}_m^2\), respectively. Then, the element-wise absolute difference between the two vectors, \(\left\Vert {h}_n^1-{h}_m^2\right\Vert\), is calculated and fed to the dense layer, where the sigmoid function produces the similarity score S, with values from 0 to 1. If S ≥ 0.5, the predicted label Y is 1 (the queries are similar), and if S < 0.5, the predicted label Y is 0 (the queries are dissimilar). Binary cross-entropy is used as the loss function for each question pair and is defined as follows:

$$Loss=-\left[\hat{y}\,\log(S)+\left(1-\hat{y}\right)\log\left(1-S\right)\right]$$
(2)

where ŷ is the true label, S is the output probability for label 1, and (1 − S) is the output probability for label 0. The stochastic gradient descent method is used to update the parameters (weights and biases) of the sub-networks, computed using backpropagation. The implementation was done in Python using the Keras library.
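The following is a minimal sketch of this Siamese LSTM model in Keras, assuming each query has already been converted to a fixed-length sequence of 768-dimensional BERT token vectors; layer names, the sequence length, and other hyperparameters are illustrative rather than taken from the original implementation.

```python
# Sketch of the Siamese query similarity model (shared LSTM encoder, absolute
# difference, sigmoid output), assuming pre-computed BERT token embeddings as input.
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, EMB_DIM, LSTM_UNITS = 50, 768, 128   # illustrative values

def build_siamese_lstm():
    # Two inputs: the new query and an archived query, as BERT vector sequences.
    q1 = layers.Input(shape=(MAX_LEN, EMB_DIM), name="query_1")
    q2 = layers.Input(shape=(MAX_LEN, EMB_DIM), name="query_2")

    # One shared (identical) LSTM encoder applied to both queries.
    encoder = layers.LSTM(LSTM_UNITS, name="shared_lstm")
    h1, h2 = encoder(q1), encoder(q2)

    # Element-wise absolute difference of the two final representations.
    diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]), name="abs_diff")([h1, h2])

    # Dense sigmoid layer producing the similarity score S in [0, 1].
    score = layers.Dense(1, activation="sigmoid", name="similarity")(diff)

    model = Model(inputs=[q1, q2], outputs=score)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```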

Bidirectional LSTM (BiLSTM) was also used in the sub-networks instead of LSTM in some experiments. Because an LSTM is one-directional, it cannot obtain information from future words in a sentence. A BiLSTM consists of two LSTMs: one reading the input sentence from left to right and the other from right to left. The Siamese network with BiLSTM is shown in Fig. 3. The forward LSTM is provided with the word sequence from left to right, i.e., x1 to xn, and the backward LSTM is fed the word sequence from right to left, i.e., xn to x1. Thus, past information is extracted by the first LSTM and future information by the second. As shown in Fig. 3, the final output \({h}_n^1\) for the first sub-network is obtained by concatenating the output of the forward LSTM, \({h}_n^{1f}\), and that of the backward LSTM, \({h}_n^{1b}\), using Eq. (3). The same procedure is repeated for the second sub-network, and the final output \({h}_m^2\) is obtained through Eq. (4). The final similarity score S and the label Y are calculated in the same way as for the LSTM network explained above.

Fig. 3
figure 3

Query similarity model - Siamese architecture with Bi-LSTM

$${h}_n^1={h}_n^{1f}\,\Vert\,{h}_n^{1b}$$
(3)
$${h}_m^2={h}_m^{2f}\,\Vert\,{h}_m^{2b}$$
(4)
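In Keras terms, the concatenation in Eqs. (3) and (4) corresponds to wrapping the shared encoder in a Bidirectional layer whose default merge_mode="concat" joins the forward and backward final states. A hedged one-line sketch of the substitution into the model sketched in Section 3.3.1:

```python
from tensorflow.keras import layers

# Sketch: swap the shared LSTM encoder for a BiLSTM; merge_mode="concat" concatenates
# the forward and backward final states as in Eqs. (3) and (4).
encoder = layers.Bidirectional(layers.LSTM(128), merge_mode="concat",
                               name="shared_bilstm")
```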

3.3.2 Transfer learning

Transfer learning is implemented as follows. The model is first pre-trained with the MedQuAD dataset, a large medical question-answer dataset, with the objective of predicting the correct answer for a given query. The final tuning is performed with our 500 query-query pairs (BC-QQP); these queries are highly relevant to patients undergoing treatment. The tuning step uses the sentence-similarity objective with a binary cross-entropy loss function, as explained in the Siamese network architecture. For comparison, the model was also pre-trained using the WebMD dataset and then fine-tuned with the BC-QQP dataset. The results of both experiments are tabulated in the Experiment and Results section.
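A minimal sketch of this double fine-tuning procedure, reusing the model sketch above; load_medquad_pairs() and load_bcqqp_pairs() are hypothetical helpers that would return BERT-embedded sequence pairs with binary labels (1 for a true pair, 0 for a random pairing).

```python
# Sketch of double fine-tuning under the stated setup (hypothetical data loaders).
model = build_siamese_lstm()

# Step 1: pre-train on the large medical question-answer corpus (MedQuAD),
# where a true question-answer pair is labelled 1 and a random pairing 0.
(mq_q, mq_a), mq_y = load_medquad_pairs()
model.fit([mq_q, mq_a], mq_y, epochs=10, batch_size=32)

# Step 2: fine-tune the same weights on the small BC-QQP question-question set.
(qq_1, qq_2), qq_y = load_bcqqp_pairs()
model.fit([qq_1, qq_2], qq_y, epochs=10, batch_size=16, validation_split=0.2)
```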

3.4 Summary generation

The second phase of our proposed system is summary generation. The answer to a similar query is fetched from the query-answer pair repository; to make this possible, the answers in the already discussed posts are summarized and stored. In OHCs, it is common for a user to initiate a thread by asking a question, with several experienced users providing different suggestions and opinions based on their experience. The answer posts must be summarized to extract accurate information. As stated in the introduction, the current study uses an extractive summarization technique. In this context, summarization can be considered a multi-document summarization task in which the answer posts to a query constitute the multiple documents. However, an individual post is much smaller than a stand-alone document, typically comprising four to five sentences. Hence, in this work, a summary is created from coherent sentences in the posts without losing the context.

Let Q = q1, q2, …, qn be the n queries in the threads and P = {p1, p2, …, pm} be the set of m reply posts corresponding to a query qj. Each post pi consists of a set of k sentences, pi = {s1, s2, …, sk}. A summary is generated as S = {\({s}_j^i\)}, where 1 ≤ i ≤ m and 1 ≤ j ≤ k, consisting of a subset of sentences drawn from the m posts. The summary should consist of a meaningful subset of sentences that is coherent and non-redundant. The full summary generation process comprises three main steps: pre-processing, sentence ranking, and summary generation, as depicted in Fig. 4 and described as follows:

Fig. 4
figure 4

Summary generation

3.4.1 Pre-processing

The first step involves pre-processing the sentences from the posts. This step consists of sentence segmentation, tokenization, stop-word removal, and stemming. In sentence segmentation, each post pi is divided into individual sentences s1, s2, …. Each sentence is then tokenized into words, sj = {w1, w2, …, wk}. Insignificant words called stop-words, such as 'a', 'the', and 'an', are removed. Finally, the words are converted into their base form using the Porter stemming method. All steps were carried out using Python's NLTK toolkit.
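A short sketch of this pre-processing pipeline with NLTK is shown below; function and variable names are illustrative.

```python
# Sketch of pre-processing: sentence segmentation, tokenization, stop-word removal,
# and Porter stemming, assuming raw post texts as input.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download("punkt"); nltk.download("stopwords")  # one-time resource download
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess_post(post_text):
    """Split a post into sentences and return stemmed, stop-word-free tokens."""
    processed = []
    for sentence in sent_tokenize(post_text):            # sentence segmentation
        tokens = word_tokenize(sentence.lower())          # tokenization
        tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
        processed.append([STEMMER.stem(t) for t in tokens])  # Porter stemming
    return processed
```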

3.4.2 Sentence ranking

After pre-processing, the sentences were converted into sentence vectors using the BERT technique; the Bio-BERT technique was later used for comparison. The similarity score, sim_score (cosine similarity), between the sentences in each post was calculated as in Eq. (5):

$$sim\_score\left({s}_i,{s}_j\right)=\frac{\sum_{k=1}^{d}{s}_{ik}\,{s}_{jk}}{\sqrt{\sum_{k=1}^{d}{s}_{ik}^2}\ \sqrt{\sum_{k=1}^{d}{s}_{jk}^2}}$$
(5)

where si and sj are the sentence vectors of the corresponding sentences and d is the embedding dimension. For each sentence, the similarity score with every other sentence was calculated and then averaged. The sentences were ranked in ascending order of average similarity score, and those with a score below a threshold value were selected. In this way, the least similar sentences of each post were selected and merged to form a single document, D, with the sentences arranged according to the order of the posts in the thread.
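The following sketch illustrates the ranking step for one post, assuming a helper embed() that returns a BERT (or Bio-BERT) sentence vector; the 0.5 threshold follows the value reported in Section 4.2.

```python
# Sketch of sentence ranking: keep the least redundant sentences of each post.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def least_similar_sentences(post_sentences, embed, threshold=0.5):
    """Keep the sentences of one post whose average similarity to the other
    sentences of the same post is below the threshold."""
    if len(post_sentences) < 2:
        return list(post_sentences)            # nothing to compare against
    vectors = [embed(s) for s in post_sentences]
    kept = []
    for i, sentence in enumerate(post_sentences):
        avg_sim = np.mean([cosine(vectors[i], vectors[j])
                           for j in range(len(vectors)) if j != i])
        if avg_sim < threshold:
            kept.append(sentence)
    return kept

# Merging the selected sentences of all posts, in thread order, yields document D.
```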

3.4.3 Summary generation

The next step was to generate a more optimal summary from the single document D. The document from the previous step may still contain similar sentences because they were selected from different posts discussing a similar topic. Therefore, an optimization technique was applied at this step to generate a more precise and non-redundant summary. An optimization score was computed from three features of the sentences, since a good summary should possess three important properties: content coverage, non-redundancy, and cohesion [32].

The summary S contains a set of sentences s1, s2, …, sn that should be related to the content of the discussed topic. Since this study deals with query-based summarization, content coverage was measured as the similarity of the summary to the corresponding query qj, termed cont_cov and computed using Eq. (6):

$$cont\_cov(S)= sim\ \left({s}_{avg},{q}_j\right)$$
(6)

where savg is the average sentence vector of the summary sentences {s1, s2, …, sn} and sim(savg, qj) is the cosine similarity between savg and qj.

The single document produced by the sentence ranking step may still contain redundant sentences. Hence, the dissimilarity among sentences was computed using Eq. (7):

$$non\_redundancy(S)=1-\max_{s_i,s_j\in S}\left(sim\left(s_i,s_j\right)\right)$$
(7)

Cohesion is the conceptual relationship among the sentences in the summary; that is, the sentences must discuss the same idea [37]. According to [40], cohesion is the ratio of the average of all sentence similarities in the summary to the maximum similarity. It was computed using Eq. (8):

$$cohesion(S)=\frac{\log\left({Avg}_{s_i\in S}\left(sim\left(s_i\right)\right)\ast 9+1\right)}{\log\left({Max}_{s_i\in S}\left(sim\left(s_i\right)\right)\ast 9+1\right)}$$
(8)

where \({Avg}_{s_i\in S}\left(sim\left({s}_i\right)\right)\) is the average similarity of all sentences in the summary and \({Max}_{s_i\in S}\left(sim\left({s}_i\right)\right)\) is the maximum similarity value among the sentences.

Then, an optimization score, f(S), was formulated as a weighted sum of the above terms as follows:

$$f(S)=\alpha\left(cont\_cov(S)\right)+\frac{1-\alpha}{2}\left(non\_redundancy(S)\right)+\frac{1-\alpha}{2}\left(cohesion(S)\right)$$
(9)

where the value of α is manually set with 0 < α < 1. More weight was given to content coverage and equal weights to non-redundancy and cohesion: α was set to 0.5, so the other two components each received a weight of 0.25. The algorithm used to generate a summary is shown in Fig. 5. Sentences whose optimization score satisfied f(S) ≥ k, where k is a threshold value, were included in the summary. The generated summary was stored in the repository R = {(Q, S)}.
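A sketch of the optimization score of Eq. (9) under these settings (α = 0.5) is given below. The interpretation of sim(si) in Eq. (8) as pairwise sentence similarity is our assumption, since only the ratio form is stated; embed() is the assumed sentence embedding helper used earlier.

```python
# Sketch of the optimization score f(S) built from Eqs. (6)-(9).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def content_coverage(sent_vecs, query_vec):
    return cosine(np.mean(sent_vecs, axis=0), query_vec)                 # Eq. (6)

def pairwise_sims(sent_vecs):
    return [cosine(u, v) for i, u in enumerate(sent_vecs)
            for j, v in enumerate(sent_vecs) if i < j]

def non_redundancy(sent_vecs):
    sims = pairwise_sims(sent_vecs)
    return 1.0 - max(sims) if sims else 1.0                              # Eq. (7)

def cohesion(sent_vecs):
    sims = pairwise_sims(sent_vecs)
    if not sims:
        return 1.0
    return np.log(np.mean(sims) * 9 + 1) / np.log(max(sims) * 9 + 1)     # Eq. (8)

def optimization_score(sentences, query, embed, alpha=0.5):
    sent_vecs = [embed(s) for s in sentences]
    query_vec = embed(query)
    return (alpha * content_coverage(sent_vecs, query_vec)
            + (1 - alpha) / 2 * non_redundancy(sent_vecs)
            + (1 - alpha) / 2 * cohesion(sent_vecs))                     # Eq. (9)
```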

Fig. 5
figure 5

Algorithm for summary optimization

3.4.4 Answer recommendation

In this phase, the archived query most closely related to the user query is found first: the query qj matching the user's new query qnew is obtained from the query similarity model, qnew ≈ qj, where qj ∈ Q. Then, the answer sj to the matched query is fetched from the query-answer pair repository R = {(Q, S)} and recommended to the user.
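A minimal sketch of this lookup is shown below, assuming the trained Siamese model from Section 3.3 and an embed() helper that produces the same padded BERT vector sequences used during training; the 0.5 cut-off mirrors the model's similarity threshold and is an illustrative choice.

```python
# Sketch of the recommendation step: score the new query against each archived query
# and return the stored summary of the best match.
import numpy as np

def recommend_answer(q_new, repository, model, embed, cutoff=0.5):
    """repository is the set R = {(q_j, s_j)} of archived query-summary pairs."""
    v_new = np.expand_dims(embed(q_new), axis=0)          # add batch dimension
    best_summary, best_score = None, 0.0
    for q_j, s_j in repository:
        v_j = np.expand_dims(embed(q_j), axis=0)
        score = float(model.predict([v_new, v_j], verbose=0)[0, 0])
        if score > best_score:
            best_summary, best_score = s_j, score
    # If no archived query reaches the cut-off, no answer is recommended and the
    # user can start a new thread or ask a forum moderator.
    return best_summary if best_score >= cutoff else None
```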

4 Experiment and results

This section describes the performance evaluation of the query similarity model and the summarization. We tested and evaluated twelve query similarity models using BERT and BioBERT embedding techniques as well as transfer learning. We also created and evaluated four different summaries using the two embedding methods, with and without the optimization step.

4.1 Performance evaluation of query similarity model

The query similarity task was carried out using the Siamese network with the different embedding techniques, BERT and Bio-BERT, and with LSTM and BiLSTM units. To establish a baseline, the BC-QQP dataset was fed directly to the Siamese architecture, without any pre-training, and the similarity was assessed. The number of LSTM and BiLSTM units was 128, the number of epochs was 10, the learning rate was 0.01, and the dropout was 0.2. Eighty percent of the dataset was used for training and 20% for testing. The results are listed in Table 3. In total, 12 models were compared. Among the base models, model 4, the BiLSTM architecture with Bio-BERT embedding of the queries, yielded the best result, with an F1-score of 65%, outperforming the BERT embedding technique.
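As a sketch, the base-model training setup described above (80/20 split, 10 epochs) could look as follows; q1_vecs, q2_vecs, and labels are assumed to be the prepared BC-QQP arrays of embedded query pairs and binary labels, and build_siamese_lstm() is the sketch from Section 3.3.1 (with dropout=0.2 passed to its LSTM layer).

```python
# Sketch of the base-model training setup with an 80/20 train-test split.
from sklearn.model_selection import train_test_split

(q1_tr, q1_te,
 q2_tr, q2_te,
 y_tr, y_te) = train_test_split(q1_vecs, q2_vecs, labels,
                                test_size=0.2, random_state=42)

model = build_siamese_lstm()                      # sketch from Section 3.3.1
model.fit([q1_tr, q2_tr], y_tr, epochs=10, batch_size=32)
loss, accuracy = model.evaluate([q1_te, q2_te], y_te)
```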

Table 3 Evaluation of query similarity

To assess the efficacy of transfer learning, the model was pre-trained with the MedQuAD and WebMD datasets and the results were compared. Each query-answer pair in those datasets was labeled positive ('1'), and each query paired with a random other answer was labeled negative ('0'). These labeled pairs were embedded with the BERT technique and fed to each sub-network of the Siamese architecture; the experiment was also repeated with the BioBERT technique. The model was then trained for 10 epochs to classify the pairs as '1' or '0'. After this pre-training, 80% of the BC-QQP dataset was fed to the model for fine-tuning, and the model was finally tested on the remaining 20%. The most promising result was obtained with model 8, pre-trained on the MedQuAD dataset, which achieved an F1-score of 85.5% (highlighted in bold in Table 3), compared to pre-training on the WebMD dataset. This is an improvement of more than 20% in F1-score over the base model, model 4. The training loss over 10 epochs is also plotted for both the LSTM (model 7) and BiLSTM (model 8) models, as shown in Fig. 6. The BiLSTM model reduces the training loss more effectively than the LSTM model.

Fig. 6
figure 6

Comparison of LSTM and BiLSTM models training loss

4.2 Performance evaluation of summarization

The BC-QA dataset consists of 500 queries and their replies. After pre-processing, the sentence vector for each sentence in the reply posts was built using the BERT technique. Then, the average cosine similarity of each sentence in the posts was computed, and the sentences with an average score of less than 0.5 were chosen to construct the single document, from which a summary was generated by applying the optimization algorithm. The optimal sentences for the summary were selected by setting the threshold value k to 0.7. To assess the effect of the sentence embedding on summary generation, the Bio-BERT technique was also used to generate the sentence embeddings.

An automatic summary evaluation metric, Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [24], was used to test the quality of the generated summaries. The ROUGE score is calculated from the n-gram overlap between the system-generated summary and the manual summary, according to the following equation [35]:

$$ROUGE\text{-}N=\frac{\sum_{S\in \mathrm{manual\ summary}}\ \sum_{{gram}_n\in S}{Count}_{match}\left({gram}_n\right)}{\sum_{S\in \mathrm{manual\ summary}}\ \sum_{{gram}_n\in S}Count\left({gram}_n\right)}$$
(10)

where gramn denotes an n-gram in the manual summary and Countmatch(gramn) is the number of n-grams co-occurring in the manual summary and the system-generated summary. A sample system-generated summary and manual summary are shown in Table 4.
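A minimal sketch of the ROUGE-N recall computation of Eq. (10) for a single summary pair is given below; whitespace tokenization is an illustrative simplification.

```python
# Sketch of ROUGE-N recall: clipped n-gram overlap between the system summary and
# the manual (reference) summary, divided by the reference n-gram count.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(system_summary, manual_summary, n=1):
    sys_counts = Counter(ngrams(system_summary.lower().split(), n))
    ref_counts = Counter(ngrams(manual_summary.lower().split(), n))
    # Count_match: overlap of each reference n-gram with the system summary, clipped.
    overlap = sum(min(count, sys_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0
```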

Table 4 A sample of System-generated summary and manual summary

Separate ROUGE scores can be obtained for n = 1, 2, 3, 4-gram matching. Among these, the greatest agreement with the manual summary was obtained for n = 1 and 2, that is, the unigram- and bigram-based scores; both are tabulated in Tables 5 and 6. Conventionally, a summarization task is validated against more than one human-written summary: several teams write summaries, inter-team agreement is checked, and the system-generated summary is validated against them. However, because of the highly medical content of our posts, the reference summaries in this study were produced by a single clinical team.

To evaluate the effectiveness of our proposed summarization technique, four summaries were generated based on the BERT and BioBERT vectorization techniques, with and without the optimization step. The summary generated after the sentence ranking step was considered the summary without optimization. Two ROUGE scores, corresponding to unigram and bigram word lengths, were compared; Tables 5 and 6 show the ROUGE-1 and ROUGE-2 scores, respectively. Here also, the Bio-BERT embedding technique gave better results than BERT. Note that with the optimization step, the ROUGE score improved significantly, by approximately 8%. A ROUGE-1 score of 49.1% was obtained using Bio-BERT with the optimization step for unigrams, and a ROUGE-2 score of 26.7% was obtained for bigrams with the same approach.

Table 5 ROUGE-1 score of summaries generated
Table 6 ROUGE-2 score of summaries generated

Precision, Recall, F1-score, and Accuracy metrics were also used to assess the four summaries, based on the correctly retrieved sentences in the system-generated summary. They are defined as follows (a small computation sketch is given after the list):

  1. Precision: the ratio of the number of retrieved correct sentences to the total of retrieved correct and retrieved incorrect sentences in the summary.

  2. Recall: the ratio of the number of retrieved correct sentences to the total of retrieved and non-retrieved correct sentences.

  3. F1-score: the harmonic mean of Precision and Recall.

  4. Accuracy: the ratio of the total of retrieved correct and non-retrieved incorrect sentences to the total number of sentences.
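A sketch of these sentence-level metrics, treating the manually selected sentences as the "correct" set and the system summary as the retrieved set; names are illustrative.

```python
# Sketch of sentence-level Precision, Recall, F1-score, and Accuracy.
def sentence_metrics(retrieved, correct, all_sentences):
    retrieved, correct = set(retrieved), set(correct)
    tp = len(retrieved & correct)                        # retrieved correct
    fp = len(retrieved - correct)                        # retrieved incorrect
    fn = len(correct - retrieved)                        # non-retrieved correct
    tn = len(set(all_sentences) - retrieved - correct)   # non-retrieved incorrect
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(all_sentences) if all_sentences else 0.0
    return precision, recall, f1, accuracy
```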

The graphical depiction of these metrics for the four summaries is shown in Fig. 7. Bio-BERT with the optimization technique achieved the highest accuracy, 82%, and its Precision, Recall, and F1-score are also high compared with the other summaries. These outcomes again demonstrate the effectiveness of the optimization technique.

Fig. 7
figure 7

Comparison of Precision, Recall, F1-score and Accuracy of four summaries

4.3 Qualitative analysis

To gain better insight into our query similarity model and summarization technique, we performed an error analysis. Table 7 shows some instances of query pairs with their actual labels and the labels predicted by query similarity models 4, 8, and 12. These models were considered because the Bio-BERT with Bi-LSTM architecture performed better than the other architectures. All models understood query pair 2 well but failed on query pair 3. In pair 3, although the question was about the position during radiation, the terms 'arm position' and 'rads position' as well as the negation ('but doesn't look likely') could be the reasons all models gave the wrong result. However, model 8 provided the correct result for all other query pairs.

Table 7 Sample query pairs with actual and predicted labels

To further clarify the ability of the best model, model 8, to understand query pairs, additional analysis was carried out. Table 8 demonstrates such an analysis, in which a query was paired with variants containing slight tweaks. Query pair 3 from Table 7 was taken and modified repeatedly until the model labeled it correctly. This sample shows the point at which the model predicts the output correctly and also highlights the need for additional training data. These modified query pairs were not considered in the quantitative analysis; they were used only to understand the best model. The two analyses above give a clear insight into the interpretability of our model and the need for additional training data.

Table 8 Sample query pairs with different changes in same pairs

Table 9 shows the qualitative analysis of our proposed summarization technique, with and without optimization, using Bio-BERT embedding. Without optimization, the second and third lines convey almost the same meaning; with the optimization technique, the second sentence is removed and the third sentence, which is closer to the query, remains. Compared with the manual summary, one sentence (text in italics, "i finished a year and half ago…") is an extra, even though it conveys detailed information. Another point is that some sentences in the summary contain extra fragments; for instance, "i had a doc tell…" (text in italics) in the first line and "i bet" in the last sentence are additional parts of the summary with optimization. This is because the algorithm treats sentences as whole units rather than extracting parts of them, something that can be handled during manual summary creation.

Table 9 Comparison of summary without optimization and with optimization

5 Discussion

5.1 Principal findings

Answer recommendation is extremely popular in community forum sites, but its applicability in health community forums has not been extensively studied. The few experiments conducted on medical community forums to recommend answers all focus on selecting the best answer based on a voting system. In most health community forums, even though survivors or expert patients share their experiences, posts are not rated. For instance, if a patient expresses a concern or query about a side effect, treatment procedure, or a particular situation, others share their experiences, suggestions, and positive or negative feedback. In this context, the current study is significant in providing a precise answer that considers all the responses shared by experienced patients who responded to the user's concern. More samples of system-generated summaries with their corresponding queries and similar queries are shown in the Appendix.

For several reasons, patients are reluctant to share their concerns with health care providers. This may be due to embarrassment, or to not even being conscious of the existence of the problem. An example of such a situation is shown in Table 10. Hence, in this context, an automated patient-centric recommendation system is a way to enable users to obtain advice or answers addressing their concerns.

Table 10 A Sample query and its answer

The performance evaluation shows that the query similarity model with transfer learning is very effective in capturing the medical knowledge in our dataset. Since the BC-QQP dataset is specific to breast cancer patients, it contains many drug- and cancer-related terms. With the MedQuAD dataset, the model can generalize and learn the medical queries well, because MedQuAD contains more question-answer pairs related to cancer and cancer-related drugs than the WebMD dataset. When we compared the findings of our proposed method with some of the studies listed in the related studies, we found our method to be effective. A very recent study [25] on similar queries related to COVID-19 achieved an accuracy of 84.2%; their approach was based on the BERT language model and used more than 3000 COVID-19 query pairs. With our Siamese network and Bio-BERT embedding technique, our model achieved a more promising result of 85.5%. When we compared our results to [41], our model outperformed theirs: their precision in the first 20 results was 86%, whereas our model's was 89%. The summary generated with optimization also increased the efficacy of the summarization phase. The optimization score for the generated summary was obtained by varying the tuning parameter k, which was ideal at k ≥ 0.70. Compared with the summarization approach of Rautray and Balabantaray [32], the results show a substantial improvement: their best approach achieved a ROUGE-1 score of 0.43, whereas our technique produced a ROUGE-1 score of 0.49.

6 Conclusion and future direction

This is a pioneering study in the area of breast cancer intended to help patients make quick and informed decisions. The study demonstrated a system for recommending an answer to a particular query related to breast cancer. The recommended answer is a summary of a discussion in the forum by experienced patients, forum moderators, and survivors. These answers contain experiences, suggestions, solutions, viewpoints and, above all, how the contributors managed a particular situation. The study comprises three main phases: similar query retrieval, summary generation, and answer recommendation. The Siamese network was very successful in finding similar queries: the Bi-LSTM with Bio-BERT embedding outperformed all other models, providing an F1-score of 85.5%, a more promising performance than many previous studies. The optimization approach was also highly effective in creating the summary, as shown by the ROUGE-1 score of 0.49 in the summarization phase. The query-answer pairs created in the study offer great insight into the multiple challenges that patients encounter during breast cancer treatment, and this information can help clinicians address the issues patients face beyond clinical matters.

The study was limited to 500 query-answer pairs in the area of breast cancer, and we therefore intend to include a greater number of queries and their specific answers in the future. Although the question similarity model worked well, there were still a few cases in which it was unable to assess similarity correctly. For instance, two queries, "did you have more work done after the initial diep surgery? i'm wondering if i should expect a lot of swelling in my abdomen from the liposuction?" and "i am having bi-lateral diep flap reconstruction. what kind of help do i really need after i get home?", were predicted to be similar despite being different. Both queries were about DIEP flap reconstruction surgery, but the first focused on post-surgery complications and the second on how to manage the post-surgery state. This indicates that the model requires more tuning to handle queries that are closely related but conceptually distinct and to generalize similarities. Future work should address such concerns. The present study can also be extended to other health forums to gain awareness of other areas of concern.