
1 Introduction

The aim of this paper is twofold. First, we give an overview of the data issued during the BioASQ challenge in 2019. Second, we present the systems that participated in the challenge and evaluate their performance. To this end, we begin with a brief overview of the tasks, which took place from February to May 2019, and of the challenge's data. We then provide an overview of the participating systems; detailed descriptions of some of them are given in the workshop proceedings. The evaluation of the systems, carried out using state-of-the-art measures or manual assessment, is the last focal point of this paper, with remarks regarding the results of each task. The conclusions sum up this year's challenge.

2 Overview of the Tasks

The challenge comprised two tasks: (1) a large-scale biomedical semantic indexing task (Task 7a) and (2) a biomedical question answering task (Task 7b). In this section, a brief description of the tasks is provided, focusing on differences from previous years, together with updated statistics about the corresponding datasets. A complete overview of the tasks and the challenge is presented in [58].

2.1 Large-Scale Semantic Indexing - 7a

In Task 7a the goal is to classify documents from the PubMed digital library into concepts of the MeSH hierarchy. Here, new PubMed articles that are not yet annotated by MEDLINE indexers are collected and used as test sets for the evaluation of the participating systems. Similarly to tasks 5a and 6a, articles from all journals were included in the test datasets of task 7a. As soon as the annotations become available from the MEDLINE indexers, the performance of each system is calculated using standard flat information retrieval measures, as well as hierarchical ones. As in previous years, an online and large-scale scenario was provided, dividing the task into three independent batches of 5 weekly test sets each. Participants had 21 h to provide their answers for each test set. Table 1 shows the number of articles in each test set of each batch of the challenge. 14,200,259 articles, with 12.69 labels per article on average, were provided as training data to the participants.

Table 1. Statistics on test datasets for Task 7a.

2.2 Biomedical Semantic QA - 7b

The goal of Task 7b was to provide a large-scale question answering challenge where the systems had to cope with all stages of a question answering task for four types of biomedical questions: “yes/no”, “factoid”, “list” and “summary” questions [5]. As in previous years, the task comprised two phases: In phase A, BioASQ released 100 questions and participants were asked to respond with relevant elements from specific resources, including relevant MEDLINE articles, relevant snippets extracted from the articles, relevant concepts and relevant RDF triples. In phase B, the released questions were enhanced with relevant articles and snippets selected manually and the participants had to respond with exact answers, as well as with summaries in natural language (dubbed ideal answers). The task was split into five independent batches and the two phases for each batch were run with a time gap of 24 h. In each phase, the participants received 100 questions and had 24 h to submit their answers. Table 2 presents the statistics of the training and test data provided to the participants. The evaluation included five test batches.

Table 2. Statistics on the training and test datasets of Task 7b. All the numbers for the documents and snippets refer to averages.

3 Overview of Participants

3.1 Task 7a

For this task, 12 teams participated and results from 30 different systems were submitted. In the following paragraphs we describe those systems for which a description was available, stressing their key characteristics. An overview of the systems and their approaches can be seen in Table 3.

Table 3. Systems and approaches for Task 7a. Systems for which no description was available at the time of writing are omitted.

The National Library of Medicine (NLM) team, in its "ceb" systems [48], adopts an end-to-end deep learning architecture with Convolutional Neural Networks (CNN) [27] to improve the results of the Medical Text Indexer (MTI) [35]. In particular, they combine text embeddings with journal information. They also consider information about the years of publication and indexing, to capture concept drift and variations in the MeSH vocabulary, respectively. In addition, they experiment with an ensemble of independently trained DL models.
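
As a rough illustration of this kind of architecture, the sketch below (in PyTorch) combines a convolutional text encoder with a journal embedding before a multi-label output layer. All layer sizes, parameter names and the way the journal feature is injected are our own assumptions for illustration, not the authors' implementation, and the year-based features are omitted.

```python
import torch
import torch.nn as nn

class CnnMeshTagger(nn.Module):
    """Toy multi-label tagger: CNN over word embeddings plus a journal embedding."""
    def __init__(self, vocab_size, n_journals, n_labels,
                 emb_dim=200, n_filters=256, kernel_size=5, journal_dim=50):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size, padding=kernel_size // 2)
        self.journal_emb = nn.Embedding(n_journals, journal_dim)
        self.out = nn.Linear(n_filters + journal_dim, n_labels)

    def forward(self, token_ids, journal_ids):
        x = self.word_emb(token_ids).transpose(1, 2)     # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x)).max(dim=2).values   # global max pooling over the sequence
        j = self.journal_emb(journal_ids)                 # (batch, journal_dim)
        logits = self.out(torch.cat([x, j], dim=1))       # one logit per MeSH heading
        return torch.sigmoid(logits)                      # independent per-label probabilities

# Smoke test with random data.
model = CnnMeshTagger(vocab_size=5000, n_journals=300, n_labels=100)
tokens = torch.randint(1, 5000, (4, 128))     # 4 abstracts of 128 token ids
journals = torch.randint(0, 300, (4,))        # journal id of each article
print(model(tokens, journals).shape)          # torch.Size([4, 100])
```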

The Fudan University team builds upon their previous “DeepMeSH” systems, which are based on document to vector (d2v) and tf-idf feature embeddings [43], the MESHLabeler system [28] and learning to rank (LTR). This year, they incorporate AttentionXML [66], a deep-learning-based extreme multi-label text classification model, in the “DeepMeSH” framework. In particular, AttentionXML combines a multi-label attention mechanism, to capture label-specific information, with a shallow and wide probabilistic label tree (PLT) [18], for improved efficiency.
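
A minimal sketch of a multi-label attention layer in the spirit of AttentionXML is shown below: each label attends over the token representations to build its own document summary. The shapes and parameter names are ours, and the probabilistic label tree is omitted.

```python
import torch
import torch.nn as nn

class LabelAttention(nn.Module):
    """Per-label attention over token states, in the spirit of AttentionXML."""
    def __init__(self, hidden_dim, n_labels):
        super().__init__()
        self.label_queries = nn.Parameter(torch.randn(n_labels, hidden_dim))
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, token_states):
        # token_states: (batch, seq_len, hidden_dim), e.g. the output of a BiLSTM encoder
        weights = torch.softmax(
            torch.einsum("bsh,lh->bls", token_states, self.label_queries), dim=-1)
        label_repr = torch.einsum("bls,bsh->blh", weights, token_states)  # label-specific summaries
        return torch.sigmoid(self.score(label_repr)).squeeze(-1)          # (batch, n_labels)

states = torch.randn(2, 50, 128)                # fake encoder output for 2 documents
print(LabelAttention(128, 10)(states).shape)    # torch.Size([2, 10])
```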

The "Iria" systems [52] are based on the same techniques used by their systems for the previous version of the challenge, which are summarized in Table 3 and described in the corresponding challenge overview [38].

The “MeSHProbeNet-P” systems are upgraded versions of MeSHProbeNet [61], which participated in BioASQ6 with the name “xgx”. Their approach is based on an end-to-end deep learning model with an encoder-decoder architecture. The encoder consists of a recurrent neural network with multiple attentive MeSH probes to extract different aspects of biomedical knowledge from each input article. In “MeSHProbeNet-P” the attentive MeSH probes are also personalized for each biomedical article, based on the domain of each article as expressed by the journal where it has been published.

Finally, the "Semantic NoSQL KE" system variants [37] were developed by extending the previous year's "SNOKE" systems. The systems are based on the ZB MED Knowledge Environment [36], utilizing the Snowball Stemmer [1] and the UIMA [56] ConceptMapper to find matches between MeSH terms and words in the title and abstract of each target document, adopting different matching strategies. Paragraph Vectors [24] trained on the BioASQ corpus are used to rank and filter all the MeSH headings suggested by the UIMA-based framework for each document.

Similarly to the previous year, two systems developed by NLM to assist indexers in the annotation of MEDLINE articles served as baselines for the semantic indexing task of the challenge: MTI [35], with some enhancements introduced in [67], and an extension of it that incorporates features of the winning system of the first BioASQ challenge [59].

3.2 Task 7b

The question answering task was tackled by 73 different systems, developed by 18 teams. In the first phase, which concerns the retrieval of information required to answer a question, 6 teams with 23 systems participated. In the second phase, where teams were requested to submit exact and ideal answers, 13 teams with 52 different systems participated. An overview of the technologies employed by each team can be seen in Table 4.

Table 4. Systems and approaches for Task 7b. Systems for which no information was available at the time of writing are omitted.

The "AUTH" team participated in both phases of Task 7b, with a focus on phase B. For the document retrieval task, they experimented with approaches based on the BioASQ search services and ElasticSearch, querying with the conjunction of the words in each question and keeping the top 10 documents. In phase B, for factoid and list questions they used updated versions of their BioASQ6 system [11], based on word embeddings, MetaMap [3], BeCAS [40] and WordNet. For yes/no questions they experimented with different deep learning methods, based on ELMo embeddings [46], SentiWordnet [12] and similarity matrices to represent the question/answer pairs and use them as input for different BiLSTM architectures [11].
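
As a hedged sketch of such a document retrieval step, the snippet below issues a conjunctive query with the official elasticsearch Python client (8.x-style API) and keeps the 10 best hits. The endpoint, index name and field are placeholders; the team's exact query construction is not specified beyond the conjunction of the question words.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")      # placeholder endpoint

def top10_documents(question, index="pubmed"):   # hypothetical index name
    # Require all question terms to match (conjunction) and keep the 10 best hits.
    resp = es.search(
        index=index,
        query={"match": {"abstract": {"query": question, "operator": "and"}}},
        size=10,
    )
    return [hit["_id"] for hit in resp["hits"]["hits"]]

print(top10_documents("What is the role of BRCA1 in breast cancer?"))
```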

The "AUEB" team participated in Phase A, on the document and snippet retrieval tasks, with strong results. They built upon their BioASQ6 document retrieval systems [6, 29], which they modified to yield a relevance score for each sentence, and experimented with BERT and PACRR [30] for this task. For snippet retrieval, they utilize a BCNN [64] model and a model based on POSIT-DRMM (PDRMM) [30]. They also introduce JPDRMM, a novel deep learning approach for joint document and snippet ranking, based on PDRMM [42].

Another approach based on deep learning methodologies for Phase A, focusing again on document and snippet retrieval, was proposed by the "MindLaB" team from the National University of Colombia [47]. For document retrieval they use the BM25 model [53] and ElasticSearch [15] for efficiency, along with a re-ranking scheme based on the Word Mover's Distance [22]. For snippet retrieval, as in the previous approach, they utilize a very large collection of PubMed articles to train a CNN on similarity matrices of question-answer pairs. More specifically, they employ the BioNLPLab w2vec embeddings, which take into account the part of speech of each word. They also deploy the QuickUMLS [55] tool to create a cui2vec embedding for each snippet.
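
The re-ranking step could look roughly like the following sketch, which uses gensim's KeyedVectors.wmdistance (an optimal-transport backend is required). The embedding file, the tokenization and the assumption that a smaller distance means higher relevance are ours.

```python
from gensim.models import KeyedVectors

# Placeholder path; in practice these would be biomedical word2vec vectors.
vectors = KeyedVectors.load_word2vec_format("pubmed_w2v.bin", binary=True)

def rerank_by_wmd(question, candidate_docs):
    """Re-rank BM25 candidates by Word Mover's Distance to the question (smaller = more relevant)."""
    q_tokens = question.lower().split()
    scored = [(vectors.wmdistance(q_tokens, doc.lower().split()), doc) for doc in candidate_docs]
    return [doc for _, doc in sorted(scored)]

candidates = ["BRCA1 mutations increase breast cancer risk.",
              "The weather in Bogota is mild all year round."]
print(rerank_by_wmd("Which gene is associated with breast cancer?", candidates))
```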

The “_sys” systems also participated in Phase A of Task 7B. These systems filter the queries, using stop-word lists and regular expressions, and expand them using word embeddings and pseudo-relevance feedback. Relevant documents are retrieved, utilizing Query Likelihood with bigrams and BM25, and reranked, based on Latent Semantic Indexing (LSI) and document vectors. In particular, document vectors based on averaging sentence embeddings are adopted. Finally, different lists of documents are merged to form the final result, considering the position of the documents in each list.
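
The merging rule is not described in detail; the sketch below uses reciprocal rank fusion as a simple stand-in for position-based merging of several ranked lists.

```python
from collections import defaultdict

def fuse_ranked_lists(ranked_lists, k=60):
    """Merge several ranked document lists by position (reciprocal rank fusion)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + position)   # earlier positions contribute more
    return sorted(scores, key=scores.get, reverse=True)

bm25_run = ["d3", "d1", "d7"]   # e.g. a Query Likelihood / BM25 run
lsi_run = ["d1", "d5", "d3"]    # e.g. an LSI-based re-ranked run
print(fuse_ranked_lists([bm25_run, lsi_run]))   # documents ranked high in both runs come first
```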

In phase B, most systems focused on using embeddings and deep learning methodologies to tackle the tasks. For example, the "BJUTNLP" system utilizes the SQUAD dataset for pre-training. The system uses both GloVe embeddings [45] (fine-tuned during training) and character-level word embeddings (through a 1-dimensional CNN) as input to a BiLSTM model, and for each question a Pointer Network [54] is finally responsible for pinpointing the exact start and end positions of the answer in the relevant snippets.
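
A much simplified stand-in for this kind of span extractor is sketched below: a BiLSTM encoder with separate start and end scoring heads. The actual system uses GloVe and character embeddings and a proper Pointer Network; the dimensions and names here are ours.

```python
import torch
import torch.nn as nn

class SpanExtractor(nn.Module):
    """BiLSTM encoder with start/end heads, a simplified stand-in for a pointer network."""
    def __init__(self, vocab_size, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.start_head = nn.Linear(2 * hidden, 1)
        self.end_head = nn.Linear(2 * hidden, 1)

    def forward(self, token_ids):
        states, _ = self.encoder(self.emb(token_ids))       # (batch, seq_len, 2*hidden)
        start_logits = self.start_head(states).squeeze(-1)  # score per position as answer start
        end_logits = self.end_head(states).squeeze(-1)      # score per position as answer end
        return start_logits, end_logits

model = SpanExtractor(vocab_size=1000)
snippet = torch.randint(1, 1000, (1, 40))                    # one tokenized snippet
start, end = model(snippet)
print(start.argmax(dim=1).item(), end.argmax(dim=1).item())  # predicted span (untrained)
```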

The "BIOASQ_VK" systems were based on BioBERT [25], but with novel modifications to allow the model to cope with yes/no, factoid and list questions [41]. They pre-trained the model on the SQUAD dataset (for factoid and list questions) and SQUAD2 (for yes/no questions) to compensate for the small size of the BioASQ dataset, and by exploiting different pre-/post-processing techniques they obtained strong results on all subtasks.

The “DMIS” systems focused on the importance of the information (words, phrases and sentences) for a given question [65]. To this end, sentence level embeddings based on ELMo embeddings [46] and attention mechanisms facilitated by Dynamic Memory Networks (DMN) [21] are deployed. Moreover, sentiment analysis is performed on yes/no questions to guide the classification (positive corresponds to yes) using the NLTK-VADER [17] tool.
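
The yes/no heuristic could be approximated as in the sketch below, which uses NLTK's VADER analyzer; the averaging and thresholding rule is our assumption, not necessarily the team's exact decision procedure.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-off download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

def yes_no_from_sentiment(snippets, threshold=0.0):
    """Toy rule: positive average compound sentiment over the snippets -> 'yes', otherwise 'no'."""
    compound = sum(sia.polarity_scores(s)["compound"] for s in snippets) / len(snippets)
    return "yes" if compound > threshold else "no"

print(yes_no_from_sentiment(["The drug significantly improved overall survival."]))  # likely 'yes'
print(yes_no_from_sentiment(["No benefit was observed and toxicity was severe."]))   # likely 'no'
```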

The "google" systems [16] focus on factoid questions and are built upon BERT-based models [9], specifically the one in [2] trained on the Natural Questions [23] dataset, while also utilizing the CoQA [50] and BioASQ datasets. They experiment with different inputs to the models, including the abstracts of relevant articles, the provided gold snippets and predicted relevant snippets. In particular, they focus on error propagation in end-to-end information retrieval and question answering systems, reaching the interesting conclusion that the information retrieval part is a bottleneck for such end-to-end QA systems.

Interesting results come from the "L2PS" team, which quantifies the importance of pre-training and fine-tuning models for question answering and views the task under different regimes, namely Reading Comprehension (RC) and Open QA [19]. For the RC regime they use DRQA's document reader [7], while for Open QA they utilize the PSPR model [26]. They experiment with different datasets (SQUAD [49] for RC and Quasar-T [10] for Open QA) for fine-tuning the models, as well as BioBERT [25] embeddings, to gain insights into the effect of context length on this task.

The "LabZhu" [44] systems improved upon their systems from BioASQ6, with a focus on exact answer generation. In particular, for factoid and list questions they developed two distinct approaches: one based on traditional information retrieval techniques, involving candidate answer generation and ranking, and one based on knowledge graphs. In the latter approach, the answer type and the topic entity of the question are predicted, and a SPARQL query is generated from them and used to retrieve candidate answers from the knowledge graph. Finally, the results of the two approaches are combined to form the final answer to the question.
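
The knowledge-graph branch might be sketched as follows with SPARQLWrapper; the endpoint URL, the entity and relation URIs and the query template are placeholders, since the paper does not specify them.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def answers_from_kg(topic_entity_uri, relation_uri,
                    endpoint="https://sparql.example.org/endpoint"):  # placeholder endpoint
    """Build a SPARQL query from a predicted topic entity and relation, return candidate answers."""
    query = f"""
    SELECT DISTINCT ?answer WHERE {{
        <{topic_entity_uri}> <{relation_uri}> ?answer .
    }} LIMIT 20
    """
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["answer"]["value"] for b in results["results"]["bindings"]]

# Hypothetical URIs for illustration only:
# answers_from_kg("http://example.org/entity/BRCA1", "http://example.org/relation/associatedWith")
```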

The Macquarie University ("MQU") team focused on ideal answers, casting snippet relevance as a classification problem [33]. Extending their previous work [31, 32], they mark snippets as relevant to the summary or not, utilizing w2vec embeddings and tf-idf vectors of the question-sentence pairs, and show that a classification scheme is more appropriate than a regression one. In addition, based on their previous work [34], they conduct experiments using reinforcement learning towards the ROUGE score of the ideal answers, as well as a correlation analysis between various ROUGE metrics and the BioASQ human evaluation scores, observing poor correlation of the ROUGE-Recall score with human evaluation.
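
A toy version of this classification view, using only tf-idf features of question-sentence pairs and scikit-learn (the team's actual features also include w2vec embeddings), is sketched below.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: each example is a question paired with a candidate sentence,
# labelled 1 if the sentence should go into the ideal answer, 0 otherwise.
pairs = ["Is aspirin effective for stroke prevention? [SEP] Aspirin reduces stroke risk.",
         "Is aspirin effective for stroke prevention? [SEP] The trial enrolled 200 patients.",
         "What causes cystic fibrosis? [SEP] CFTR mutations cause cystic fibrosis.",
         "What causes cystic fibrosis? [SEP] The study was funded by the NIH."]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(pairs, labels)

# Probability that a new sentence is summary-relevant for its question.
print(clf.predict_proba(
    ["What causes cystic fibrosis? [SEP] Mutations in CFTR are the cause."])[:, 1])
```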

The "UNCC" team focused on factoid, list and yes/no questions [57]. Their work is based on BioBERT [25] embeddings fine-tuned on previous years of BioASQ. They also utilize the SQUAD dataset for factoid answers and incorporate the Lexical Answer Type (LAT) [13] and POS tags, along with hand-made rules, to address specific errors of the system. Furthermore, they incorporate entailment of the candidate sentences for yes/no questions using the AllenNLP library [14].

Finally, the "unipi-quokka-QA" system tackled all the different question types in phase B [51]. Their work focused on experimenting with different language models and embeddings, namely ELMo, ELMo-PubMed, BERT and BioBERT. They used different strategies depending on the question type, such as ensembles for yes/no questions, biomedical named entity extraction (using SciSpacy [39]) for list questions and different pre-/post-processing procedures.
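
Entity extraction for list questions with SciSpacy could look roughly like the sketch below; it assumes the en_core_sci_sm model package is installed, and the deduplication rule is ours.

```python
import spacy

# Assumes: pip install scispacy plus the en_core_sci_sm model package.
nlp = spacy.load("en_core_sci_sm")

def list_answer_candidates(snippets):
    """Collect unique biomedical entity mentions from the relevant snippets."""
    seen, candidates = set(), []
    for snippet in snippets:
        for ent in nlp(snippet).ents:
            key = ent.text.lower()
            if key not in seen:
                seen.add(key)
                candidates.append(ent.text)
    return candidates

print(list_answer_candidates(
    ["Erlotinib and gefitinib are EGFR inhibitors used in lung cancer."]))
```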

In this challenge too, the open-source OAQA system proposed in [63] served as a baseline for phase B. The system, which achieved among the highest performances in previous versions of the challenge, remains a strong baseline for the exact answer generation task. The system is developed on top of the UIMA framework. ClearNLP is employed for question and snippet parsing. MetaMap, TmTool [60], C-Value and LingPipe [4] are used for concept identification, and UMLS Terminology Services (UTS) for concept retrieval. The final steps include identification of concept, document and snippet relevance, based on classifier components and scoring, ranking and reranking techniques.

4 Results

4.1 Task 7a

Each of the three batches of Task 7a was evaluated independently. The classification performance of the systems was measured using flat and hierarchical evaluation measures [5]. The micro F-measure (MiF) and the Lowest Common Ancestor F-measure (LCA-F) were used to choose the winners for each batch [20].
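
For reference, the flat micro F-measure can be computed directly with scikit-learn on binary label-indicator matrices, as in the toy example below (the data is illustrative).

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy label-indicator matrices: rows are articles, columns are MeSH headings.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 1, 1],
                   [1, 0, 0, 1]])

# Micro-averaging pools true/false positives and negatives over all labels.
print("MiF:", f1_score(y_true, y_pred, average="micro"))
```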

According to [8], the appropriate way to compare multiple classification systems over multiple datasets is based on their average rank across all the datasets. On each dataset the system with the best performance gets rank 1.0, the second best rank 2.0 and so on. In case two or more systems tie, they all receive the average rank. Table 5 presents the average rank (according to MiF and LCA-F) of each system over all the test sets for the corresponding batches. Note that the average ranks are calculated for the 4 best results of each system in the batch, according to the rules of the challenge.
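
The average-rank computation with tie handling can be reproduced with scipy, as in the illustrative sketch below (the scores are made up).

```python
import numpy as np
from scipy.stats import rankdata

# Rows: test sets, columns: systems; entries are MiF scores (illustrative numbers).
scores = np.array([[0.68, 0.66, 0.66, 0.62],
                   [0.70, 0.69, 0.68, 0.63],
                   [0.67, 0.67, 0.65, 0.61]])

# Higher score = better, so rank the negated scores; tied systems get the average rank.
ranks = np.vstack([rankdata(-row, method="average") for row in scores])
print(ranks.mean(axis=0))   # average rank of each system across the test sets
```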

Table 5. Average system ranks across the batches of Task 7a. A hyphen (-) is used whenever the system participated in fewer than 4 test sets in a batch. Systems with fewer than 4 participations in all batches are omitted.

The results in Task 7a show that, in all test batches and for both flat and hierarchical measures, some systems outperform the strong baselines. In particular, the "MeSHProbeNet-P" systems achieve the best performance in the first batch, while the "DeepMeSH" systems prevail in the last two batches. More detailed results can be found in the online results page. Comparison of these results with corresponding system results from previous years reveals the improvement of both the baseline and the top performing systems through the years of the competition, as shown in Fig. 1.

Fig. 1. The micro F-measure achieved by systems across different years of the BioASQ challenge. For each test set, the micro F-measure is presented for the best performing system (Top) and for MTI, together with the average micro F-measure of all participating systems (Avg).

4.2 Task 7b

Table 6. Results for snippet retrieval in batch 4 of phase A of Task 7b.
Table 7. Results for document retrieval in batch 3 of phase A of Task 7b. Only the top-10 systems are presented.
Table 8. Results for batch 5 for exact answers in phase B of Task 7b. Only the top-10 systems are presented along with the BioASQ baseline.

Phase A: For phase A and for each of the four types of annotations (documents, concepts, snippets and RDF triples), we rank the systems according to the Mean Average Precision (MAP) measure. The final ranking for each batch is calculated as the average of the individual rankings in the different categories. In Tables 6 and 7 some indicative results from batches 3 and 4 are presented. Full results are available in the online results page of Task 7b, phase A. These results are preliminary; the final results for Task 7b, phase A will be available after the manual assessment of the system responses.
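
As a reference for the ranking measure, the sketch below computes the standard textbook Average Precision and MAP over toy ranked lists; the official BioASQ implementation may differ in details such as the number of documents considered per question.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against the set of relevant documents."""
    hits, precision_sum = 0, 0.0
    for i, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / i     # precision at each relevant position
    return precision_sum / max(len(relevant_ids), 1)

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per question."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

print(mean_average_precision([(["d1", "d4", "d2"], {"d1", "d2"}),
                              (["d9", "d3"], {"d3"})]))
```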

Phase B: In phase B of Task 7b the systems were asked to produce exact and ideal answers. For ideal answers, the systems will eventually be ranked according to manual evaluation by the BioASQ experts [5]. Regarding exact answers, the systems were ranked according to accuracy, F1 score for the prediction of "yes", F1 score for the prediction of "no" and macro-averaged F1 score for yes/no questions, mean reciprocal rank (MRR) for factoid questions, and mean F-measure for list questions. Table 8 shows the results for exact answers for the last batch of Task 7b. These results are preliminary; the full results of phase B of Task 7b are available online. The final results for Task 7b, phase B will be available after the manual assessment of the system responses.
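
For reference, the sketch below computes MRR for factoid questions and the mean F-measure for list questions on toy data; the official evaluation additionally handles answer synonyms, which is omitted here.

```python
def mrr(factoid_runs):
    """factoid_runs: list of (ranked candidate answers, set of gold answers)."""
    total = 0.0
    for candidates, gold in factoid_runs:
        rank = next((i for i, c in enumerate(candidates, start=1) if c in gold), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(factoid_runs)

def mean_list_f1(list_runs):
    """list_runs: list of (predicted answer set, gold answer set)."""
    total = 0.0
    for pred, gold in list_runs:
        tp = len(pred & gold)
        if tp:
            p, r = tp / len(pred), tp / len(gold)
            total += 2 * p * r / (p + r)
    return total / len(list_runs)

print(mrr([(["BRCA2", "BRCA1"], {"BRCA1"})]))                        # 0.5
print(mean_list_f1([({"erlotinib", "gefitinib"}, {"gefitinib"})]))   # ~0.667
```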

The results presented in Fig. 2 show that this year the performance of systems on yes/no questions has clearly improved. In batch 5, for example, presented in Table 8, some systems outperformed the strong baseline based on previous versions of the OAQA system, with the top system achieving almost double the score of the baseline. Some improvement is also observed in the performance of the top systems for factoid and list questions in the preliminary results. However, there is even more room for improvement in these types of questions, as can be seen in Fig. 2.

Fig. 2. The performance achieved by systems in the exact answer generation part of Task 7b, phase B, across different years of the BioASQ challenge. For each test set, the performance of the best performing system (Top) is presented based on the official evaluation measures. Since BioASQ6, the macro-averaged F1 score (macro F1) is the official measure for yes/no questions, but accuracy (Acc), the former official measure, is also presented. The results for BioASQ7 are preliminary; the final results for Task 7b, phase B will be available after the manual assessment of the system responses.

5 Conclusions

In this paper, an overview of the seventh BioASQ challenge is presented. The challenge consisted of two tasks: semantic indexing and question answering. Overall, as in previous years, the best systems were able to outperform the strong baselines provided by the organizers. This suggests that advances over the state of the art were achieved through the BioASQ challenge, but also that the benchmark in itself is challenging. Moreover, the shift towards systems that incorporate ideas from deep learning models, already observed in the previous year, is now even clearer. Novel ideas have been tested and state-of-the-art deep learning methodologies have been adapted to biomedical question answering with great success. Specifically, the breakthroughs in various NLP tasks brought about by new language models, such as BERT and GPT-2, gave birth to approaches that significantly boost the performance of the systems. In the future, we expect novel methodologies, such as the recently proposed XLNet [62], to further cultivate research in the biomedical information systems field. Consequently, we believe that the challenge is successfully pushing the research frontier of this domain. In future editions of the challenge, we aim to provide even more benchmark data derived from a community-driven acquisition process.