1 Introduction

In Germany, there are about five applicants per place at medical schools each year (Foundation for Higher Education Admission, 2018). This high popularity of medical education is seen in many other countries as well. In Germany, medical education at public medical schools is free for the learners but very cost-intensive for the government. Thus, responsible government agencies, as well as medical schools, must develop suitable selection procedures. These procedures should be based on criteria with predictive validity of which graduates are most likely to become the best students and, even more importantly, the best doctors. There is, however, not a single commonly agreed upon definition of a good medical student/a good doctor.

To date, the main selection criterion for medical education in Germany is the pre-university (high school) grade point average (pu-GPA; ranging from 0.7 [best] to 4.0 [worst]) (Federal Ministry of Justice & Consumer Protection, 2018). According to a decision by the Federal Constitutional Court (special court reviewing judicial and administrative decisions/legislation for compliance with the German constitution), the selection must be supplemented with at least one criterion that is independent of the pu-GPA (Federal Constitutional Court, 2017). The revision of the selection process thus provided an opportunity to introduce new criteria that cover, in addition to academic capability, (inter)personal skills as well. This is in line with international initiatives and regulations: e.g. Ireland, Australia and New Zealand use non-academic attributes to complement secondary grades as a selection criterion for medical students (O’Sullivan et al., 2017).

One core quality of both a good medical student and doctor appearing in several respective frameworks is empathy. Empathy is a core component of emotional intelligence, and a general consensus among medical education leaders and professional organisations exists on the high importance of empathy for success in medical education and patient care (Hojat et al., 2011). There is evidence for a positive influence of empathy on Objective Structured Clinical Examinations (OSCE) communication scores (Casas et al., 2017). Several studies show a positive influence of doctors’ empathy on health outcomes in patients (Coulehan et al., 2001; Beckman et al., 1994). Although there is no single definition of empathy (Jeffrey, 1994), different stakeholder groups, such as doctors, patients and students themselves, rate ‘interpersonal qualities’ containing empathy consistently as an important feature of good doctors (Steiner-Hofbauer et al., 2018). Mercer and Reynolds define empathy in the context of healthcare as a complex, multidimensional construct that includes understanding the patient, reflecting your understanding, checking whether you understood the patient correctly and acting upon that understanding in a therapeutic way (Mercer & Reynolds, 2002).

Whether empathy is a stable personality trait or an evolving and changing ability with cognitive and emotional aspects is a matter of controversy. We cannot be sure whether empathy scores are stable, increasing or declining during medical education, and self-report measures for assessing empathy in medical students and candidates have drawn some criticism (Colliver et al., 2010; Costa et al., 2013; Ferreira-Valente et al., 2017). Whatever the final outcome of this discussion, empathy is worth considering as a selection criterion for medical school. If empathy proves to be a stable trait, it would be preferable to select applicants with high levels of empathy; if it proves to be evolving and trainable, it would still be worthwhile to admit students who start from high empathy levels and can reach even higher ones.

As long as personal interviews are part of the selection process, the empathy of applicants can be assessed externally. Another widely used possibility would be an empathy self-assessment during the selection procedure. For that, different standardised questionnaires are available, such as the Jefferson Scale of Empathy (JSPE) (Hojat et al., 2002) and the Davis Interpersonal Reactivity Index (IRI) (Davis, 1983). For both instruments, validated German-language versions are available (Neumann et al., 2012) and both have been used widely to assess medical students’ empathy (Quince et al., 2016). Neither of the two, however, has been validated for selection purposes (Hemmerdinger et al., 2007). Furthermore, self-assessments bear the risk of socially desirable responding, especially in selection situations (Edwards, 1957). However, the same problem may occur during selection interviews. Hence, it is still unclear whether self-assessment or external assessment of empathy is a feasible method, or the more feasible of the two, in the context of medical student selection.

We therefore aimed to compare the results of two methods of empathy appraisal during the selection process for medical school: self-assessment and external assessment by selection panel members. Furthermore, we aimed to shed light on how the discrepancies between self-assessment and external assessment of empathy can be understood and explained. Therefore, we explored the empathy concepts of selection panel members and compared the results with the empathy concept the self-assessment tool (IRI) is based on.

2 Methods

We conducted a cross-sectional mixed-methods study (sequential explanatory design) (Creswell & Clark, 2017; O’Cathain et al., 2010) with a quantitative study phase conducted in 2016, followed in 2017 by a qualitative study phase to further explore and explain the quantitative findings.

2.1 Setting

The study was conducted at Lübeck Medical School (LMS), a section of the public University of Lübeck, Germany. About 1,500 students are enrolled in the medical study programme and about 185 freshmen (high school students, in part with a completed vocational training or first university degree) are admitted to LMS each year. Of these, until 2019, 65 (35%) were allocated by a public agency based on pu-GPA or waiting time; 120 (65%) were selected from 240 direct applicants who did not meet the competitive criteria via an internal selection procedure (Auswahlverfahren der Hochschulen, AdH).

The core of the AdH at LMS was a 30-min interview led by two faculty members and one student (selection panel). Selection panel members participated in a mandatory briefing and new selection panel members were encouraged to participate in interview training (Mommert et al., 2020). For the interviews, the selection panel members were provided with a standardised interview guide that gave examples of situational and behavioural questions. Five primary dimensions had to be rated by selection panel members: motivation, knowledge about the course of study, social engagement, (self-)reflection and communication. Each dimension contained five items that were rated on a 5-point Likert scale from 0 to 4. The interviews took place each year in the middle of August on the campus of LMS. We collected all data for this study in August 2016 (quantitative data) and 2017 (qualitative data) on site.

2.2 Quantitative study

2.2.1 Participant selection and recruitment

For the survey, we investigated applicants for LMS who participated in the AdH at LMS in August 2016. Overall, 240 applicants were invited to the interviews in 2016, of whom 228 accepted the invitation and were asked to take part in the survey (no exclusion criteria). We offered a voucher (5 €) as an incentive for participation.

2.2.2 Data collection

The self-assessment part of the survey was web-based (using SurveyMonkey; Survey Monkey Europe UC, Dublin, Ireland) and was conducted in a computer pool close to the interview location directly after the interview.

For this study, we collected data on age, gender, pu-GPA and a self-assessment of empathy using the German version of the IRI (Neumann et al., 2012). Empathy is measured through four facets covered by four items each on a 5-point Likert scale from 0 = ‘does not describe me well’ to 4 = ‘describes me very well’:

  1. Perspective taking (ability to adopt another’s point of view)

  2. Fantasy (ability to empathise with fictional characters)

  3. Empathic concern (ability for other-oriented emotions, such as pity)

  4. Personal distress (self-oriented emotions such as uneasiness, which may occur in close or problematic interpersonal interactions)

Being a self-oriented emotion, personal distress is not considered in the empathy score.
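
The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring syntax used in the study: the item names (`pt1`, `fs1`, ...) and their assignment to facets are hypothetical placeholders, and item ratings are assumed to be numerically coded already.

```python
# Hypothetical item-to-facet mapping; the actual IRI item order is not given in the text.
FACETS = {
    "perspective_taking": ["pt1", "pt2", "pt3", "pt4"],
    "fantasy": ["fs1", "fs2", "fs3", "fs4"],
    "empathic_concern": ["ec1", "ec2", "ec3", "ec4"],
    "personal_distress": ["pd1", "pd2", "pd3", "pd4"],
}

# Personal distress is self-oriented and therefore excluded from the empathy score.
SCORED_FACETS = ("perspective_taking", "fantasy", "empathic_concern")


def facet_scores(responses):
    """Sum the item ratings within each of the four facets."""
    return {
        facet: sum(responses[item] for item in items)
        for facet, items in FACETS.items()
    }


def empathy_score(responses):
    """Sum the three other-oriented facets; personal distress is left out."""
    scores = facet_scores(responses)
    return sum(scores[facet] for facet in SCORED_FACETS)


# Made-up applicant who rates every item 3, except one personal distress item.
example = {item: 3 for items in FACETS.values() for item in items}
example["pd1"] = 0
print(facet_scores(example))
print(empathy_score(example))
```

Keeping the facet scores separate from the sum score makes it straightforward to correlate each subscale with the external rating as well as the total.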

The basic psychometric quality of the German version of the IRI is comparable to the original version (acceptable Cronbach’s alpha for the four subscales; Neumann et al., 2012).

Besides surveying all applicants, we asked the selection panel members to rate the empathy of each applicant at the end of the interview, in addition to the other dimensions mentioned above. 2016 was the first year in which a global empathy rating was used. Empathy was rated on a 5-point Likert scale (0 = not empathic; 4 = utmost empathic) without further introductory text except the request to rate the applicants’ empathy. For this study, we also extracted the score for overall impression (rated by the selection panel on a 5-point Likert scale from 0 to 4).

2.2.3 Data handling

Data from the web survey was imported into and analysed using IBM SPSS Statistics for Windows Version 22.0 (IBM Corp., Armonk, NY, USA). After a plausibility check (e.g. looking for implausible age and pu-GPA information), data from the interview score sheets was matched to the web survey data using consecutive numbers assigned to the participants.
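
The plausibility check and matching step can be illustrated with a short Python sketch. The study itself used SPSS; the field names (`participant_no`, `pu_gpa`, `external_empathy`) and the age bounds below are hypothetical. Skipping implausible records is also an assumption, since the text does not say how such records were handled.

```python
def plausible(record, min_age=16, max_age=60, best_gpa=0.7, worst_gpa=4.0):
    """Return True if age and pu-GPA lie in the expected range.

    The pu-GPA range (0.7 to 4.0) follows the text; the age bounds are assumed.
    """
    return (min_age <= record["age"] <= max_age
            and best_gpa <= record["pu_gpa"] <= worst_gpa)


def match_records(survey_rows, score_sheets):
    """Join survey rows to interview score sheets on the participant number."""
    sheets_by_no = {sheet["participant_no"]: sheet for sheet in score_sheets}
    matched = []
    for row in survey_rows:
        sheet = sheets_by_no.get(row["participant_no"])
        # Keep only rows that have a score sheet and pass the plausibility check.
        if sheet is not None and plausible(row):
            matched.append({**row, **sheet})
    return matched


# Made-up records for illustration only.
survey_rows = [
    {"participant_no": 1, "age": 20, "pu_gpa": 1.5, "iri_sum": 40},
    {"participant_no": 2, "age": 200, "pu_gpa": 1.3, "iri_sum": 42},  # implausible age
    {"participant_no": 3, "age": 19, "pu_gpa": 1.1, "iri_sum": 45},   # no score sheet
]
score_sheets = [
    {"participant_no": 1, "external_empathy": 4},
    {"participant_no": 2, "external_empathy": 3},
]
print(match_records(survey_rows, score_sheets))
```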

2.2.4 Data analysis

We analysed the data using descriptive statistics. For gender, we calculated percentages, and for continuous variables, we calculated means (M) and standard deviations (SD). We used t tests to compare the means of continuous variables. In order to express bivariate correlation between the empathy self-assessment (IRI sum score) and the empathy assessment by the selection panel members (external assessment), we used Spearman’s ρ. Effect sizes are reported as r, with values of < 0.30 considered small, ≥ 0.30 medium and ≥ 0.50 large. All statistical tests were performed two-tailed with an alpha of 0.05.
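
For readers who want to retrace the statistics, the following Python sketch shows one way to compute Spearman’s ρ (as the Pearson correlation of average ranks), a two-sample t statistic and the effect-size labels named above. It is an illustration using the standard library only, not the SPSS procedure used in the study. The pooled-variance form of the t test is an assumption, and p values are not computed here.

```python
from math import sqrt
from statistics import mean, variance


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the tie
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation; undefined if either variable is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def pooled_t(group_a, group_b):
    """Student's t statistic and degrees of freedom (equal variances assumed)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


def effect_size_label(value):
    """Apply the thresholds from the text: < 0.30 small, >= 0.30 medium, >= 0.50 large."""
    size = abs(value)
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    return "small"


# Made-up ratings for illustration only.
self_rating = [40, 45, 38, 50, 42]
external_rating = [4, 3, 4, 4, 2]
print(spearman_rho(self_rating, external_rating))
print(pooled_t([4, 3, 4, 4], [3, 2, 4, 3]))
```

Average ranks matter here because a 5-point rating produces many ties, which the simplified rank-difference formula for ρ does not handle.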

2.3 Qualitative study

Quantitative data showed no correlation between externally assessed and self-assessed empathy, but a high correlation between externally assessed empathy and the score for overall impression (see Sect. 3.1.3). To follow up on this rather surprising finding and to find an explanation for it, we deemed a qualitative study necessary. As the qualitative study was not planned in advance, there is a 1-year time lag between the quantitative and qualitative parts of our mixed-methods study.

2.3.1 Participant selection and recruitment

All (n = 48) selection panel members who participated in the AdH in August 2017 (n = 24 faculty members and n = 24 students; n = 23 female and n = 25 male) were eligible for the focus groups. We invited all participants of the mandatory briefing for the selection procedure in July 2017 during a short oral presentation of the study to attend the focus groups. We did not invite the non-attending selection panel members, because we could not give the short oral presentation on our project to them. The briefing session was attended by 30 selection panel members. We offered a voucher (10 €) as an incentive. Participants received (oral and written) information, could ask further questions and gave their informed consent for the focus groups to be recorded and transcribed verbatim and the results to be published anonymously.

2.3.2 Data collection

For our qualitative study, we (TK, JCS and NJP) developed a focus group topic guide (Table 1) including the following topics:

  • Short introduction of interviewer and study

  • Subjective definition and meaning of empathy in the clinical setting

  • Differentiation of empathy and sympathy/overall impression

  • Learnability of empathy during medical education

  • Individual basis for the empathy assessment

  • Evaluation of the usefulness of empathy appraisal during the selection interviews

Table 1 Focus group topic guide

The focus group topic guide was pilot tested during two preliminary interviews and two preliminary focus groups (n = 7 participants) with physicians not acting as selection panel members and selection panel members not available during the actual data collection period. During this development phase, subdividing the topic guide into three parts was found to be suitable.

All focus groups were conducted in German. The focus groups took place in seminar rooms with no one present besides the participants and researchers. We used digital audio recording to collect the data. In addition, both facilitators made field notes during the focus groups. All focus groups were transcribed verbatim by JCS and a trained research assistant following designated transcription rules (Mayring, 2000). Transcripts were anonymised during transcription. To facilitate a distinction during data analysis, faculty members were given the letter ‘P’ and students the letter ‘S’, followed by a consecutive number. The interviewers were marked ‘I1’ and ‘I2’. Transcripts were checked for accuracy by TK. We did not return the transcripts to the focus group participants, as this does not seem to be the usual procedure in studies using focus groups and qualitative content analysis and would have placed an undue demand on the participants.

All focus groups were moderated by both TK and JCS. TK is a male MD who is board certified in Family Medicine, MSc Public Health, and qualified as a professor, with extensive experience in the field of medical curriculum research and in conducting focus groups (Kötter et al., 2015, 2020; Kötter, Carmienke, et al., 2014; Kötter, Ritter, et al., 2016; Kötter, Tautphäus, et al., 2014, 2016). JCS, a female, was at the time of the focus groups a psychology student enrolled in the bachelor program of Lübeck University. She was new to facilitating focus groups, but received a detailed briefing from TK. NJP is a female trained psychologist with comprehensive experience in conducting focus groups and interviews, as well as qualitative data analysis (Pohontsch et al., 2017; Pohontsch, Hansen, et al., 2018; Pohontsch, Stark, et al., 2018; Pohontsch, Zimmermann, et al., 2018), holds a PhD degree and works as a postdoctoral researcher.

2.3.3 Data handling

The qualitative data was managed using QCAmap (http://www.qcamap.org/), an open access web application for systematic text analysis in scientific projects based on the techniques of qualitative content analysis. All transcripts were uploaded to the QCAmap account and were accessible for both coders (TK and JCS).

2.3.4 Data analysis

TK, JCS and NJP analysed the qualitative data conjointly using structuring qualitative content analysis (Hsieh & Shannon, 2005). This systematic procedure is used to reduce large amounts of data while preserving and extracting the main content. Deductive categories were derived from the research question and the focus group topic guide. All other categories were built inductively using summarising content analysis while processing the material (Schreier, 2014). The material was read several times by both TK and JCS before coding to ensure familiarity. For coding, transcripts were broken down into fragments of analysis, which could adopt different sizes. Fragment size ranged from part of a sentence to one or more paragraphs depending on the amount of data needed to understand the content and context of the respective fragment.

JCS conducted the inductive coding and theme development process in close consultation with TK. Categories and codes were described in code memos comprising the coding rule and typical quotes. The category system was developed in several discussion rounds. In addition to TK and JCS, NJP was involved in the analysis of the qualitative data and interpretation of the results. As findings were summarised over the whole group of participants, participant checking of findings might have been more trouble than it was worth, especially with respect to the demands on the participants’ time and their ability to abstract from their own accounts to findings for the whole group. We therefore chose to ensure intersubjective reproducibility and comprehensibility by discussing the results within our interdisciplinary workgroup.

2.4 Reporting

This report was written taking into account the STrengthening the Reporting of OBservational studies in Epidemiology (STROBE) criteria (von Elm et al., 2008), the Consolidated criteria for reporting qualitative research (COREQ; Tong et al., 2007) and the Good Reporting of A Mixed Methods Study (GRAMMS) guideline (O’Cathain et al., 2008). Quotations used in this article were translated by TK and double-checked by JCS. Quotations are marked with quotation marks. […] marks omissions or amendments in a quotation.

3 Results

3.1 Quantitative study

3.1.1 Participants

After the exclusion of incomplete data sets, a total of 214 empathy self-assessments could be matched to external assessments. Fourteen applicants either did not give written informed consent (participation was voluntary, reasons for non-participation were not collected) or did not fill out the questionnaire completely (response rate: 94%). Of the included individuals, 73% were female. The mean age was 20.7 years (SD = 2.2). The mean pu-GPA was 1.5 (SD = 0.24). There was no missing data for these variables.

3.1.2 Outcomes

The mean empathy score (self-assessment) was 48.12 (SD = 5.33). We observed no statistically significant differences between male and female applicants (t(212) = 1.71, p = 0.09). The mean empathy score (external assessment) was 3.80 (SD = 0.86). Female applicants (M = 3.88) scored significantly higher when compared to male applicants (M = 3.55; t(212) = 2.55, p < 0.01; r = 0.19).

3.1.3 Correlations

We observed no significant association between the self- and external assessments of empathy (ρ(212) = −0.031, p > 0.05). The external assessment of empathy did not correlate with any of the subscales of the self-assessment instrument (IRI). The external assessment of empathy showed, however, a strong significant correlation with the rating of overall impression (ρ(212) = 0.697, p < 0.01).

3.2 Qualitative study

3.2.1 Participants

We conducted six focus groups with two to five selection panel members each (overall n = 19, n = 10 students and n = 9 faculty members; see Table 2 for participant characteristics). The main reason for non-participation was scheduling constraints. The mean duration of the focus groups was 58 min (range: 54–62 min).

Table 2 Characteristics of focus group participants

3.2.2 Main categories

Considering the subjective views of the questioned selection panel members, three main categories, ‘concept of empathy’, ‘distinction from sympathy’ and ‘learnability’, could be identified. Considering the assessment of empathy, two main categories, ‘basis of assessment’ and ‘usefulness’, could be identified (see Table 3 for an overview of categories).

Table 3 Coding tree—main and subcategories

3.2.3 Concept of empathy

Overall, the concept of empathy was found to be very heterogeneous between the interviewed selection panel members. Doctors provided more examples from their day-to-day life when compared to students, who more often referred to communicative aspects of empathy.

Five subcategories (see Table 3) were identified when looking at the concept of empathy of the interviewed selection panel members: sensitivity, being able to put oneself in somebody’s position, helpfulness, individualised communication and active action.

Sensitivity was mentioned as one of the basic mechanisms of empathy. Resonating with your counterpart and being able to feel someone’s feelings were mentioned as key aspects of empathy. However, there was disagreement as to whether and how to differentiate empathy from compassion.

‘I would describe empathy basically as sensitivity and have the opinion, that it is essentially the ability to perceive the emotions of the vis-à-vis’. (Faculty member)

‘Empathy doesn’t mean to live through the emotions’. (Student)

‘Sympathise, but not actively commiserate, that is empathy for me’. (Faculty member)

‘And then empathise, commiserate, resonate with the counterpart. I can’t differentiate compassion so clearly’. (Faculty member)

The most important aspect of empathy for the selection panel members was the ability to put oneself in somebody’s position, to be aware of someone else’s feelings, mood and need for help. Adequate assessment of someone’s position and neediness was also relevant.

‘Empathy is the ability to put oneself into the emotional state of the other person’. (Faculty member)

‘For me, compassion means being able to put oneself in the position of one's fellow human being, as if one were also affected, whereby this ‘as if’ is very important’. (Faculty member)

‘It is about understanding the other who has problems or open questions or conflicts. That one correctly takes in what has been said’. (Faculty member)

‘Not only have compassion but somehow understand it properly’. (Student)

‘Empathy also means being able to recognise the mood or the need for help of others’. (Student)

The concept of individualised communication includes the ability to adapt one’s behaviour and communication style to the counterpart’s needs, showing understanding and keeping communication appreciative and on equal terms. Being authentic, showing social competencies and appropriate physical closeness (touching) were also mentioned.

‘That you can recognise something and also respond to it without being told directly’. (Student)

‘Essentially, an empathetic person is characterised by the feeling that he truly understands what I mean’. (Faculty member)

‘Being empathetic means showing the other person that you understand and accept him or her’. (Student)

‘Empathetic can also have a tactile form. It can also mean to give someone a hug’. (Faculty member)

Helpfulness or the motivation to help others was seen as another basic component of empathy. The motivation to invest time and energy is a prerequisite to be able to be empathetic.

‘I have to…pull myself together, and that really costs energy, to be adequately empathetic’. (Faculty member)

Selection panel members emphasised that empathy does not end with (non-) verbal communication, and it also includes acting upon the information gained and active conveying of safety.

‘Empathy is not only…you feel bad, I feel with you now, but it is also acting’. (Faculty member)

3.2.4 Distinction from other concepts

Two subcategories (see Table 3) were identified when exploring the distinction of empathy from other concepts, in this case sympathy. Most of the interviewed selection panel members could not define a clear demarcation between empathy and sympathy/overall impression, but instead named many differences. One student, for example, referred to the learnability:

‘It [sympathy], in contrast to empathy, isn’t learnable’. (Student)

Others saw the difference in the fact that sympathy just happens, whereas empathy is an active process. In the assessment process itself, the interviewed selection panel members had difficulties in differentiating between empathy and sympathy/overall impression, as well:

‘I actually think that for the question whether I found someone to be empathic, sympathy had an influence. I would thus not say that I was able to differentiate this’. (Student)

‘I did not assess someone as overall good but not empathic at the same time. That did not occur. Either she or he was good, then they were also empathic, or not’. (Student)

3.2.5 Learnability

Three subcategories (see Table 3) were identified when asking the interviewed selection panel members about the learnability of empathy during medical education.

The interviewed selection panel members saw empathy as learnable only to a limited extent during medical education:

‘My observation is that at the age of 15 to 16 years, certain personality traits [like empathy] are already fixed’. (Faculty member)

‘Empathy: you have it or you don’t, that is to some extent like character. Whether you can teach it…I would say, per se, that is difficult’. (Faculty member)

On the other hand, the interviewed selection panel members noticed that there are factors, such as time pressure and stress, that could have an influence on empathy:

‘We can reduce empathy superbly [through stress] in the preclinical phase’. (Faculty member)

3.2.6 Basis of assessment

Five subcategories (see Table 3) were identified when exploring what the interviewed selection panel members based their empathy assessment on.

The assessment seldom was based on specific questions:

‘I don’t think that one can assess empathy based on a question’. (Faculty member)

For the selection panel members, an authentic presentation that contained both good non-verbal and verbal communication was paramount. Hence, they judged gestures, facial expressions and how the applicants reacted to the selection panel:

‘Gestures and facial expression meant a lot to me’. (Student)

The students found it especially difficult to put into words what their empathy assessment was based on. Seven of 10 students referred to a ‘gut feeling’ in this context.

Doctors, in contrast, based their assessment on stories told by the applicants:

‘And whether someone is able to be empathic, I try to find out by his stories’. (Faculty member)

3.2.7 Usefulness

Two subcategories (see Table 3) were identified when letting the interviewed selection panel members evaluate the usefulness of empathy assessment during the selection process.

Most of the interviewees judged empathy assessment as part of the selection process as useful:

‘I find that significantly more useful to include when compared to asking how the applicant differentiates science from being a doctor’. (Faculty member)

However, they called for tools for empathy assessment:

‘I think that [empathy] is an important quality, but one has to find a tool [for the assessment]’. (Student)

3.2.8 Comparison of the empathy concepts

When comparing the IRI with the empathy concepts of the selection panel members, we found that the two IRI subscales ‘Perspective taking’ (ability to adopt another’s point of view) and ‘Empathic concern’ (ability for other-oriented emotions) found their equivalents in the answers of the interviewees. However, both the facets ‘Personal distress’ (self-oriented emotions that may occur in close or problematic interpersonal interactions) and ‘Fantasy’ (ability to empathise with fictional characters), as the two other subscales, were not present in the empathy concepts of the panel members. The empathy concepts of panel members more closely resembled the empathy concept of Mercer and Reynolds (Mercer & Reynolds, 2002) that is used in the medical context. Figure 1 gives an overview how the selection panel members’ definitions of empathy correspond to the concepts proposed by Mercer and Reynolds (Mercer & Reynolds, 2002).

Fig. 1 Participants’ concept of empathy in light of the concept proposed by Mercer & Reynolds (2002)

4 Discussion

In our mixed-methods, cross-sectional study investigating empathy as a selection criterion for medical school admission, we found no correlation between self- and external assessments of empathy. According to our data, the lack of a common concept of empathy among the selection panel members and discrepancies from the concept underlying the IRI seem to be the main reasons for this finding.

The difference between self- and external assessments of empathy has, to our knowledge, not yet been the subject of scientific studies. In more general papers on this topic (Bogner & Landrock, 2016; Krüger, 1980; Mummendey & Grau, 2014) and a paper describing the comparison of an interpersonal skills appraisal and JSPE scores in the context of medical school admission (O’Sullivan et al., 2017), authors concluded that social desirability may be responsible for such differences. However, social desirability is not limited to self-assessment contexts. Self-assessment of empathy might also be biased by the Dunning-Kruger effect, which leads inexperienced individuals to overestimate their abilities (here their empathy), or by the imposter syndrome, which leads individuals to underestimate their abilities (Kruger & Dunning, 1999; Langford & Clance, 1993). Another effect that could have contributed to the lack of correlation between self- and external assessments is the overall very good assessment of empathy by the selection panel members (leniency error) (Hui & Triandis, 1985). An even greater influence may be the halo effect (Thorndike, 1920): in the absence of a concept or rationale for the assessment of empathy, other attributes of the applicants, such as knowledge, communication skills, attractiveness or overall impression, may have played a role in it.

The results of the qualitative part of our study indicate that the lack of a (common) concept of empathy and its appraisal among the selection panel members may have been the main reason for the missing correlation between self- and external assessments. However, selection panel members judged empathy as an important concept in the context of selection for medical education. Most of them already had a subjective concept of empathy. The concepts of the different panel members differed, though. In addition, our study showed that the panel members had difficulties in translating their empathy concept into an assessment and differentiating empathy from other competencies. In light of the fact that no consistent concept of empathy exists even among experts and even those have difficulties differentiating between empathy, sympathy and other communication skills (Dohrenwend, 2018), this is rather unsurprising.

We found that the empathy concepts of the panel members only covered two of the IRI subscales (i.e. ‘Perspective taking’ [ability to adopt another’s point of view] and ‘Empathic concern’ [ability for other-oriented emotions]). Yamada and colleagues argued that the other two subscales (‘Fantasy’ and ‘Personal distress’) might even be capturing sympathy rather than empathy (Yamada et al., 2018), which may explain why these facets were not covered by the empathy concepts of the interviewees. Even with ‘Perspective taking’ and ‘Empathic concern’ being present in the empathy concepts of most panel members, we found no correlation between self- and overall external assessments.

As mentioned in Sect. 3, the empathy concepts of panel members more closely resembled the empathy concept of Mercer and Reynolds (Mercer & Reynolds, 2002) than that underlying the IRI. In their conclusions, Mercer and Reynolds describe empathy as a complex, multidimensional concept involving the abilities to understand the patient’s situation, perspective and feelings; to communicate that understanding and check its accuracy; and to act on that understanding in a helpful way. This description resembles the results from our focus group interviews on the concept of empathy more closely when compared to Davis’ concept (Davis, 1983). However, there is no validated instrument for the assessment of empathy developed on the basis of the concept of Mercer and Reynolds.

One major limitation of our study is the participation rate among the panel members. We cannot rule out selection bias to some extent. But with time constraints being the most often mentioned reason for non-participation, we have no reason to assume systematic bias in our data. The questioned panel members were a good mix of female and male participants and of students and faculty members, meaning that variables known to influence conceptualisation of empathy and experience were adequately represented in our participants. The variation in the accounts of our focus group participants suggests that we reached maximum variation in sampling and data. Even though we only investigated panel members from a single institution, we deem our participants not to be extremely different from panel members in other institutions and believe our results to be cautiously transferable to other faculties in Germany. Furthermore, the panel members interviewed were not the same as those who assessed empathy 1 year earlier in the context of the quantitative study. However, the panel members were recruited from the same population and had the same prerequisites in both years.

Using the IRI might be seen as a limitation in terms of generalisability, since the JSPE is more common in the medical context (O’Tuathaigh et al., 2019). However, the wording of the JSPE makes it easy to guess what it measures, which makes it particularly vulnerable to social desirability effects. This is why we decided to use a less obvious measure of empathy in the selection situation (in which social desirability effects may play an even bigger role), even though this measure is not tailored to the medical context.

Usually, focus groups should include at least 3–5 participants, going up to 12 participants or more depending on the literature (Krueger & Casey, 2014). We aimed to conduct the focus group sessions as close in time as possible to the selection interviews to ensure that focus group participants adequately remembered their selection and decision processes. Due to the time constraints of many selection panel members, we decided to conduct some sessions with even fewer participants in order to include as many selection panel members as possible.

Mixing methods was a strength of our study. Our sequential explanatory design started with quantitative data collection and analysis (applicants’ self-judgement and selection panel members’ judgement of applicants’ empathy), which was then followed by focus groups with selection panel members on empathy and the selection process to further explore and explain the quantitative findings. The emphasis was thus on the quantitative phase of the study (QUAN → qual) (Mayring, 2000; O’Cathain et al., 2008). We chose a mixed-methods approach because quantitative data alone cannot explain why self- and external judgements do or do not correlate with each other, or how external judgement takes place. A qualitative exploration of these processes was therefore needed to complement, illustrate and understand the background of the quantitative results, which in turn provided the basis for the development of the qualitative questioning routes. The qualitative data offer explanations for the rather surprising non-correlation of two measures which, as our research revealed, seem to measure not the same but very different constructs.

In light of our results and the literature, it seems premature to introduce empathy as a selection criterion in the selection procedures for medical education in Germany. Neither self-assessment nor external assessment of empathy can be judged an invalid method for student admission based on our results. However, before either of these is implemented broadly, further insights have to be acquired. As a next step, further research, including prospective, longitudinal studies, is needed to determine which empathy instrument is most suitable for use in this context and whether self- or external assessment leads to better results (i.e. better students and doctors).

While empathy is a possible addition to the established selection criteria for medical education in Germany, its external assessment should not be employed without training panel members on the basis of an established theoretical concept of empathy and an objective self-assessment measure, in order to ensure a common understanding of empathy.