Atrial fibrillation (AF) is the most common cardiac arrhythmia worldwide [1]. AF is a chronic and progressive disease characterized by ineffective atrial contraction. Age is an important risk factor for AF, and improvements in life expectancy have increased the AF incidence rate by 31% over the past 20 years [2]. Global AF prevalence will continue to increase if life expectancy continues to improve. As a result of its increasing prevalence, AF presents a significant economic burden to healthcare systems [3].

Patients with AF experience symptoms such as palpitations, chest tightness, fatigue, shortness of breath, and dizziness, all of which limit the ability to perform daily activities [1, 4]. Other AF-related outcomes include increased stroke risk, heart failure, depression, and impaired quality of life (QoL) [1]. QoL has been found to be lower in AF patients than in both healthy individuals and patients with other cardiovascular diseases [5, 6]. Current AF management, including rate and rhythm control, aims to improve QoL through symptom alleviation. Treatment options for rhythm control include cardioversion, anti-arrhythmic drugs (AADs), and catheter ablation [1]. AAD therapy is well established for AF patients; however, catheter ablation has seen significant technological advancements and therefore represents a large proportion of the clinical studies [7].

Although the main objective of such therapies is QoL improvement, there is limited understanding of how to measure patient QoL in clinical practice. In the context of AF, studies have found diagnostic ECG results to have a weak and inconsistent relationship with QoL impairment, suggesting that QoL outcomes in AF patients are not determined by clinical indicators alone [11]. QoL instruments can be generic or condition-specific. Generic instruments are designed to be used across a wide range of conditions [8]. Condition-specific instruments are designed for a specific population, such as individuals with AF. Although not as generalizable as generic instruments, condition-specific instruments have greater sensitivity to detect small changes in QoL and may be less affected by comorbidities unrelated to study interventions [9].

The primary objective of this study is to review the measurement properties of the most frequently used AF-specific health-related quality of life (QoL) instruments through a systematic review of the studies that designed and validated these instruments. This review used the COSMIN methodology guidelines for systematic reviews of patient-reported outcome measures, which include rigorous assessment of validity, reliability, and instrument responsiveness.


Literature mapping exercise

Literature mapping exercises are used to characterize a large body of literature, with the aim of guiding decisions about more focused analyses [10]. A literature mapping exercise was completed prior to our systematic review to identify the most common AF QoL instruments used across articles assessing catheter ablation for AF. This study uses "quality of life instruments" as an overarching term for any patient-reported outcome measure (PROM) that assesses a person’s health-related well-being, functional status, and disease-related symptoms. Because previous reviews [11,12,13] have identified over 40 distinct AF QoL instruments, this exercise helped narrow the scope of the systematic review.

A literature search was performed in Embase for articles published from 2016 to 2021, using the keywords “atrial fibrillation”, “quality of life”, and “catheter ablation”. The full search strategy can be found in Additional file 1: Appendix A. A five-year time frame was used to assess whether the AF QoL instruments used across studies have changed since the review of AF QoL instruments by Kotecha et al. [13] was published in 2016. Catheter ablation was added as a keyword to restrict the type of intervention, given the breadth of studies on ablation arising from continual technological advancements and its primary objective of improving patients’ QoL.

Article titles and abstracts were screened by two reviewers. Studies including participants with heart failure (HF) or other arrhythmias were included if the population studied also had AF. HF was included because HF and AF coexist and share a bidirectional relationship [14]. AF occurs in more than half of individuals with HF and HF occurs in more than one third of individuals with AF [15]. Studies were included if they measured QoL outcomes and reported the AF QoL instruments used. Editorials, clinical guidelines, and abstract posters were excluded. AF QoL instruments cited more than once were included in the systematic review search strategy.

Search strategy and selection criteria

To review the measurement properties of the AF-specific quality of life (QoL) instruments identified from the literature mapping exercise outlined above, a systematic review was undertaken to assess articles that validated or designed these instruments.

A systematic search was performed in Ovid MEDLINE, Ovid Embase, Ovid PsycINFO, EBSCO CINAHL, and Cochrane CENTRAL on July 30, 2021, for studies published in any language from inception to July 2021. The search strategy, which consisted of filters for AF, QoL instruments, and measurement properties, was derived from the COSMIN guidelines [16]. The COSMIN search strategies were translated for use in the Ovid, EBSCO, and Cochrane interfaces. A copy of the full search strategy for MEDLINE can be found in Additional file 1: Appendix B.

We included all studies with a population of AF patients that appraised the measurement properties of one of the AF QoL instruments identified from the mapping exercise above. Only full text articles were included. Two reviewers independently screened titles and abstracts and conflicts were resolved through discussion between the two reviewers. The same process was repeated for full text eligibility screening.

Data extraction

Data items included article title, author, year of publication, country, population characteristics, AF QoL instrument characteristics, measurement properties, and information on instrument interpretability and feasibility. Data extraction for included studies was completed by one reviewer. The second reviewer confirmed the entries by comparing the completed data extraction table with full text articles.

Quality assessment and risk of bias

The COSMIN Risk of Bias checklist [17] was used to assess the methodological quality and risk of bias in each single study. The studies were rated as very good, adequate, doubtful, or inadequate quality according to the Risk of Bias checklist. Then, the results of each single study were rated according to the COSMIN criteria for good measurement properties, which are summarized in Table 1. The results were rated sufficient, insufficient, or indeterminate. Once each study was rated according to the COSMIN Risk of Bias checklist and the criteria of good measurement properties, the results from all included studies on an instrument were pooled together. The overall rating was then compared against the criteria of good measurement properties to arrive at a final rating of sufficient, insufficient, or indeterminate. Each criterion per measurement property was scored through a quality assessment to determine the quality score for each study. Quality assessment and risk of bias were completed independently by each reviewer with conflicts being resolved through discussion between the two reviewers.
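As an illustration of the first rating step, the sketch below assumes the "worst score counts" principle of the COSMIN Risk of Bias checklist, under which the overall methodological quality of a study on a given measurement property is the lowest rating given to any checklist item. The text above does not spell out this step, and the function name and example ratings are hypothetical, so this should be read as a minimal sketch rather than the procedure the reviewers implemented.

```python
# Ordered from worst to best, following the four COSMIN Risk of Bias levels.
QUALITY_LEVELS = ["inadequate", "doubtful", "adequate", "very good"]


def overall_quality(item_ratings):
    """Return the lowest rating among the checklist items (worst score counts)."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    return min(item_ratings, key=QUALITY_LEVELS.index)


# Hypothetical item ratings for one measurement property in one study.
example_items = ["very good", "adequate", "doubtful", "very good"]
print(overall_quality(example_items))  # doubtful
```

Under this assumption, a single doubtful item is enough to lower the overall rating of an otherwise well-conducted study.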

Table 1 Definition and criteria of good measurement properties, as defined by the COSMIN guidelines [18]


Literature mapping exercise

The literature mapping exercise identified 106 unique articles including 45 QoL instruments. Thirteen of these instruments were considered AF-specific and are listed in Fig. 1. The EHRA and CCS-SAF scales were excluded from the systematic review because they are caregiver-evaluated scores rather than patient-reported. After this exclusion and the exclusion of instruments that only appeared in a single study, six instruments were included in the systematic review: AFEQT, AFSS, MLHF-Q, ASTA, AFQLQ, and SCL. A list of all 45 instruments identified from the mapping exercise can be found in Additional file 1: Appendix C.

Fig. 1 Frequency of use of AF QoL instruments across AF ablation studies

Systematic review: study selection

The six AF QoL instruments identified from the literature mapping exercise were included and evaluated in our systematic review (AFEQT, AFSS, MLHF-Q, ASTA, AFQLQ, and SCL). The search results are outlined in the PRISMA flowchart in Fig. 2. After the removal of duplicates and the screening of full text articles, 16 studies were included in the review. One study was found from the reference list of an included article. All other articles were identified in the initial literature search.

Fig. 2 PRISMA flow diagram

Characteristics of included studies

Table 2 provides a summary of the study characteristics of the included studies [19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35]. Table 3 summarizes the instrument characteristics for the six included QoL instruments.

Table 2 Study characteristics of the included studies
Table 3 Characteristics of included QoL instruments

Overview of QoL instruments

Outlined below is a concise overview of the results from our systematic review for each AF QoL instrument. The COSMIN risk of bias checklist and COSMIN criteria for good measurement properties were used to assess the study methodology and measurement property quality [17, 18]. The risk of bias and measurement property ratings can be found in Additional file 1: Appendix D. The data extraction table of measurement properties can be found in Additional file 1: Appendix E.


With six studies, AFEQT was the most validated QoL instrument in this systematic review. Three studies validated the translation of AFEQT into another language [23, 25, 26]. AFEQT also had the most comprehensive evaluation of interpretability [22, 24]. Dorian et al. [22] found a 19-point change in AFEQT score to represent a moderate improvement in QoL, whereas Holmes et al. [24] found a 5-point change in AFEQT score to represent a clinically important difference. Spertus et al. [21] found that the question about sexual relationships had a disproportionately high missing response rate (15%). In comparison to generic instruments (EQ-5D and SF-36), AFEQT was found to show greater effect sizes for responsiveness, comparable to other condition-specific instruments like SCL and AFSS [21]. Many of the AFEQT measurement properties were indeterminate, either because correlation coefficients did not meet the COSMIN criteria or because statistical methods were not reported. The methodological quality of its content validity is unknown because it is unclear whether professionals were asked about AFEQT’s relevance or comprehensiveness during the development process.


AFSS was evaluated in two studies [27, 28]. AFSS is considered a symptom scale rather than an overall QoL instrument because it only includes AF burden, symptoms, and healthcare utilization domains. The average completion time was less than 5 min [28]. AFSS is the only instrument to have insufficient internal consistency. Cronbach’s alpha for the healthcare utilization domain was less than 0.70, indicating that it is not closely related to the symptom and burden domains. Furthermore, hypothesis testing was also insufficient because it did not meet the correlation coefficient criteria. Test–retest reliability was found to be superior in AFSS, compared to all other QoL instruments [27]. Missing response rates varied from 0 to 7% across all items [28].
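For readers less familiar with internal consistency, the following minimal sketch shows how Cronbach’s alpha can be computed and compared against the 0.70 threshold mentioned above. The function name and the response data are hypothetical and are not taken from the AFSS studies.

```python
from statistics import variance


def cronbach_alpha(responses):
    """responses: list of respondents, each a list of item scores (same length)."""
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_variances = [
        variance([row[i] for row in responses]) for i in range(n_items)
    ]
    # Sample variance of the respondents' total scores.
    total_variance = variance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: four respondents answering a three-item domain.
scores = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [5, 5, 4]]
alpha = cronbach_alpha(scores)
print(round(alpha, 2), "sufficient" if alpha >= 0.70 else "insufficient")
```

A domain whose items vary largely independently of one another yields a small alpha, which is the pattern reported for the healthcare utilization domain.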


The SCL was evaluated in three studies [20, 29, 30]. Developed in 1996, it is the oldest instrument included in this study. Like the AFSS, the SCL is a symptom scale, and it is used across all arrhythmias, not just AF. Although the SCL has been used for over two decades, the original English version has not been extensively validated. There are no published studies on the PROM development process or content validity. Structural validity was rated indeterminate because the required statistical measures were not reported. In terms of interpretability, response rates ranged from 84 to 94% [30].


AFQLQ was evaluated in only one study [31], and its origins are unclear. It is likely to have originated in Brazil, with translations mostly being done in languages other than English. The methodological quality of the reliability assessment is unclear because statistical methods were not reported. AFQLQ is the longest condition-specific instrument in this review with 30 questions, raising concerns about feasibility and completion time. Interpretability and feasibility characteristics were not reported.


ASTA was evaluated in three studies [32,33,34], two of which validated Polish and Portuguese translations. ASTA is the newest instrument included in this study and can be used in all types of arrhythmias. The content validity is unclear because it is not reported whether patients were asked about the relevance of each individual item in the instrument. There was also uncertainty in hypothesis testing because correlation coefficients did not meet the COSMIN criteria. Otherwise, ASTA has relatively good measurement properties. The completion time was described as “a few minutes” [32]. Overall, 46% of participants responded with “I don’t know” to at least one item, which increased the number of missing responses [32].


MLHF-Q was evaluated in only one study [35]; however, it has been validated in other populations outside of AF [36,37,38]. The justification for including MLHF-Q despite it being developed for HF was provided in the “Methods” section. Test–retest reliability was rated doubtful because the time interval between the two tests was 3 months, which is long for a reliability measurement. Structural validity was inadequate because the factor analysis sample size was too small. Hypothesis testing was rated insufficient because the correlation coefficients did not meet the COSMIN criteria. Interpretability was assessed through floor and ceiling effects, which found a moderate floor effect in 11.4–13.6% of scores.
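Floor and ceiling effects are usually expressed as the percentage of respondents who obtain the lowest or highest possible score. The sketch below illustrates that calculation on hypothetical scores from a 0 to 100 scale; the function name and data are illustrative assumptions. A cutoff of 15% is often used to flag a floor or ceiling effect, but that threshold is an assumption here and is not reported in the reviewed study.

```python
def floor_ceiling(scores, min_score, max_score):
    """Return the percentage of scores at the lowest and highest possible values."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_pct = 100 * sum(s == min_score for s in scores) / n
    ceiling_pct = 100 * sum(s == max_score for s in scores) / n
    return floor_pct, ceiling_pct


# Hypothetical total scores on a 0-100 scale; not study data.
example = [0, 0, 12, 35, 48, 60, 77, 90, 100, 20]
floor_pct, ceiling_pct = floor_ceiling(example, 0, 100)
print(floor_pct, ceiling_pct)  # 20.0 10.0
```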

Synthesis of results

Results from the risk of bias and measurement property appraisals are qualitatively summarized in Fig. 3. Though not considered to be measurement properties, columns for interpretability and feasibility were added because they are important characteristics of QoL instruments [18]. Evidence was synthesized using the COSMIN guidelines to produce a final rating for each instrument. This final rating represents the psychometric quality of each instrument, taking into consideration the quality of measurement properties, study methodology, and risk of bias.

Fig. 3 Synthesis of results

Our systematic review showed that none of the 16 studies evaluated cross-cultural validity or measurement error. While multiple studies validated the translation of AF-related QoL instruments [25, 26, 28, 30, 33, 34], they did not complete cross-cultural validity tests as recommended by the COSMIN guidelines, which require the inclusion of two subgroups for analysis [18]. Internal consistency was the only measurement property evaluated across all studies and had the highest quality results (Fig. 3). PROM development and content validity were only assessed by the articles that originally developed the AF-related QoL instrument [21, 32]. Overall, ASTA and AFEQT were the two strongest performing AF-related QoL instruments in terms of measurement properties and study methodology, with sufficient ratings in both instrument development and internal consistency.


This comprehensive systematic review identified six AF-related QoL instruments and evaluated their measurement properties using the COSMIN guidelines. This highlights that, within the field of AF ablation clinical research, QoL instruments are constantly evolving and new instruments are still being developed. The results of the literature mapping exercise are aligned with a previous study examining AF QoL instrument frequency [39]. Even though the study by Coyne et al. [39] was published in 2005, SF-36 was also the most common QoL instrument in that study, and the SCL, NYHA score, and MLHF-Q appeared in both that study and this study’s mapping exercise [40].

Although this review identified different QoL instruments, its findings are consistent with previous reviews [11,12,13]. Measurement error, cross-cultural validation, and responsiveness studies are still the most deficient areas of research, and AFEQT is still the strongest rated AF QoL instrument. Moreover, even though an increasing number of psychometric studies on AF QoL instruments have been published in recent years, none of the available instruments has been fully validated across all measurement properties.

Responsiveness represents the ability of a QoL instrument to detect changes over time. A more responsive instrument will be able to detect smaller changes pre- and post-intervention. Interpretability is the degree to which one can assign meaning to a QoL instrument score or a change in scores. Measures like clinically important difference (CID) or minimally important change (MIC) are important for interpreting whether a change in score has a meaningful or significant impact on a patient’s QoL.
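Two indices commonly used to quantify responsiveness are the effect size (mean change divided by the standard deviation of baseline scores) and the standardized response mean (mean change divided by the standard deviation of the change scores). The sketch below computes both for paired pre- and post-intervention scores. The function name and data are hypothetical, and the included studies may have used other statistics.

```python
from statistics import mean, stdev


def responsiveness_indices(baseline, follow_up):
    """Return (effect size, standardized response mean) for paired scores."""
    if len(baseline) != len(follow_up) or len(baseline) < 2:
        raise ValueError("need paired scores from at least two patients")
    changes = [after - before for before, after in zip(baseline, follow_up)]
    mean_change = mean(changes)
    # Effect size: mean change relative to the spread of baseline scores.
    effect_size = mean_change / stdev(baseline)
    # Standardized response mean: mean change relative to the spread of changes.
    srm = mean_change / stdev(changes)
    return effect_size, srm


# Hypothetical QoL scores before and after an intervention; not study data.
before = [40, 55, 60, 45, 50]
after = [60, 70, 65, 70, 65]
es, srm = responsiveness_indices(before, after)
print(round(es, 2), round(srm, 2))
```

Larger values of either index for the same intervention suggest that an instrument is more sensitive to change.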

Interpretability is rarely measured for QoL instruments and, even when measured, studies often produce conflicting results. In this systematic review, two studies assessed the interpretability of AFEQT. Dorian et al. [22] found a 19-point change in AFEQT score to represent a moderate improvement in QoL, whereas Holmes et al. [24] found a 5-point change in AFEQT score to represent a CID. The large difference in scores could be attributed to the study authors using different anchors. For example, Dorian et al. used patient and physician assessments of QoL change and Holmes et al. used the EHRA score. Additionally, in a study calculating MIC using five different statistical methods, the five methods produced five different MIC values [44]. Our review indicates that further research is required in this area of psychometrics to standardize the statistical methods used to assess interpretability.

Hypothesis testing refers to the degree to which instruments expected to be similar are in fact similar, or the degree to which instruments expected to be dissimilar are indeed dissimilar. Hypothesis testing is typically completed by calculating the correlation of scores from two presumably similar or dissimilar instruments. The hypothesis testing rating was determined by comparing the study results to the authors’ pre-determined hypothesis. Figure 3 illustrates that nearly every AF QoL instrument in this review was rated poorly for hypothesis testing because the correlation coefficients were not strong enough to meet the COSMIN criteria. This is anticipated considering that most hypothesis tests were completed with a condition-specific instrument and a generic instrument (e.g. correlation between AFSS and SF-36 scores) [41]. Since generic instruments are less sensitive to AF-related QoL compared to AF-specific instruments, weak correlations are to be expected [9].
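The comparison described above can be sketched as a correlation between two sets of instrument scores, checked against a pre-specified hypothesis. In the example below, the minimum expected correlation of 0.50 and the rule that at least 75% of results must agree with their hypotheses for a sufficient rating are stated as assumptions based on commonly cited COSMIN guidance; the function names and scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two paired lists with at least two values")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def hypothesis_met(r, expected_min=0.50):
    """True when the absolute correlation reaches the hypothesized minimum."""
    return abs(r) >= expected_min


def rate_hypotheses(outcomes, required_share=0.75):
    """Rate a set of hypothesis outcomes (True/False) as sufficient or not."""
    share = sum(outcomes) / len(outcomes)
    return "sufficient" if share >= required_share else "insufficient"


# Hypothetical scores from a condition-specific and a generic instrument.
specific = [10, 14, 8, 20, 16]
generic = [70, 60, 75, 40, 55]
r = pearson_r(specific, generic)
print(round(r, 2), hypothesis_met(r))
print(rate_hypotheses([True, False, False, True]))  # insufficient
```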

The study that provided the best hypothesis testing result was Cannavan et al. [34]. This study differed from the others because correlations were calculated for ASTA and AFQLQ scores, two AF/arrhythmia-specific instruments, which yielded very strong correlations. Furthermore, Cannavan et al. [34] used Portuguese versions of ASTA and AFQLQ, with Portuguese being the original language of the AFQLQ. Results from Cannavan et al. [34] raise the question of whether there should be a universal AF-specific QoL instrument with translations or whether each country or geographic region should develop its own instrument that is best suited to the setting. This could not be answered in this review because none of the included studies evaluated cross-cultural validity. In addition to responsiveness and interpretability, cross-cultural validation is another area that requires future research.

The findings of this review should be interpreted in light of some limitations. Firstly, only condition-specific AF-related QoL instruments were reviewed, and generic QoL instruments were out of scope [41, 43]. This is a limitation because the review does not include all possible QoL instruments that could be used in an AF patient population. There are also some limitations to the use of the COSMIN guidelines. The COSMIN guidelines set very high standards for what constitutes a good measurement property and good study methodology. To have good ratings across all measurement properties, many psychometric studies must be performed and published, which may not be feasible for every existing instrument. This suggests that there may be very well-developed instruments that exist but have yet to be psychometrically validated. While the COSMIN criteria for good measurement properties may be difficult to fulfil, they provide a benchmark for the development of new or updated instruments.

With the growing prevalence of AF, QoL assessment is crucial. As outlined above, this study identified four additional AF QoL instruments, expanding on the 2016 review by Kotecha et al. [13]. In addition, over 34 different QoL instruments have been used across published studies [10]. This emphasizes the lack of consensus on the most appropriate AF QoL instrument to use with patients. Our systematic review using COSMIN methodology suggests that more robust validation is required. It should be noted that measurement properties are only one determinant of the applicability of AF QoL instruments. Ease of use, administration, time taken, recall period, and patient satisfaction are also important variables.


This review identified the six most frequently used AF-specific QoL instruments across AF ablation studies. Using the systematic COSMIN methodology, we reviewed the measurement properties of these six AF QoL instruments by evaluating the studies that designed and validated them. We identified ASTA and AFEQT as the best validated instruments. However, further research is needed in the areas of cross-cultural validation, measurement error, responsiveness, and interpretability across all six instruments.