Depression frequently occurs in patients with cancer1,2. Even mild levels of depression reportedly decrease the quality-adjusted life-year score3. Furthermore, patients with cancer have a higher risk of suicide than the general population4,5,6. Several interventions, such as specialized palliative care, can reduce psychological symptoms7,8, thus, requiring accurate scales for measuring depression. Clinicians commonly evaluate the condition through self-administered tools, such as the Patient Health Questionnaire-9 (PHQ-9)9,10 and the Hospital Anxiety and Depression Scale (HADS)11, both of which are based on classical test theory. These have several disadvantages, including sample dependency, inability to replace or add items, and requirement for participants to answer all items regardless of severity. Scales developed based on item response theory (IRT) and computer adaptive testing (CAT) can address these limitations12. This is because IRT-based scales can reveal sample-independent subject traits, and CAT can optimize the way items are presented and reduce the associated burdens placed on patients. Several studies have applied CAT-based depression scales, which were developed in the Patient-Reported Outcomes Measurement Information System (PROMIS) project, to patients with cancer13,14,15,16. However, the usefulness of CAT-based scales for measuring depression in patients with cancer has not yet been fully explored. Thus, this study aimed to develop a CAT-based scale for measuring depression in patients with cancer.


Ethical approval

All participants provided written informed consent. The institutional review board of the National Cancer Center Hospital (approval number: 2010-202) and all the participating sites approved the study. This study was in accordance with the ethical standards of the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Study design and participants

This multicenter prospective study was conducted at four tertiary centers in Japan (National Cancer Center Hospital, National Cancer Center Hospital East, Okayama University Hospital, and the University of Tokyo Hospital) between May 2011 and December 2012. The study included patients who (a) were aged ≥ 20 years, (b) had been diagnosed with any type of cancer, (c) had Eastern Cooperative Oncology Group performance status ≤ 2, and (d) were selected for or were already receiving anti-cancer treatments. Patients who (a) had received psychiatric treatments within the previous two months, or (b) were considered extremely sick to participate by their physicians-in-charge were excluded from the study. We recruited participants upon admission.

Data collection

We developed 62 items to measure depression. Table 1 shows examples of items translated from Japanese to English. Several psycho-oncologists independently drafted the items based on the diagnostic criteria and common symptoms of depression. Subsequently, they discussed and finalized these items. All items asked participants about their depressive mood over the preceding week, with each item answered on a 5-point scale (1 = none, 2 = rarely, 3 = sometimes, 4 = often, 5 = always).

Table 1 Examples of items translated from Japanese to English.

To confirm concurrent validity, participants completed the PHQ-9, which is widely used to measure depression and has been validated in patients with cancer9,10. The PHQ-9 total scores of 5–9, 10–14, 15–19, and 20–27 correspond to mild, moderate, moderately severe, and severe depression, respectively.

Overview of statistical analyses

Based on the analytic methods used in the PROMIS project17 and several studies on CAT18,19,20, we conducted the following analyses: (1) descriptive statistics, (2) evaluation of the IRT assumptions, (3) fitting a graded response model (GRM) to the data, (4) evaluation of differential item functioning (DIF), (5) CAT simulations, and (6) calibration of the PHQ-9. All analyses were conducted using the open-source R software (version 4.1.1). Statistical significance was set at P < 0.05.

Descriptive statistics

Cronbach’s alpha was used to measure internal consistency (analyzed using the R package “psych,” version 2.1.9). Items with unanswered categories were excluded because their parameters could not be estimated, and those with an item-remainder correlation < 0.3 were also excluded due to violation of internal consistency21.

Evaluation of assumptions of the IRT model

We evaluated the assumptions of IRT including unidimensionality, local independence, and monotonicity17.

We tested unidimensionality by conducting principal component analysis (PCA), confirming that the proportion of variance of the first factor was ≥ 20% and the ratio of variance of the first factor to the second factor was ≥ 417,18. We excluded items with a low contribution to the first factor to satisfy these criteria.

Subsequently, we tested local independence by conducting a one-factor confirmatory factor analysis, producing a residual correlation matrix (analyzed using the R package “lavaan,” version 0.6–9). From the pairs of items with residual correlations > 0.217,18, we excluded the item with a lower contribution to the first factor of the PCA.

Finally, we tested monotonicity by developing a nonparametric IRT model (analyzed using the R package “mokken,” version 3.0.6), and excluded items with a scalability coefficient < 0.318.

Graded response model

We fitted a GRM to the remaining items (analyzed using the R package “mirt,” version 1.35.1) to estimate discrimination and difficulty parameters for each item and latent factor θ (i.e., degree of depression) for each patient using maximum a posteriori (MAP). Subsequently, we excluded items that contained categories without maximum probability at any θ. We also examined fit statistics (S-X2) for each item, excluding those with a poor fit, as determined at an alpha level of 0.0117.

Evaluation of DIF

We evaluated DIF for age (≥ 65 or < 65) and sex (male or female) (analyzed using the “DIF” function in the R package “mirt,” version 1.35.1) and excluded items with an alpha level of 0.0117,18.

CAT simulations

Following the item selection process, we recalculated Cronbach’s alpha, redeveloped a GRM, and recalculated discrimination and difficulty parameters for each item as well as θ for each patient (θtrue).

We used the resulting items and θtrue to perform CAT simulations (analyzed using the R package “catIrt,” version 0.5–0)19. At the beginning of the simulations, the estimated latent factor (θest) was set to zero, and the minimum number of items administrated was set to three. We conducted simulations using various combinations of latent factor estimators, item selection methods, and termination criteria.

Latent factor estimators were: (a) maximum likelihood estimation (MLE), (b) Bayesian modal estimation (BME), and (c) expected a priori estimation (EAP). Item selection methods were as follows: (a) unweighted Fisher information (UW-FI), and (b) pointwise Kullback–Leibler divergence (FP-KL). Termination criteria were: (a) standard error of measurement (SEM) threshold of 0.32 or (b) that of 0.50, while the simulations were also terminated upon reaching the maximum number of items.

We calculated Pearson’s correlation coefficients (PCCs) between θest and θtrue to measure the simulation accuracy, and PCCs between θest and the total score on the PHQ-9 to confirm concurrent validity.

Calibration of PHQ-9 to the IRT model

To compare the measurement precision of the scale with that of the PHQ-9, we calibrated the PHQ-9 to the GRM model (analyzed using the “fixedCalib” function in the R package “mirt,” version 1.35.1), and performed an estimation using the calibrated items20. We plotted the Lowess curves of SEMs for the following: (a) CAT simulations with a fixed number of items and (b) estimation using the calibrated PHQ-9 items. Subsequently, we determined the minimum number of items required to surpass the measurement precision of the calibrated PHQ-9.


Study participants

A total of 393 participants completed the questionnaires. The average score for all items was 1.44/5. The descriptive data are shown in Table 2. Among 289 patients who completed the PHQ-9, 77 (27%), 15 (5%), and 5 (2%) patients showed mild, moderate, and moderately severe to severe depression, respectively.

Table 2 Descriptive data of study participants.

Item selection and parameter estimation

Cronbach’s alpha for all 62 items was 0.97. As shown in Fig. 1, 28 of these items were included in the GRM and CAT simulations. Cronbach’s alpha was 0.95 after the item selection. Most unanswered categories comprised those with higher scores (4 or 5).

Figure 1
figure 1

Flowchart of the item selection process. The flowchart shows the number of included and excluded items, as well as the reasons for exclusion.

The examples of parameters in the GRM are shown in Table 3 (see Supplementary Table 1 for the parameters of all the items). Overall, the discrimination parameters ranged from 1.53 to 3.32. The first and last difficulty parameters ranged from 0.09 to 1.60 and 2.55 to 4.33, respectively. The item with the highest discrimination parameter was “I feel depressed and have difficulty in daily life.” The items with the lowest difficulty parameter were “I often feel helpless” and “I feel hopeless for the future.” The items with the highest difficulty parameter were “I need help with my depression” and “Others don't understand me”.

Table 3 Examples of estimated parameters.

CAT simulations

The results of the CAT simulations are presented in Table 4. When the termination criteria of the SEM threshold were set to 0.50, the most accurate simulation used the BME estimator and UW-FI item selection, achieving a PCC of 0.916 (95% confidence interval [CI], 0.899–0.931) using an average of 6.96 items. It also achieved a PCC with a total PHQ-9 score of 0.669 (95% CI, 0.600–0.728).

Table 4 Results of the computerized adaptive testing simulations.

The Lowess curves for the SEMs of the CAT simulations are shown in Fig. 2. CAT using only four items had smaller SEMs at any θest than the estimation using the calibrated PHQ-9. The estimated parameters of PHQ-9 are listed in Supplementary Table 2.

Figure 2
figure 2

Lowess curves for the standard error of measurement (SEM) by θ (degree of depression). Each line indicates the Lowess curve for SEMs in each estimation (□, estimation using the calibrated PHQ-9; ○, 4-item CAT; △, 8-item CAT). The measurement precision of CAT using only four items surpassed that of the estimation using the calibrated PHQ-9.


We developed a new scale for measuring depression in patients with cancer based on an IRT model and CAT simulations. The CAT simulations showed that a small number of items could accurately measure the degree of depression. The scale also showed a significant correlation with the PHQ-9 total score and achieved a smaller SEM than the calibrated PHQ-9 using only four items.

More than half of the items were excluded through the item selection process. The main reasons for this included unanswered categories, violations of local independence, and unsuitable category response curves. The existence of unanswered categories, which mainly comprised those with higher scores, may have resulted from the exclusion of patients undergoing psychiatric treatment. The existence of local dependence in many items might suggest duplication or redundancy in our item development. The unsuitable category response curves may have resulted from sample size insufficiency because the GRM requires more than 500 samples to estimate the parameters accurately22. The remaining 28 items exhibited a Cronbach’s alpha of 0.95, suggesting substantial internal consistency23.

The exclusion of more than half of the items may also be attributable to the item selection using the unidimensional model. Instead, the bifactor model applied for larger item banks would be beneficial for developing CAT with more items. Gibbons et al. showed that such an analysis could result in the development of a CAT measuring depression/anxiety with hundreds of items24,25. Such an analysis would also be necessary for our aim to develop CAT measuring depression in patients with cancer.

The parameters of several items may explain the characteristics of depression in patients with cancer. The discrimination parameter corresponds to the slope of the GRM and indicates the ability to discriminate subjects’ traits. The highest discriminative item was about the influence on daily life. Such an influence appears highly informative for assessing depression in patients with cancer. A previous study, which assessed depression in patients with cancer using IRT, showed that social withdrawal or decreased talkativeness is highly discriminative26, which is similar to the result of the present study. However, other IRT-based studies on depression in patients with cancer did not include items that assessed the influence on daily lives27,28,29. Thus, the importance of this item needs to be further examined.

The difficulty parameter indicates the traits of participants at which the probability of choosing either of the two adjacent categories is equal. Thus, items with high difficulty parameters were selected by participants with high severity, whereas items with low difficulty parameters were selected even by participants with low severity. The items about helplessness and hopelessness showed low difficulty parameters, suggesting that patients with cancer easily experience these symptoms. In contrast, the items about support and understanding from others showed high difficulty parameters, suggesting that these symptoms would be observed in highly depressive patients with cancer. These items were not included in previous IRT-based studies on depression in patients with cancer26,27,28,29. In addition, the item selection process may have excluded items with higher difficulty or those with less difficulty. Thus, further studies are required to determine the importance of these items.

The CAT simulations achieved high measurement accuracy using a small number of items, exhibiting strength in shortening the health measurement scales. The significant correlation with the PHQ-9 score implies the ability of the scale to measure depression. Moreover, the CAT simulations showed a higher measurement accuracy than the estimation using the calibrated PHQ-9. Thus, the scale can be employed in clinical settings to efficiently evaluate depression in patients with cancer. The CAT developed in this study could be made available online, as in the PROMIS project, which would allow efficient assessment of depression in patients with cancer to be applied in clinical settings, such as palliative care and psycho-oncology.

This study has several limitations that need to be addressed. First, we excluded patients who were undergoing psychiatric treatments, which may have affected item selection and limited the situations of the CAT’s usage. Second, the sample size was insufficient. GRM reportedly requires more than 500 samples to estimate parameters of 25 items appropriately22. However, we recruited only 393 participants to estimate the parameters of 62 items. Third, we could not perform a DIF analysis for the history of depression because only few participants had it, which is likely due to the exclusion of participants under psychiatric treatments. Fourth, the final item set is limited, and only a small number of them are available for adaptive administration at each level of depression severity. Fifth, the moderate correlation between the CAT and the PHQ-9, with a correlation coefficient of 0.67, may imply inadequate measurement of depression by the scale. This result might also suggest that the items did not cover all subdomains of depression, such as somatic symptoms. Further investigation is necessary to examine the correlation between the CAT and the HADS. Finally, we did not confirm that the scale could accurately classify whether a patient has a major depressive disease or not. Several studies examined the diagnostic performance of the developed CAT for patients diagnosed through gold standard measures, such as structured interviews24,25. Such examination for diagnostic ability is also required for the newly developed CAT in the present study.

In conclusion, this study developed a scale for measuring depression in patients with cancer based on IRT and CAT, providing a useful and improved way for clinicians to evaluate depression in patients with cancer.