The evaluation employed a quasi-experimental design featuring three groups: 1) graduated PUST residents (N = 26), 2) graduated Internatura residents (N = 8), and 3) 1st year PUST residents (N = 20). All three groups consisted of graduated medical students specializing, or having newly specialized, in family medicine. Participants in Group 1 received training under the PUST program; participants in Group 2 completed the Internatura program; and participants in Group 3 were recently graduated medical students who had newly entered the PUST program and had not yet received on-the-job training. Group 1 is therefore our target (intervention) group, while Groups 2 and 3 serve as control groups. Each group was evaluated once, but due to the nature of the groups and the timing of the evaluation, the meaning of these evaluations differs across groups. Graduated PUST residents and graduated Internatura residents were evaluated after the specialization training (post-graduation, as in a posttest), whereas 1st year PUST residents were evaluated before completing the training (as in a pretest).
This design allows the examination of differences between participants in the intervention group (Group 1) and the control groups (Groups 2 and 3). Positive differences between Group 1 and Group 2 and between Group 1 and Group 3 would both suggest an intervention effect, provided that the other conditions for inferring causality are met.
In our study, random assignment to the intervention condition was infeasible due to logistical and budget constraints. However, comparing the two control groups (pre-intervention and post-intervention) helps shed light on initial differences that could confound the intervention effect with selection. Finding no differences between these two untreated groups would support group comparability. We thoroughly address this and other design limitations for sustaining causal claims in the discussion section.
The two program outcomes in the evaluation were clinical knowledge and clinical competencies. The instruments comprised a written multiple-choice question test (MCQT), focusing on participants' clinical knowledge, followed by an Objective Structured Clinical Examination (OSCE), focusing on clinical competencies (see Table 1). We maximized the validity of the two measures prior to administration, during instrument development, and after administration by conducting psychometric analyses.
Instruments were developed based on the nationally defined standards and competencies of a FM doctor, in combination with the highest in-country disease burden typically treated by a FM doctor, and complemented by expert opinion from international and national medical experts highly familiar with the Tajik context.
The multiple-choice questions were developed based on recommendations by the Yale Center for Teaching and Learning, the University of Waterloo Centre for Teaching Excellence, and Considine et al. Additional validated multiple-choice questions were chosen from a pool of questions from AMBOSS GmbH and adapted to the Tajik context. The final test consisted of 60 multiple-choice questions, each with one correct answer and four distractors. Questions present applied problems based on the most common diseases treated by a FM doctor and the highest disease burden, as well as the qualifications expected of FM doctors in Tajikistan.
For the OSCE, a total of five scenarios were developed: two focusing on history taking and anamnesis, aiming to assess attitudes and practices, and three on tracer diseases and lab results, aiming to assess examination, management, and communication skills. Scenarios were drawn from existing international OSCEs for FM doctors and adapted to the Tajik context. In addition, instructions and templates for examiners and patient simulators were developed.
Draft versions of the MCQT and OSCE were shared with an international and national expert group for feedback; the MCQT was reviewed by an expert panel for content relevance and coverage to provide content-based evidence of validity for the Tajik context. All instruments were translated into Tajik, and the translation was checked by English-speaking Tajik medical staff in the MEP, who carefully compared the English and Tajik versions for accuracy and ease of comprehension.
The purpose of the psychometric analysis was to examine the metric properties of the items and scales in the MCQT and OSCE before conducting further analyses. Item properties include difficulty and discrimination; scale properties are reliability and validity. The analysis informed the exclusion of specific items to ensure higher reliability and validity. We followed standard practices from two psychometric frameworks: classical test theory (CTT) [32, 33] and exploratory factor analysis (EFA) [34, 35]. Using CTT, we found heterogeneous item difficulties, ranging from perfectly easy items (0% incorrect answers) to items that were too difficult (99% incorrect answers). We also found items with heterogeneous discrimination, with many items showing small (below 0.3) or even negative item-test Pearson correlation coefficients (discrimination index). We excluded items that were too easy or too difficult, as well as items with discrimination below 0.1.
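For illustration, this CTT screening step can be sketched as follows. This is not the study's analysis code: the scored response matrix is simulated, and the variable names are ours, but the difficulty and discrimination definitions and the exclusion rules mirror those described above.

```python
# Illustrative CTT item screening on a 0/1-scored response matrix
# (simulated data; not the study's code).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# rows = participants, columns = items; 1 = correct, 0 = incorrect
responses = pd.DataFrame(rng.binomial(1, 0.6, size=(54, 60)),
                         columns=[f"item_{i+1}" for i in range(60)])

total = responses.sum(axis=1)
keep = []
for item in responses.columns:
    difficulty = 1 - responses[item].mean()   # proportion of incorrect answers
    # item-test Pearson correlation as the discrimination index
    discrimination = responses[item].corr(total)
    # exclude items that are too easy (0% incorrect) or too difficult
    # (99% incorrect), and items with discrimination below 0.1
    if 0.0 < difficulty < 0.99 and discrimination >= 0.1:
        keep.append(item)

screened = responses[keep]
```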
Next, we conducted an EFA to examine the internal structure of the two scales (MCQT, OSCE) and the OSCE subscales. We were interested in testing how well the data conformed to the intended unidimensional structure of the two measures; thus, we tested for unidimensionality by forcing a one-factor solution. The EFA was conducted using maximum likelihood estimation with no rotation. Maximum likelihood factor analysis produces a factor solution that accounts for the common variance among items while excluding random error and unique item variance. We found heterogeneous factor loadings, with a few items per scale exhibiting low or negative loadings. We initially retained items with factor loadings above 0.3, but ultimately kept items with loadings above 0.2 to avoid sacrificing content coverage and undermining content representation and validity.
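A minimal sketch of this step, assuming the factor_analyzer Python package and the screened item matrix from the CTT sketch above (the study does not name its software; the package choice and threshold variable are ours):

```python
# One-factor maximum likelihood EFA with no rotation (illustrative sketch;
# the study does not specify the software used).
from factor_analyzer import FactorAnalyzer

fa = FactorAnalyzer(n_factors=1, rotation=None, method="ml")
fa.fit(screened)                       # participants x items matrix

loadings = fa.loadings_[:, 0]          # loadings on the single forced factor
# retain items loading above 0.2, the relaxed threshold described above
retained = [item for item, loading in zip(screened.columns, loadings)
            if loading > 0.2]
final_scale = screened[retained]
```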
Table 2 presents the initial and final number of items per scale and subscale, along with reliabilities before and after the CTT and EFA analyses. We report only the MCQT total score because the test did not include enough items per tracer disease, which caused low reliability and poor factor structure at the subscale level. We report the OSCE total score and OSCE subscales because they contain enough items for adequate reliability and factor structure estimation. Overall, most scales and subscales work as intended. Specifically, the results suggest two factors for the subscale “history of present illness” (OSCE station 1); however, only one of the factors shows acceptable reliability and correlates positively with the other subscales. In addition, we excluded two OSCE subscales from further analysis due to low reliability and poor factor solutions: “counseling and challenge” (OSCE station 2) and “communication and interpersonal skills” (OSCE station 5). Thus, in all but three cases, the factor solutions comply with our assumption of unidimensionality, confirming the correct functioning of the scales.
The reliability coefficient reported in our analysis is Cronbach’s alpha, an internal consistency coefficient. All final scales and subscales reached reliabilities acceptable for research purposes (above 0.7 and closer to 1.0). In general, the psychometric analysis informed item exclusions that increased the initial reliabilities. The most notable case is the MCQT score, with an initial Cronbach’s alpha of 0.40 and a final coefficient of 0.85.
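For reference, Cronbach’s alpha follows the standard formula alpha = k/(k − 1) × (1 − sum of item variances / variance of the total score). A minimal sketch, assuming a scored item matrix such as the hypothetical final_scale above:

```python
# Cronbach's alpha internal consistency coefficient (illustrative sketch).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: participants x items matrix of scored responses."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_variance = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# e.g. cronbach_alpha(final_scale.to_numpy())
```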
The final step in our psychometric analyses involved estimating Pearson correlation coefficients among the scales and subscales. Scale and subscale scores are expressed as the percentage of correct responses. We expected positive correlations across scales and subscales: specifically, a positive and moderate correlation between clinical knowledge (MCQT) and clinical competencies (OSCE), and positive correlations among the OSCE subscales. Overall, the data conformed to our expectations. The correlation between the MCQT and the OSCE was r = 0.5 (p < 0.001). The correlation coefficients among OSCE subscales ranged from −0.08 (virtually zero, not statistically significant) to 0.76. These empirical results support the reliability and validity of the MCQT and OSCE scores for evaluation purposes.
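A brief sketch of this estimation, assuming scipy and simulated percent-correct score vectors (stand-ins for the MCQT and OSCE scores, not study data):

```python
# Pearson correlation with significance test between two percent-correct
# score vectors (simulated stand-ins; not study data).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
mcqt_pct = rng.uniform(30, 95, size=50)                  # hypothetical MCQT scores
osce_pct = 0.5 * mcqt_pct + rng.normal(20, 12, size=50)  # correlated hypothetical OSCE scores

r, p = pearsonr(mcqt_pct, osce_pct)
print(f"r = {r:.2f}, p = {p:.3g}")
```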
All residents who newly entered the PUST program, and all who graduated from the PUST program or the Internatura in 2018, were invited to participate. Recruitment of graduated Internatura residents proved challenging: some could not be reached, others had left Tajikistan, and some were on maternity leave. Following the invitation, a total of 54 participants took part in the MCQT and 50 in the OSCE (Table 3). Four participants were no longer available during the assessment for a variety of reasons, including maternity leave and government service. The recruitment strategy produced fewer participants identified as female (43%) than male (57%), and the distribution of participants by biological sex differed across groups. The proportion of female 1st year PUST residents, graduated Internatura residents, and graduated PUST residents was 50%, 62.5%, and 31%, respectively.
Participants were invited by email and followed up by phone in case of no response. The invitation email included information on the purpose and outcome of the evaluation and emphasized that participation was voluntary, that results would be fully encrypted and might be published, and that participation could be withdrawn at any time, including prior to the start of participation, without any consequences. Consent was provided by participating. This was reiterated prior to the start of the MCQT. Participants were compensated for travel costs but did not receive further incentives. The evaluation had been included in the project’s workplan agreed between SDC and MoHSP. Ethical approval was received from the MoHSP (Date: 1.11.2018; order number: 1–6/7747–7306).
The evaluation took place between November and December 2018 at PGMI facilities. The MCQT was administered by three invigilators, who received a short training and written instructions to read out to participants to ensure a standardized process. Invigilators were required to have no previous or current affiliation with any of the FM programs. Participants were allocated a number, randomly assigned to different rooms, and seated separately. The written exam took 2.5 h overall.
A total of four patient simulators were trained based on individual scripts. The OSCEs took place over 2 days, and a total of 10 examiners were trained. Examiners came from the medical institutions training Internatura and PUST residents. To reduce the Hawthorne effect, one examiner from each training institution was placed in one room, overseeing one scenario each. Grading was based on a template with a variety of pre-defined grading criteria; examiners were asked to compare grading results after each performance and come to a joint conclusion.