1 Introduction

Drivers with road traffic offences are those who have committed traffic violations such as using their phones while driving, exceeding speed limits, or driving under the influence of alcohol or drugs. These behavioral patterns jeopardize road safety and so offenders are penalized by fines or imprisonment. Since the end of the 20th century, many European countries have created specific national systems to deal with road offences (DGT 2011). Through these systems it is possible to determine if a driver is disqualified or at risk of being so.

In Spain, the points-based driving license (Driving License by Points) works by deducting points for each traffic violation. Drivers are disqualified when they lose all their points or commit a severe offence. To drive again, they must attend driving courses (Ministerio del Interior 2005). These can be either a partial course, in which up to 6 points can be recovered, or a total course, which aims at recovering the driving license.

Emotions have a significant influence on drivers’ perception, attention, attitudes and behavior, and as a consequence on driving performance and possible road traffic violations (Eherenfreund-Hager et al. 2017). For example, reckless and careless drivers seek thrills when driving and tend to fail to comply with road safety standards, angry and hostile drivers show anger in their actions towards other drivers, and the anxious driver style is linked with driving under high levels of stress (Mirón-Juárez et al. 2020; Navon-Eyal and Taubman-Ben-Ari 2019). Thus, adaptive regulation of emotions might help to achieve safe driving (Trógolo et al. 2014).

Research on Emotion Regulation (ER) has risen in popularity over the past two decades within the field of psychology. ER is a multidimensional construct that involves several abilities (e.g., awareness, understanding, acceptance) focused on achieving an adaptive response to emotions (Gratz and Roemer 2004). Further, the consequences and trajectory of an emotion depend on how the person responds to that emotion (Gratz et al. 2018). Indeed, the relationship between ER and psychopathology is largely explained by difficulties in properly regulating emotions, which is usually known as emotion dysregulation. The rapid growth of this construct is also reflected in the exploration of the role of ER in less common fields of application such as traffic psychology.

ER strategies such as impulse control, access to strategies, goal-directed behavior or acceptance of one’s own emotions might be altered in drivers who behave in hostile or angry ways (Navon-Eyal and Taubman-Ben-Ari 2020). Expressive suppression is another ER strategy that has been linked to dangerous driving events (Parlangeli et al. 2018) and that is used especially by males, who commit more traffic offences (DGT 2011). By contrast, females have less difficulty with emotional awareness (Šeibokaité et al. 2017) and tend to use more cognitive reappraisal, which may help to reduce their level of anger and to adopt more careful driving styles (Holman and Popușoi 2020). In driving contexts, using cognitive reappraisal implies being more aware of risk behaviors, which may reduce violations, errors and lapses (Parlangeli et al. 2018). Experiential avoidance has also been related to fewer offences, even though its desirable effects are not as pronounced as those of cognitive reappraisal (Holman and Popușoi 2020).

Current studies mainly aim to assess how emotions and their monitoring could interfere with driving styles. Self-reports have been the most widely used instruments to evaluate maladaptive ER in driving contexts. Among them, the Difficulties in Emotion Regulation Scale (DERS) stands out as the most commonly used instrument to assess ER difficulties (Pérez-Sánchez et al. 2020a). It was developed from a functional perspective that conceives ER as multifaceted, implying that understanding, acceptance and awareness of emotions are necessary to be able to manage them (Gratz and Roemer 2004). The DERS consists of 36 items organized in 6 subscales (nonacceptance, goals, strategies, impulse, awareness, clarity). The test includes 11 reverse-scored items. The response format is a 5-point Likert scale that ranges from 1 (almost never) to 5 (almost always). Scores can be calculated both for the total scale and for each of the subscales. Higher scores indicate more difficulties in regulating emotions.

Considering the importance of the quality of tests as assessment instruments, it is of great relevance to explore their psychometric properties, especially when they are widely used instruments both in general and in the context of driving. The analysis of the DERS scores has been usually carried out from the Classical Test Theory (CTT) approach (Pérez-Sánchez et al. 2020b). These classical assumptions have brought a handy and simple approach that has dominated psychometrics in the last century (Engelhard and Wang 2021). However, one of the main disadvantages of using CTT is that scores are ordinal and lack measurement invariance (person measures are item-dependent and item measures vary among samples of people). The dichotomous Rasch Model (RM) and its extensions for polytomous items (widely used in the framework of Item Response Theory) try to solve some of the problems of CTT. These models make it possible to obtain invariant measures of items and persons in a dimension with interval properties, analyze their interactions and obtain specific estimates of the precision of the parameters (Bortolotti et al. 2013; Engelhard and Wang 2021; Golia and Simonetto 2015). In addition, it is easy to analyze whether the data meet the requirements of unidimensionality and absence of differential item functioning associated with demographic or cultural subgroups. We have chosen the Rating Scale Model (RSM) proposed by Andrich (1978), a widely used extension of the RM to analyze scales with a Likert-type format. As described below, this model allows empirical analysis of the functionality and metric quality of response categories.

The RSM is formulated as

$$\ln\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = B_n - D_i - F_k \tag{1}$$

where $P_{nik}$ is the probability that person n, on encountering item i, would be observed (or would respond) in category k; $B_n$ is the level of person n; $D_i$ is the location of item i; and $F_k$ is a rating scale threshold, defined as the location corresponding to equal probability of observing adjacent categories k-1 and k.
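As an illustration of Eq. 1, the following minimal Python sketch derives the category probabilities implied by the adjacent-category log-odds. The function name, the zero-based category indexing and the numerical values are assumptions made for the example; they are not estimates from this study, and the sketch is not how Winsteps computes its results.

```python
import math

def rsm_category_probabilities(b_n, d_i, thresholds):
    """Return [P(X=0), ..., P(X=m)] under the Rating Scale Model.

    b_n: person measure in logits (hypothetical value)
    d_i: item location in logits (hypothetical value)
    thresholds: [F_1, ..., F_m], shared by all items
    """
    # Log-numerator of category k is the sum over j <= k of (B_n - D_i - F_j);
    # the lowest category has an empty sum, i.e. 0.
    cumulative = [0.0]
    for f_k in thresholds:
        cumulative.append(cumulative[-1] + (b_n - d_i - f_k))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    numerators = [math.exp(c - peak) for c in cumulative]
    total = sum(numerators)
    return [num / total for num in numerators]

# Illustrative values only: one person, one item, three categories.
probs = rsm_category_probabilities(b_n=-2.1, d_i=0.0, thresholds=[-1.0, 1.0])
```

By construction, the log-odds of any two adjacent categories returned by this function equals B_n - D_i - F_k, which is the relation stated in Eq. 1.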

Likert-type response categories are arbitrarily assumed as equidistant, ordered and symmetrically distributed. In other words, the scale scores are being treated arbitrarily as if they were interval scores.

However, the fulfillment of these assumptions is not analyzed empirically, so Likert scales can be considered an example of arbitrary measurement. RSM, on the other hand, allows empirical analysis of the properties of the response categories and their metric quality. Consequently, the examination of response format is an essential requirement to make adequate interpretations (Simms et al. 2019). According to Linacre (2002), a good functionality of rating scales should meet the following criteria: (a) high frequencies in each category, at least 10 observed counts, (b) acceptable distribution of observations in the categories: uniform, unimodal peaking in central or extreme categories, or bimodal distributions peaking in extreme categories, (c) the average person measures of each category must go monotonically up the rating scale, (d) infit and outfit mean square statistics in each category have to be less than 2.00, and (e) thresholds between categories have to be ordered monotonically. If response categories do not work properly, adjacent categories have to be collapsed (Cavanagh and Fisher 2018).
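Criteria (a), (c), (d) and (e) lend themselves to a simple automated check once category statistics are available. The sketch below is only an illustration under stated assumptions: the dictionary keys, the function name and the example values are hypothetical, and criterion (b), the shape of the distribution, is left to visual inspection.

```python
def check_rating_scale(categories, min_count=10, max_mnsq=2.0):
    """Check four of Linacre's (2002) criteria.

    categories: list of dicts ordered from the lowest to the highest category,
    with hypothetical keys 'count', 'mean_measure', 'infit', 'outfit' and
    'threshold' (None for the lowest category, which has no lower threshold).
    """
    counts = [c["count"] for c in categories]
    means = [c["mean_measure"] for c in categories]
    thresholds = [c["threshold"] for c in categories if c["threshold"] is not None]
    return {
        # (a) at least min_count observations in every category
        "enough_counts": all(n >= min_count for n in counts),
        # (c) average person measures increase with the category
        "monotonic_means": all(a < b for a, b in zip(means, means[1:])),
        # (d) infit and outfit mean squares below the cut-off
        "acceptable_fit": all(
            c["infit"] < max_mnsq and c["outfit"] < max_mnsq for c in categories
        ),
        # (e) thresholds increase with the category
        "ordered_thresholds": all(a < b for a, b in zip(thresholds, thresholds[1:])),
    }

# Made-up statistics for a five-category scale with one disordered threshold.
example = [
    {"count": 900, "mean_measure": -2.0, "infit": 1.0, "outfit": 1.0, "threshold": None},
    {"count": 700, "mean_measure": -1.2, "infit": 0.9, "outfit": 0.9, "threshold": -1.5},
    {"count": 150, "mean_measure": -0.6, "infit": 1.1, "outfit": 1.2, "threshold": 0.8},
    {"count": 200, "mean_measure": -0.1, "infit": 1.0, "outfit": 1.1, "threshold": 0.2},
    {"count": 120, "mean_measure": 0.7, "infit": 1.2, "outfit": 1.3, "threshold": 0.5},
]
report = check_rating_scale(example)
```

In this made-up example only the threshold ordering fails, which is the kind of pattern that motivates collapsing adjacent categories.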

The assumptions of unidimensionality, local independence of the items, invariance of the measures and conjoint measurement of persons and items are central in RSM and Rasch models. One of the most representative advantages of Rasch models is the visual representation of items and persons on the same scale, provided that invariant measurement is met and if the instrument is well targeted to the level in the latent variable of people assessed (Prieto and Delgado 2003).

So far, the quality of the DERS scores has been studied from the CTT perspective. Considering the advantages of the Rasch family of models, our objective was to evaluate the psychometric properties of DERS scores with the RSM in a driving offences context.

2 Method

2.1 Participants

The final sample consisted of 318 Spanish drivers, all males, with ages ranging from 20 to 69 years (M = 41.6; SD = 11.0). With regard to educational level, most participants had completed at least middle school (69%). In terms of driving frequency, 89% of the participants drove on a daily basis, provided that they held a valid driving license.

The sample was divided into two groups: a driving school sample (n = 159) and a matched sample (n = 159). The former included participants who were recruited at the driving school because they had committed road traffic offences or been disqualified from driving. As there were very few women in the point recovery courses, only men were included in this sample. The latter comprised participants without road traffic offences, recruited by convenience (intentional) sampling in public places (e.g., garages, cafes) and matched on driving frequency, educational level and age (±3 years, so that there was no significant difference in age between the groups, t = -0.22, p = .859, d = -0.02).

2.2 Measures

Sociodemographic and driving frequency questions were used to record variables such as age, gender, education and frequency of driving.

A Spanish version of the Difficulties in Emotion Regulation Scale (DERS) (Hervás and Jódar 2008) was administered. It is a 28-item self-report questionnaire measuring difficulties in several emotion regulation abilities. The Spanish version of the DERS consists of five subscales: (a) nonacceptance (non-acceptance of emotional responses), (b) goals (difficulty in adopting goal-oriented behaviors), (c) impulse (difficulty in controlling impulses and limited access to ER strategies), (d) awareness (lack of emotional awareness) and (e) clarity (lack of clarity in identifying one’s emotions). This version contains 5 reverse-scored items. The response format is a 5-point Likert scale (1 = almost never, 2 = sometimes, 3 = half the time, 4 = most of the time, 5 = almost always). Subscale and total scores are calculated by summing the corresponding item responses; high scores on the total scale indicate difficulties in ER skills. Cronbach’s alpha for the total score was 0.93 in both the original and the Spanish version (Gratz and Roemer 2004; Hervás and Jódar 2008). The validation of the Spanish version relied on CTT, so the functionality of the response categories was not tested (Hervás and Jódar 2008).

2.3 Procedure

The study was approved by the Dirección General de Tráfico (DGT), two driving schools in Salamanca (Spain) and the bioethics committee of the University of Salamanca. All potential participants received a brief explanation about the aims of the research and their confidentiality was ensured. The total time required to complete the questionnaire for each participant was 15–20 min.

2.4 Data analysis

Responses to the DERS were analyzed under the RSM (Andrich 1978). Data analysis was performed using the computer program Winsteps 4.7.0 (Linacre 2021). The performance of the response categories was examined empirically according to Linacre’s (2002) criteria. Unidimensionality was tested by means of a Principal Components Analysis (PCA) of the residuals (Chou and Wang 2010). It is recommended that the variance explained by the Rasch measure be at least 40% and that the eigenvalue of the first contrast be less than 3.0. If these requirements are not met, items that may work together in a second dimension can be identified by looking for factor loadings of 0.50 or higher on the first contrast (component) of the residuals together with a point-polyserial correlation below 0.20 between the item and test raw scores (Linacre 2021). Non-compliance with the local independence requirement was analyzed using Yen’s Q3 statistic, which quantifies the correlations between item residuals. High positive correlations indicate local dependence (Linacre 2022). In practical terms, correlations over 0.70 would be clearly indicative of local dependence (Linacre 2013).
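The local dependence check can be sketched as pairwise correlations between item residuals, flagging pairs above the 0.70 cut-off. The code below is a plain-Python illustration with made-up residuals; the function names and the data are hypothetical, and Winsteps may compute its residual correlations differently in detail.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

def residual_correlations(residuals, cutoff=0.70):
    """residuals: one row per person, one column per item (observed - expected).

    Returns all pairwise item correlations and the pairs above the cut-off.
    """
    n_items = len(residuals[0])
    columns = [[row[j] for row in residuals] for j in range(n_items)]
    q3 = {}
    for i in range(n_items):
        for j in range(i + 1, n_items):
            q3[(i, j)] = pearson(columns[i], columns[j])
    flagged = [pair for pair, r in q3.items() if r > cutoff]
    return q3, flagged

# Made-up residuals for 5 persons and 3 items; items 0 and 1 are nearly redundant.
toy_residuals = [
    [0.5, 0.4, -0.2],
    [-0.3, -0.2, 0.4],
    [0.1, 0.2, -0.5],
    [-0.6, -0.5, 0.1],
    [0.4, 0.3, 0.3],
]
q3, flagged = residual_correlations(toy_residuals)
```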

Data-model fit was assessed by two fit statistics: infit (mean square of the standardized residuals weighted by the information function) and outfit (unweighted mean square of the standardized residuals). Infit is more sensitive to unexpected responses to items close to the subject’s location. Outfit is more sensitive to unexpected responses to items far away from the subject’s location (outliers). Infit/outfit values over 2 degrade measurement (Linacre 2021). Reliability statistics reported by Winsteps, such as Item Separation Reliability (ISR) and Person Separation Reliability (PSR), assess the accuracy of the item and person estimates by indicating the proportion of the observed variance that is reproducible by the Rasch model. ISR and PSR values over 0.70 are recommended to achieve a suitable measure.
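The two mean-square statistics can be illustrated with a short sketch that takes, for one item or one person, the observed responses together with their model-expected values and model variances (the information). The function name and the numbers are hypothetical and serve only to show the arithmetic.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean squares for one item or one person.

    observed: observed responses
    expected: model-expected responses
    variance: model variance of each response (the information weight)
    """
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: plain average of the squared standardized residuals.
    outfit = sum(r / w for r, w in zip(squared_residuals, variance)) / len(observed)
    # Infit: squared residuals summed and divided by the summed information,
    # so well-targeted responses (large variance) carry more weight.
    infit = sum(squared_residuals) / sum(variance)
    return infit, outfit

# Made-up values for four responses.
infit, outfit = fit_mean_squares(
    observed=[1, 2, 3, 1],
    expected=[1.2, 1.9, 2.4, 1.1],
    variance=[0.3, 0.5, 0.4, 0.2],
)
```

Both statistics have an expected value of 1.0 when the data fit the model, which is why values far above 1 (such as the cut-off of 2 used here) signal unexpected responses.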

Differential item functioning (DIF) occurs when individuals with the same level of the measured attribute (e.g., difficulties in emotion regulation) who belong to different groups (e.g., reference group and focal group) do not have the same probability of giving a particular response (e.g., a correct answer) to an item (Prieto and Nieto 2014). DIF is a systematic error that can affect the validity of the test scores. DIF is considered present when the difference between the item location parameters of the two groups exceeds 0.64 logits in absolute value and is statistically significant according to Welch’s t-test with Bonferroni-corrected alpha levels (Linacre 2021). Impact, that is, the effect on the scores of group differences in the measured attribute, is relevant evidence for analyzing criterion validity. In this study, impact was operationalized as the difference between the mean DERS scores of drivers with road traffic offences and matched drivers. To this end, Welch’s t-test (which is more appropriate than Student’s t-test when equal variances cannot be assumed) and Cohen’s d (as a measure of effect size) were calculated.
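The impact analysis reduces to two standard formulas, sketched here with the Python standard library. The function names and the toy data are hypothetical; the p-value is omitted because it requires the t distribution, which a statistics package would supply.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    se2_a = variance(group_a) / len(group_a)  # squared standard error, group A
    se2_b = variance(group_b) / len(group_b)  # squared standard error, group B
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
    )
    return t, df

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

# Toy data, not taken from the study.
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]
t_value, df_value = welch_t(a, b)
d_value = cohens_d(a, b)
```

With unequal variances, the Welch degrees of freedom fall below n_a + n_b - 2, as they do in this toy example.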

3 Results

The psychometric quality of the five original response categories of the DERS was analyzed. The threshold between category 3 (about half the time) and category 4 (most of the time) was disordered with respect to the other thresholds. Category 3 was not the most likely category in any range of the variable. Table 1 shows that Linacre’s (2002) criteria were not met for the five original categories. Therefore, it was necessary to collapse the central categories (2 + 3 + 4). The performance of the three resulting categories was adequate (Table 2). Subsequent analyses were carried out with the data derived from the three collapsed categories (1 = almost never, 2 = about half the time, 3 = almost always).
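The recoding step, in which the three central categories are merged into one, can be expressed as a simple lookup. This is only an illustrative sketch; the names are hypothetical and the response vector is made up.

```python
# Original category -> collapsed category (2, 3 and 4 are merged).
COLLAPSE_MAP = {1: 1, 2: 2, 3: 2, 4: 2, 5: 3}

def collapse_responses(responses):
    """Recode a list of 1-5 responses into the three collapsed categories."""
    return [COLLAPSE_MAP[r] for r in responses]

# Made-up responses from one participant.
collapsed = collapse_responses([1, 2, 3, 4, 5, 3])
```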

Table 1 DERS: Analysis of the Five Original Categories

The PCA of the residuals showed that the percentage of variance explained by the Rasch dimension was 30.9% and that the eigenvalue of the first contrast of the unexplained variance was 3.57, which indicated a lack of unidimensionality. Items 1, 2, 6, 7 and 9 showed factor loadings over 0.50 on a second dimension, and the point-polyserial correlations between these items and the DERS raw score were low (between 0.20 and 0.31). All these items shared a common feature: they were reverse-scored. Data analysis was carried out again after deleting these 5 items. The percentage of variance explained by the Rasch dimension was then 38%, and the eigenvalue of the first contrast of the unexplained variance was 2.44. Therefore, there was enough evidence for essential unidimensionality. As to local independence, there were no high positive correlations between pairs of residuals, which ranged from -0.19 to 0.32. Linacre’s (2002) criteria for the three categories (1 = almost never, 2 = about half the time, 3 = almost always) were met after deleting the 5 reverse-scored items from the DERS (Table 2). The rest of the analyses were carried out with the remaining 23 items.

Table 2 DERS: Analysis of the Three Collapsed Categories

Count: number of counts observed in each category; Mean: Mean of the differences in each category between the person and item parameters; Threshold: value between adjacent categories.

There was good data-model fit: for items, mean infit was 1.00 (SD = 0.2) and mean outfit was 0.98 (SD = 0.2). Table 3 shows that no item showed overfitting (infit or outfit < 0.50) or severe misfit (over 2.0). For persons, mean infit was 1.00 (SD = 0.5) and mean outfit was 0.98 (SD = 0.6). Sixteen participants (5.1%) showed infit and/or outfit over 2.

Table 3 DERS Item Statistics

There were no extreme scores for items. Regarding persons, six participants obtained the minimum score (-6.81 logits). Person measures are located on the left of the Wright map, while the right side shows item locations (Fig. 1). Variability was higher among persons than among items. The average item location is conventionally set at 0.0 logits. The average person measure was -2.1 logits (SD = 1.7). Both item separation reliability (ISR = 0.97) and person separation reliability (PSR = 0.89) were good. No items showed group-related DIF (Table 4). The mean Rasch measure was similar in the group with road traffic offences (-2.09 logits) and in the group of matched controls (-2.12 logits). Therefore, no statistically significant group differences (impact) in Rasch measures were found, Welch’s t(312) = 0.13, p = .89, d = 0.00.

Fig. 1 DERS: Conjoint Measurement Wright Map

Table 4 DERS: Group-Related DIF

4 Discussion

The current study examined the psychometric properties of DERS scores in a sample of Spanish drivers, half of them offenders. The 5 response categories of the original DERS item format showed inadequate performance since thresholds between intermediate categories were disordered. Collapsing categories improved the empirical and theoretical validity of the results (Wright and Linacre 1992). In this study, 3 categories showed adequate functionality.

The dimensionality analysis indicated that the second dimension was related to the 5 reverse-scored items. Including this sort of item could jeopardize unidimensionality by introducing secondary sources of variance, since it is taken for granted (disputably) that the meaning of the response categories is the same in direct and reverse-scored items (Suárez et al. 2018). Moreover, these items had common theoretical content, as they formed the awareness subscale. Low correlations between this subscale and the test had already been shown in previous studies and versions of the DERS (Pérez-Sánchez et al. 2020b). Given that the awareness subscale is related to attentional processes involved in regulating emotions, a possible explanation might be that these processes can be both beneficial (promoting self-efficacy of regulation) and detrimental (increasing ruminative processes), as shown by Lischetzke and Eid (2003). At any rate, it does not seem adequate to propose a total score when there is some evidence of a second dimension, as is the case here. After deleting the 5 items, the unidimensionality requirement was met. Data-model fit was good enough. Both item separation reliability and person separation reliability were good. Finally, DIF must be ruled out in order to adequately interpret differences between the mean scores of groups. We found neither group-related DIF nor impact. On average, participants did not show difficulties regulating their emotions.

Many studies showing a relationship between ER difficulties and driving attitudes and/or behaviors have used the DERS score as an independent variable without any consideration for its psychometric properties. For example, the performance of the response categories of scales with polytomous response formats is not always tested (Bec-Gérion and Gaymard 2022) even if the goal of the research is to examine the functionality of the response categories. As to evidence based on the internal structure, Žardeckaitė-Matulaitienė et al. (2020) indicated that ER difficulties might be the target to address in early interventions assuming that ER represented a general phenomenon without having carried out a dimensionality analysis. Local independence is a requirement associated with unidimensionality that can be violated when items are redundant in such a way that the response to one item determines the response to another one. Local dependence increases homogeneity and spuriously inflates reliability indicators such as internal consistency (e.g. Cronbach’s alpha). When there is a violation of the local independence requirement, some of the dependent items might have to be removed.

As to the design of previous studies carried out with the DERS, significant correlations have been found between ER difficulties and various driving styles, such as risky driving (Navon-Eyal and Taubman-Ben-Ari 2020; Trógolo et al. 2014; Žardeckaitė-Matulaitienė et al. 2020). However, risk while driving was not assessed as a factual conduct but as a self-report of personality tendencies, and no comparison group was used to control for extraneous variables. Conversely, the present study groups were formed taking into account their road traffic offences, and matched on age, educational level and driving frequency.

With respect to the sample, this study was carried out only with males, given the scarcity of females with road traffic offences. Even though the number of female drivers has grown in the last few decades, in Spain few of them have committed road traffic offences and attended driving school courses. Therefore, the gender variable was controlled by elimination. Although the percentage of such women in Spain may be similar to that in other countries, future studies should try to include female traffic offenders.

To conclude, the criterion validity in the traffic context of the most commonly used ER difficulty scale should be seriously questioned. Five malfunctioning reverse-scored items had to be deleted. Given that our study was designed both considering the psychometric properties of the scale and controlling for relevant variables, the lack of differences between the two groups, with and without traffic violations, should be more trustworthy than results from less controlled correlational studies. This lack of impact could be due to the fact that the content of the DERS items is not explicitly related to traffic contexts. Future research in this field should include items adapted to the traffic context and analyze the psychometric properties of those tests in other samples of drivers.

From a different perspective, but also with the aim of solving practical problems in the traffic context, new research tactics, such as participatory research strategies (Bleijenbergh et al. 2011), could be used in a complementary way in future studies.