1 Introduction

Path dependency refers to the impact of a method’s problem solving path on its outcome (Hämäläinen and Lahtinen 2016). In general, a path is a sequence of steps undertaken in a problem solving process and might include, for example, an order in which the different parts of the model are specified and solved, or a way in which data or preferences are collected and processed (Hämäläinen and Lahtinen 2016). In Hämäläinen and Lahtinen (2016), seven interacting origins of path dependence are identified: systemic origins, learning, procedure, behavior, motivation, uncertainty, and external environment.

In general, the path dependence of a model or method, for instance, a multiple-criteria decision-making method, is perceived as a negative phenomenon: path dependency of a problem solving method means that, when some steps of the method are performed in a different order, the outcome (usually the best solution) might also be different, which is, of course, an undesirable result.

Interestingly, studies on this topic in the field of operations research are still rare in the literature. The possibility that different ‘valid’ modeling paths lead to different outcomes was acknowledged by Landry et al. (1983) in the 1980s, but the topic received little interest in the operations research community. Three decades later, path dependence attracted renewed attention in several studies (Hämäläinen and Lahtinen 2016; Lahtinen and Hämäläinen 2016; Lahtinen et al. 2017), the latter of which examined path dependence in the even-swaps method.

The influence of a selected comparison scale on outcomes of multiple criteria decision-making methods, i.e. the scale dependence, has been examined much more. In particular, the problem of a scale in the analytic hierarchy process (AHP) has been studied extensively in recent decades, see e.g. Franek and Kresta (2014), Harker and Vargas (1987), Ishizaka et al. (2010), Leskinen (2008), Ma and Zheng (1991), Poyhonen and Hämäläinen (2001), Salo and Hämäläinen (1997), Setyono and Sarno (2018). For the AHP, Saaty proposed the use of the so-called ‘fundamental scale’ from 1 to 9 (with reciprocal values) based on psychological arguments, see Saaty (1977). Later, many other scales, such as exponential, logarithmic, interval and so on, were proposed for the AHP, but no scale appeared to be superior to the others in general. In the BWM, the same scale from 1 to 9 is applied, though in the original paper (Rezaei 2015) Rezaei notes that other scales can be used as well. However, the effect of the scale on the BWM’s outcomes is currently unknown.

The Best–Worst method (BWM), proposed by Rezaei in 2015 (Rezaei 2015), is one of the most recent contributions to decision-making methods based on pairwise comparisons. Immediately after its introduction it gained significant popularity among researchers and has been applied in many areas of human activity, such as waste management, tourism, sustainability or biochemistry (Abadia et al. 2018; Ahmadi et al. 2017; Chang et al. 2019; Gupta and Barua 2017; Mi et al. 2019; Rezaei et al. 2016; Thurstone 1927).

In the BWM, a decision maker compares each criterion (or possibly some other object, alternative, sub-criterion, etc.) only with the best (most important) criterion and the worst (least important) criterion, and the weights of all criteria are then determined by solving a (non)linear programming problem. The appeal of the BWM lies in its simplicity and in the smaller number of pairwise comparisons required in comparison with the AHP; however, as shown in Mazurek et al. (2021), its robustness is weaker than that of the geometric mean method and the eigenvalue method. The question of whether the BWM is path independent or scale independent (robust to changes in the order in which comparisons are performed, or to changes in the scale on which these comparisons are made) is currently unresolved, as no study on the topic has been published in the literature so far.

Therefore, the primary objective of this paper was to examine the path and scale dependency of the linear version of the BWM, which, unlike its non-linear counterpart, does not suffer from the problem of possible multiple solutions. For this purpose, an experiment was carried out in which more than 800 respondents from two countries (Czechia and Poland) pairwise compared, by estimation (without measuring with a ruler or calculating), the areas of six geometric objects presented in different orders and via different comparison scales. Altogether, five distinct questionnaire forms were distributed among the respondents. Subsequently, these preferences served as an input for the BWM, and the output consisted of the relative sizes of the compared objects for each form.

Afterwards, the path and scale differences among forms were tested by ANOVA and by MANOVA, the multivariate extension of ANOVA, see Allen et al. (2018), Barker and Barker (1984), Brown (1998), Olson (1976), Weinfurt (1995), Zaointz (2022). Post-hoc analysis of the results was performed as well. The secondary objective of our study was to examine the accuracy of the BWM with respect to different scales, including Saaty’s nine-point linguistic scale, an integer scale and a continuous scale.

The study falls, at least partially, into the behavioral operational research (BOR) category, that is, the research field dedicated to understanding how the behavior of human actors influences their decisions, see e.g. Brocklesby (2016), Franco et al. (2021), Hämäläinen et al. (2013), Kunc et al. (2016). In particular, BOR focuses on cognitive biases, which are systematic errors in human judgement. Therefore, the problem of a possible cognitive bias in the presented experiment is discussed separately in the Discussion (Sect. 5) as well.

The data that support the findings of this study are available from the corresponding author upon request.

The paper is organized as follows: Sect. 2 provides an introduction to the Best–Worst method, the experiment is described in Sect. 3, Sect. 4 summarizes the experiment results, in Sect. 5 a brief Discussion is provided and the Conclusions (Sect. 6) close the article.

2 The best–worst method

In the Best–Worst method, see Rezaei (2015, 2016), each criterion is pairwise compared only with the best criterion and the worst criterion.

The Best-Worst method proceeds as follows (Rezaei 2015):

  • Step 1. A set of criteria is determined.

  • Step 2. The decision maker identifies the best (most desirable, most important) criterion and the worst (least desirable, least important) criterion.

  • Step 3. Preferences of the best criterion with respect to all other criteria are determined on the scale from 1 (equal preference) to 9 (absolute preference).

  • Step 4. Preferences of all other criteria with respect to the worst criterion are determined on the scale from 1 to 9.

  • Step 5. Optimal weights of all criteria are found by solving a corresponding non-linear programming problem.

Let \(c_{Bj}\) denote the preference of the best criterion \(B\) over criterion \(j\), and let \(c_{iW}\) denote the preference of criterion \(i\) over the worst criterion \(W\). Let \(w_B\) and \(w_W\) be the weights of the best and the worst criterion, respectively. The goal is to find the vector of criteria weights (a priority vector) \(w = (w_1, w_2, \ldots, w_n)\).

The priority vector is found as a solution of the following optimization problem (Rezaei 2015):

$$ \min \left( {\mathop {\max }\limits_{j} \left\{ {\left| {\frac{{w_{B} }}{{w_{j} }} - c_{Bj} } \right|,\;\left| {\frac{{w_{j} }}{{w_{W} }} - c_{jW} } \right|} \right\}} \right) $$
(1)

s.t.

$$ \sum\limits_{j = 1}^{n} {w_{j} = 1} $$
(2)
$$ w_{j} \ge \, 0,\quad \forall j = { 1}, \ldots ,n. $$
(3)

The problem can equivalently be stated as follows:

$$ {\text{min}}\;\xi $$
(4)

s.t.

$$ \left| {\frac{{w_{B} }}{{w_{j} }} - c_{Bj} } \right| \le \xi ,\quad \forall j = \, 1, \ldots ,n, $$
(5)
$$ \left| {\frac{{w_{j} }}{{w_{W} }} - c_{jW} } \right| \le \xi ,\quad \forall j = \, 1, \ldots ,n, $$
(6)
$$ \sum\limits_{j = 1}^{n} {w_{j} = 1,} $$
(7)
$$ w_{j} \ge \, 0,\quad \forall j = { 1}, \ldots ,n. $$
(8)

Further on, it is assumed that for all j the following inequalities hold:

$$ c_{BW} \ge c_{jW} \ge 1;\quad c_{BW} \ge c_{Bj} \ge 1. $$
(9)

A linear version of the BWM, in which the subscript ‘L’ denotes ‘linear’, was introduced in Rezaei (2016), see also Brunelli and Rezaei (2019):

$$ {\text{min}}\;\xi_{L} $$
(10)

s.t.

$$ \left| {w_{B} - c_{Bj} w_{j} } \right| \le \xi_{L} ,\quad \forall j = \, 1, \ldots ,n, $$
(11)
$$ \left| {w_{j} - c_{jW} w_{W} } \right| \le \xi_{L} ,\quad \forall j = \, 1, \ldots ,n, $$
(12)
$$ \sum\limits_{j = 1}^{n} {w_{j} = 1,} $$
(13)
$$ w_{j} \ge \, 0,\quad \forall j = { 1}, \ldots ,n. $$
(14)

The solution of the model above is denoted as w and the corresponding value of ξL can be considered the degree of inconsistency of preferences. Notice that the solution to the linear version of the BWM differs from the solution to the non-linear version in general (Beemsterboer et al. 2018).
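To make the linear model (10)–(14) concrete, the following is a minimal sketch of how it can be solved as a linear program in Python with scipy.optimize.linprog. The function name, the argument names and the example data are our own illustrative choices and are not part of the original formulation: the absolute-value constraints (11)–(12) are split into pairs of linear inequalities, and ξL is treated as an additional non-negative decision variable.

```python
import numpy as np
from scipy.optimize import linprog


def linear_bwm(c_best, c_worst, best, worst):
    """Illustrative solver for the linear BWM model (Eqs. (10)-(14)).

    c_best[j]  ... c_Bj, preference of the best criterion over criterion j
    c_worst[j] ... c_jW, preference of criterion j over the worst criterion
    best, worst .. indices of the best and the worst criterion
    (c_best[best] and c_worst[worst] are expected to equal 1).
    Returns the weight vector w and the consistency indicator xi_L.
    """
    n = len(c_best)
    # Decision vector x = [w_1, ..., w_n, xi_L]; the objective minimizes xi_L.
    obj = np.zeros(n + 1)
    obj[-1] = 1.0

    A_ub, b_ub = [], []
    for j in range(n):
        # |w_B - c_Bj * w_j| <= xi_L, split into two linear inequalities
        r1 = np.zeros(n + 1); r1[best] += 1.0; r1[j] += -c_best[j]; r1[-1] = -1.0
        r2 = np.zeros(n + 1); r2[best] += -1.0; r2[j] += c_best[j]; r2[-1] = -1.0
        # |w_j - c_jW * w_W| <= xi_L, split into two linear inequalities
        r3 = np.zeros(n + 1); r3[j] += 1.0; r3[worst] += -c_worst[j]; r3[-1] = -1.0
        r4 = np.zeros(n + 1); r4[j] += -1.0; r4[worst] += c_worst[j]; r4[-1] = -1.0
        A_ub += [r1, r2, r3, r4]
        b_ub += [0.0] * 4

    # Weights sum to one; all weights and xi_L are non-negative.
    A_eq = [np.append(np.ones(n), 0.0)]
    b_eq = [1.0]
    bounds = [(0, None)] * (n + 1)

    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]


if __name__ == "__main__":
    # Purely illustrative data: six objects, object 0 is the best, object 5 the worst.
    c_best = [1, 2, 3, 4, 5, 8]    # c_Bj
    c_worst = [8, 5, 4, 3, 2, 1]   # c_jW
    weights, xi_l = linear_bwm(c_best, c_worst, best=0, worst=5)
    print(np.round(weights, 3), round(xi_l, 3))
```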

When comparing n objects pairwise, the analytic hierarchy process (AHP), which is arguably the most popular pairwise comparisons method, requires n(n − 1)/2 comparisons to be made. The BWM requires only comparisons with the best and the worst object, which reduces the number of comparisons to 2n − 3. For example, for n = 10 objects the AHP requires 45 comparisons, whereas the BWM requires only 17. This reduction might be very important when dealing with a large number of objects to compare.

3 The experiment, research hypotheses and the data

3.1 The description of the experiment

For the investigation of the path and scale dependency of the Best–Worst method, the following experiment was designed.

Participants of the research were university undergraduate students aged 19–22. The study was anonymous, and the authors had no access to information that could identify individual participants. Questionnaires were distributed to respondents in classrooms in groups of 10 to 25, and respondents consented verbally to participate in the study at the beginning of the experiment.

Then, respondents were asked to pairwise compare, by estimation (measurements with a ruler or calculations were not allowed), the areas of six geometric objects, see Fig. 1 and a sample (filled) questionnaire in Appendix A. Respondents answered simple questions of the following type: ‘How many times is the area of the triangle greater than the area of the circle?’, and the answer was written into a predefined blank spot.

Fig. 1 The order of compared objects in questionnaires: (a) forms A, B, D, and E; (b) form C

Questionnaires were divided into 5 different forms: A, B, C, D, and E (see the description below), with the same figures of exactly the same size, but with different orders of the presented figures or a different comparison scale. The questionnaires were distributed in a printed (paper) form.

Altogether, 846 respondents from four universities in Czechia and Poland took part in the experiment: Silesian University in Opava (CZ), Tomas Bata University in Zlin (CZ), Rzeszów University of Technology (PL) and Carpathian State College in Krosno (PL).

The respondents consisted of 470 men (55.6%) and 376 women (44.4%). Each respondent filled exactly one questionnaire (one form). The numbers of respondents and their gender are reported in Table 1. The ratio of men and women for each questionnaire was roughly the same to minimize possible gender differences in the perception of the figures. Questionnaires were printed in respondents’ native language, that is Czech or Polish.

Table 1 Forms and respondents’ numbers

The figure with the largest area (Best) was the trapezoid, and the figure with the smallest area (Worst) was the circle. The figures’ areas were generated randomly. Further on, the two orderings of figures shown in Fig. 1a, b and used for the experiment were also generated randomly (as permutations of six objects).

  • Form A: See Fig. 1a. All objects were compared with the Best object, and then with the Worst object, via the scale [1,∞).

  • Form B: See Fig. 1a. All objects were compared with the Worst object, and then with the Best object, via the scale [1,∞).

  • Form C: See Fig. 1b. All objects were compared with the Best object, and then with the Worst object, via the scale [1,∞), but the order of compared objects was different than for the other forms.

  • Form D: See Fig. 1a. All objects were compared with the Best object, and then with the Worst object via Saaty’s scale from 1 to 9 (with reciprocals).

  • Form E: See Fig. 1a. All objects were compared with the Best object, and then with the Worst object via Saaty’s linguistic scale, see Saaty (1977, 1980).

The weights of objects (their relative sizes) from questionnaires A-E were derived via a linearized version (Eqs. (10)–(14)) of the BWM. The experiment results are provided in the next section.

3.2 The research hypotheses

In order to investigate path and scale dependence of the BWM, the following null hypotheses were formulated and tested (the letter µ denotes the mean value operator).

$$ H_{{0{1}}} :\mu \left( {w_{i}^{A} } \right) = \mu \left( {w_{i}^{B} } \right) = \mu \left( {w_{i}^{C} } \right),\;i \in \left\{ {{1}, \ldots ,{6}} \right\}. $$

The hypothesis H01 deals with the path dependency of the Best–Worst method, which, in the experiment’s setting, can be realized in two different ways: firstly, the order of comparisons with the Best and Worst object can be reversed, and, secondly, the order of comparisons of individual pairs can change, see Fig. 1. If H01 holds for all indices i, then the mean values of the weights of all objects are the same for questionnaires A, B and C (which differ in paths, but not in scales), which means the BWM is not path dependent. Otherwise, the null hypothesis is rejected and the BWM is path dependent.

Further on, both cases of ‘path changes’ mentioned above are examined separately; hence, two additional null hypotheses were formulated:

$$ H_{{0{1}a}} :\mu \left( {w_{i}^{A} } \right) = \mu \left( {w_{i}^{B} } \right),\;i \in \left\{ {{1}, \ldots ,{6}} \right\}. $$
$$ H_{{0{1}b}} :\mu \left( {w_{i}^{A} } \right) = \mu \left( {w_{i}^{C} } \right),\;i \in \left\{ {{1}, \ldots ,{6}} \right\}. $$

The hypothesis H01a states that results of the BWM do not depend on the order of comparisons of all objects with respect to the Best and the Worst object, respectively.

The hypothesis H01b states that results of the BWM do not depend on the order in which all objects are mutually compared pairwise.

The next hypothesis H02 deals with the scale dependency of the Best–Worst method:

$$ H_{{0{2}}} :\mu \left( {w_{i}^{A} } \right) = \mu \left( {w_{i}^{D} } \right) = \mu \left( {w_{i}^{E} } \right),\;i \in \left\{ {{1}, \ldots ,{6}} \right\}. $$

If H02 holds for all indices i, then the values of the weights of all objects are the same for questionnaires A, D and E (which differ in scales but have the same path), which means the BWM is scale invariant. Otherwise, the null hypothesis is rejected and the BWM is scale dependent.

As with the hypothesis H01, one particular subcase of H02 is examined as well, namely the difference of the BWM results between the integer Saaty scale and the real scale from 1 to infinity, that is, questionnaires A and D:

$$ H_{{0{2}a}} :\mu \left( {w_{i}^{A} } \right) = \mu \left( {w_{i}^{D} } \right),\;i \in \left\{ {{1}, \ldots ,{6}} \right\}. $$

Finally, the last hypothesis H03 deals with the accuracy of respondents’ comparisons with respect to the three comparison scales (see forms A, D and E), which is both interesting and important from a practical point of view.

Here, the accuracy is estimated via the mean relative error, where the actual relative sizes of all six objects are denoted as \(w^{*} = \left( {w_{1}^{*} , \ldots ,\;w_{6}^{*} } \right)\). Formally, the mean relative error of a respondent j filling the form q is given as follows:

$$ d_{j}^{q} = \frac{1}{6}\sum\limits_{i = 1}^{6} {\frac{{\left| {w_{i,j}^{q} - w_{i}^{*} } \right|}}{{w_{i}^{*} }}} $$
(15)
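Relation (15) is straightforward to evaluate; the following is a minimal sketch in Python, where the function and argument names are illustrative:

```python
import numpy as np


def mean_relative_error(w_estimated, w_actual):
    """Mean relative error d_j^q of one respondent according to Eq. (15).

    w_estimated ... weights derived from the respondent's questionnaire by the BWM
    w_actual    ... actual relative sizes w* of the six compared objects
    """
    w_estimated = np.asarray(w_estimated, dtype=float)
    w_actual = np.asarray(w_actual, dtype=float)
    return float(np.mean(np.abs(w_estimated - w_actual) / w_actual))
```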

The following null hypothesis states that respondents were, on average, equally accurate in their judgments transformed by the BWM into the relative sizes of the compared objects for all three scales.

$$ H_{{0{3}}} :\mu \left( {d^{A}_{j} } \right) = \mu \left( {d^{D}_{j} } \right) = \mu \left( {d^{E}_{j} } \right). $$

3.3 The data

To ensure the data quality, respondents’ responses were assessed and deficient questionnaires were discarded on the following grounds:

  (i) A questionnaire was incomplete.

  (ii) A questionnaire did not conform to the instructions for filling it in (most often, respondents used a wrong scale for comparisons).

  (iii) A questionnaire included outliers.

Outlier identification was performed via SPSS (Tukey’s test) and via Gretl (Mahalanobis distance).

Altogether, approximately 13% of the questionnaires were removed from the dataset.
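The outlier screening itself was carried out in SPSS and Gretl; purely for illustration, analogous checks (Tukey’s fences per variable and the Mahalanobis distance across all six weights) could be sketched in Python as follows, assuming the derived weights are stored in a NumPy array with one row per respondent (the function names are ours):

```python
import numpy as np


def tukey_outliers(x, k=1.5):
    """Flag values of one variable outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


def mahalanobis_distances(data):
    """Mahalanobis distance of each respondent (row) from the sample mean."""
    mean = data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = data - mean
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
```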

3.4 The hypotheses testing—MANOVA

In the first two hypotheses, not one but six dependent variables (the weights corresponding to the areas of the six geometric objects) are assessed at once. Therefore, these hypotheses were tested via one-way multivariate analysis of variance (MANOVA). The third hypothesis included only one dependent variable, hence it was tested by one-way ANOVA.

According to Warne (2014) and Zientek and Thompson (2009), MANOVA is a member of the general linear model class, a family of statistical procedures often used to quantify the strength of relationships between variables. MANOVA extends the capabilities of analysis of variance (ANOVA) by assessing multiple dependent variables simultaneously. This provides several advantages: when the dependent variables are correlated, MANOVA can identify effects that are smaller than those that regular ANOVA can detect. Further on, MANOVA can assess patterns between multiple dependent variables, which ANOVA cannot. Additionally, MANOVA limits the joint error rate: when a series of ANOVA tests is performed, the joint probability of rejecting a true null hypothesis increases with each additional test (thus the Bonferroni or another correction is necessary), whereas in MANOVA the error rate equals the significance level.

Similarly to ANOVA, MANOVA has several assumptions (Anderson et al. 1996; Zientek and Thompson 2009):

  • Observations are randomly and independently sampled from the population.

  • Each dependent variable is measured at the interval or ratio level.

  • An independent variable consists of two or more categorical (independent) groups.

  • Dependent variables are multivariate normally distributed.

  • The population covariance matrices of each group are equal (homogeneity of variance–covariance matrices).

Other sources add the absence of outliers and the absence of multicollinearity of dependent variables as well, see e.g. Barker and Barker (1984), Finch (2005).

MANOVA provides four statistics for hypothesis testing: Pillai’s trace, Wilks’ lambda, Hotelling’s trace and Roy’s greatest root. In the case of two groups (see hypotheses H01a, H01b and H02a), all the statistics are equivalent and the test reduces to Hotelling’s T-square.

It should be noted that although MANOVA is a very useful statistical tool, it also has its limitations. Discussion continues over the merits of each statistic mentioned above, see e.g. Weinfurt (1995), and over the effect of violations of MANOVA’s assumptions on its performance. In particular, according to Finch (2005), MANOVA is robust against departures from multivariate normality, especially when the number of data points is large. The study by Knief and Forstmeier (2021) found that Gaussian models (such as ANOVA) are remarkably robust to non-normality over a wide range of conditions, meaning that p-values remain fairly reliable except for data with influential outliers. Also, it is argued in Allen et al. (2018) and Olson (1976) that MANOVA is robust against violations of the homogeneity of variance–covariance matrices assumption. When the equality of variances is violated, Pillai’s trace is the most suitable statistic for MANOVA, as it is highly robust to many violations of MANOVA’s assumptions, see e.g. Allen et al. (2018), Finch (2005), Olson (1976).
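Although the analyses reported in the next section were carried out in SPSS and Gretl, a one-way MANOVA of this kind can also be reproduced with open-source tools. The following is a minimal sketch using the statsmodels package, assuming a pandas DataFrame with one row per respondent, the six derived weights as columns and the questionnaire form as a categorical column (the column and function names are illustrative):

```python
import pandas as pd
from statsmodels.multivariate.manova import MANOVA


def one_way_manova(df: pd.DataFrame):
    """One-way MANOVA with the questionnaire form as the single factor.

    df is expected to contain the columns Trapezoid, Square, Triangle, Arrow,
    LShape, Circle (dependent variables) and Form (independent variable).
    The returned table reports Pillai's trace, Wilks' lambda, the
    Hotelling-Lawley trace and Roy's greatest root for the Form effect.
    """
    model = MANOVA.from_formula(
        'Trapezoid + Square + Triangle + Arrow + LShape + Circle ~ Form',
        data=df)
    return model.mv_test()
```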

4 Results of the experiment

After the data were gathered from the participants of the experiment, the weights of all six geometric objects corresponding to their relative sizes were calculated by the linear version of the BWM (Eqs. (10)–(14)) for each respondent/questionnaire. Afterwards, the data were checked for outliers, which were removed from the dataset.

4.1 Descriptive statistics

Next, descriptive statistics of all questionnaire forms A–E were computed separately for the sake of comparison. The relative sizes of the compared geometric objects in the form of the weights \(w_i^q\), where i ∈ {1,…,6}, q ∈ {A, B, C, D, E}, are summarized in Table 2. The last row of Table 2 contains the actual relative sizes of all six objects, see also Fig. 2. As can be seen, the objects’ relative sizes derived from questionnaire forms A–E were close to each other (with the exception of the Trapezoid in questionnaire E) and to the actual relative sizes of the objects. Interestingly, respondents of forms A–D underestimated the area of the Trapezoid on average and, simultaneously, overestimated the areas of the Square, Arrow, L-Shape and Circle.

Table 2 Experiment results: the mean weights (relative sizes) of all geometric objects for all questionnaire forms with the standard deviation in parentheses and the number (N) of questionnaires (after the data quality check)
Fig. 2 The relative sizes of all objects with respect to all questionnaire forms (A)–(E)

In many real-world problems, the ranking of compared objects is more important to a decision maker than the precise values of a priority vector. The rankings of all six geometric figures with respect to the questionnaire forms are shown in Table 3. The correct ranking was obtained from forms A, B and D. Form E contained one discordant pair (Arrow–Square), while form C contained two discordant pairs (L–Square and L–Arrow). The difference between the ranking from form A (or B and D) and that from form C can be expressed via Kendall’s rank correlation coefficient as τ(A,C) = 0.733. The other pairwise Kendall rank correlation coefficients were even higher; hence, the rankings obtained from the different questionnaire forms were highly correlated (similar).

Table 3 Experiment results: the ranking of figures’ sizes in descending order for all questionnaire forms
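As a side note, the Kendall rank correlation reported above is easy to verify; a minimal sketch with scipy follows, using illustrative rank vectors (not the actual Table 3 data) in which exactly two of the fifteen object pairs are discordant:

```python
from scipy.stats import kendalltau

# Illustrative ranks of six figures under two orderings differing in two discordant pairs.
ranks_form_a = [1, 2, 3, 4, 5, 6]
ranks_form_c = [1, 2, 3, 5, 6, 4]

tau, p_value = kendalltau(ranks_form_a, ranks_form_c)
print(round(tau, 3))  # 0.733 = (13 concordant - 2 discordant) / 15 pairs
```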

4.2 Path dependence of the BWM

To test the path dependence of the BWM (the null hypotheses H01, H01a and H01b), a data file was prepared containing six dependent variables, the weights of the six geometric objects corresponding to their relative sizes (denoted simply as Trapezoid, Square, Triangle, Arrow, L and Circle), and one independent variable, the questionnaire form (A, B or C).

Before the null hypotheses were tested via SPSS, the assumptions of MANOVA were checked:

  • The data were properly randomly and independently sampled from the population.

  • The dependent variables were measured on a ratio scale.

  • The independent variable consisted of three (two) independent groups.

  • The correlation matrix revealed no significant multicollinearity of the dependent variables (no correlation coefficient exceeded the absolute value of 0.40).

  • Dependent variables (for all questionnaire forms) were tested for normality in Gretl via the Shapiro–Wilk test. The results, in the form of p-values, are shown in Table 4. As can be seen, normality could not be rejected at the 0.001 level for any of the variables.

  • The homogeneity of variance–covariance matrices was tested via Levene’s test in SPSS. This property was violated in the case of the Triangle, Arrow, L-Shape and Circle at the 0.001 level (an illustrative sketch of these two checks is given below).

Table 4 p-values for the null hypothesis that the data follow the normal distribution via Shapiro–Wilk test in Gretl

Though the last property was not satisfied for four objects, MANOVA is robust against violations of the homogeneity of variance–covariance matrices, as pointed out in Sect. 3.4.
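For readers without access to SPSS or Gretl, the two univariate checks mentioned above (the Shapiro–Wilk test of normality and Levene’s test of the equality of variances across forms) could be reproduced, for instance, with scipy; a minimal sketch follows, assuming the weights of a single object are grouped by questionnaire form (the names are illustrative, and this does not cover a full test of the covariance matrices such as Box’s M):

```python
from scipy.stats import shapiro, levene


def check_univariate_assumptions(groups_by_form):
    """groups_by_form: dict mapping a form label (e.g. 'A') to a 1-D array of
    weights of one object (e.g. the Triangle) for respondents of that form."""
    # Shapiro-Wilk test of normality for each form separately
    normality_p = {form: shapiro(values).pvalue
                   for form, values in groups_by_form.items()}
    # Levene's test of the equality of variances across the forms
    levene_p = levene(*groups_by_form.values()).pvalue
    return normality_p, levene_p
```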

Therefore, the hypothesis H01 was tested by MANOVA.

The SPSS MANOVA output is shown in Table 5. According to all four test characteristics (Pillai’s trace, Wilks’ lambda, Hotelling’s trace and Roy’s largest root), the hypothesis H01 was rejected at the 0.001 significance level. Therefore, it can be concluded that the weights corresponding to the relative sizes of the six geometric objects obtained from questionnaires A, B and C differed significantly; hence, the BWM was found to be path dependent.

Table 5 Multivariate tests for H01, SPSS output

Since the hypothesis H01 was rejected, the MANOVA was followed by a post-hoc analysis to determine the sources of the differences behind the rejection of the null hypothesis (Weinfurt 1995). Separate ANOVA tests of between-subjects effects revealed that the most statistically significant differences occurred for the Triangle, Arrow and L-Shape. Consequently, pairwise Fisher’s Least Significant Difference (LSD) tests were performed to determine the statistical significance of differences in the area estimations of the compared objects with respect to the three questionnaire forms. The cases with statistical significance below 0.001 include the Square in forms A–C, the Triangle in forms B–C, the Arrow in forms A–C and B–C, and finally the L-Shape in forms B–C. These are the greatest differences among forms and thereby the main sources of the rejection of the hypothesis H01.

Next, the hypotheses H01a and H01b dealing with different forms of path dependency were tested by MANOVA via SPSS as well.

In the case of the hypothesis H01a (forms A and B), the homogeneity of variance–covariance matrices (tested via Levene’s test) was satisfied for all objects with the only exception of Circle. The SPSS output of the hypothesis test is shown in Table 6. As can be seen, the hypothesis was rejected at 0.001 level. This means that the BWM results depended on the order of comparisons of all objects with the Best and the Worst object, respectively.

Table 6 Multivariate tests for H01a, SPSS output

In the case of the hypothesis H01b (forms A and C), the homogeneity of variance–covariance matrices (tested via Levene’s test) was satisfied for Trapezoid, Square and Circle, and violated for the rest. The SPSS output of the hypothesis test is shown in Table 7. As can be seen, the hypothesis was rejected at 0.001 level. This means that the BWM results depended on the order of mutual pairwise comparisons of all objects.

Table 7 Multivariate tests for H01b, SPSS output

4.3 Scale dependence of the BWM

Originally, the most suitable scale proposed for the BWM was Saaty’s (numerical) scale from 1 to 9 (Rezaei 2015); nevertheless, the author mentioned that other scales can be used as well. Therefore, in this study, Saaty’s linguistic scale (see form E) and a continuous scale (form A) are considered as well and compared with the integer 1–9 scale (form D). A general discussion on the types of scale that can or cannot be used in pairwise comparisons can be found in Koczkodaj et al. (2020), Mazurek (2023). After respondents provided their answers, Saaty’s linguistic scale was transformed (for obvious computational reasons) to the integer nine-point scale per Saaty’s mutual correspondence (‘equal size’ = 1, ‘equally to moderately larger’ = 2, etc.); though this correspondence has been criticized in the past, it is still widely used in practice.

To test the scale dependence of the BWM, that is, the null hypothesis H02, a data file containing six dependent variables, the weights of the six geometric objects corresponding to their relative sizes (Trapezoid, Square, Triangle, Arrow, L and Circle), and one independent variable, the questionnaire form (A, D or E), was prepared in the same way as in the previous section.

Before testing the hypothesis H02 via MANOVA in SPSS, the MANOVA assumptions were checked in the same way as in the previous section for the hypothesis H01. The data satisfied all assumptions with the single exception of the homogeneity of variance–covariance matrices, which was violated for the Triangle.

The result of the test of the hypothesis H02 via MANOVA is reported in Table 8. According to all four test characteristics: Pillai’s trace, Wilks’ Lambda, Hotelling’s Trace and Roy’s Largest Root, the hypothesis was rejected at the 0.001 significance level. Therefore, it can be concluded that the weights corresponding to the relative sizes of six geometric objects obtained from questionnaires A, D and E differed significantly, hence the BWM was found to be scale dependent.

Table 8 Multivariate tests for H02, SPSS output

Post-hoc analysis revealed that the most significant differences among the questionnaire forms occurred for all geometric figures with the exception of the Circle. Fisher’s Least Significant Difference (LSD) pairwise tests found that the relative sizes of the Trapezoid, Square, Triangle, Arrow and L-Shape were all statistically different at the 0.001 level for the form pairs A–E and D–E. Since form E used a linguistic scale, it can be concluded that estimates obtained with this scale differed from those obtained with the numerical scales in forms A and D.

Next, the null hypothesis H02a, dealing with the 1–9 scale (form D) and the [1,∞) scale (form A), was tested via MANOVA as well. All MANOVA assumptions were checked and found to be satisfied, including the homogeneity of variance–covariance matrices for all six objects. The result of MANOVA is reported in Table 9. All four MANOVA statistics showed p = 0.054, which means the hypothesis H02a could not be rejected at the 0.05 level. Therefore, it can be concluded that no statistically significant difference between the use of the two scales was found.

Table 9 Multivariate tests for H02a, SPSS output

4.4 Accuracy of the BWM

In addition to the investigation of the path and scale dependency of the BWM, the accuracy of respondents’ estimations was examined for each form A-E via relation (15). The results are shown in Table 10 and Fig. 3.

Table 10 Average respondents’ mean relative error \(\mu(d_j^q)\) in % for all questionnaire forms
Fig. 3 Mean relative error in estimations of figures’ areas for all questionnaire forms and all figures

Respondents of questionnaire form A were the most precise in their estimations (with a mean relative error of 13.1%), while respondents of form C (with a mean relative error of 17.4%) were the least precise in their judgments. As for the geometric figures, respondents were most accurate in estimating the relative size of the Trapezoid (mean relative error of 9.4%) and least accurate for the Arrow (mean relative error of 19.5%), probably due to its complex shape.

Next, the accuracy of the BWM with respect to the three different scales, continuous (form A), integer (form D) and linguistic (form E), was evaluated.

The null hypothesis H03 was tested via one-way ANOVA, where the independent variable was the form (i.e. scale) and the dependent variable was the mean relative error, see relation (15). The dataset contained only one outlier, which was removed.

Before the testing, the assumptions of ANOVA were checked. The normality of the data could not be rejected at the 0.001 level. Multicollinearity of the data was not detected (correlation coefficients were lower than 0.10 in absolute value). The equality of variances could not be rejected at the 0.01 level.
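The test itself is easy to reproduce outside SPSS; the following is a minimal sketch with scipy, assuming three arrays holding the per-respondent mean relative errors (15) for forms A, D and E (the names are illustrative):

```python
import numpy as np
from scipy.stats import f_oneway


def test_h03(d_a, d_d, d_e, alpha=0.05):
    """One-way ANOVA of the mean relative errors (Eq. (15)) across forms A, D and E.

    Returns the F statistic, the p-value and whether H03 is rejected at level alpha.
    """
    f_stat, p_value = f_oneway(np.asarray(d_a), np.asarray(d_d), np.asarray(d_e))
    return f_stat, p_value, p_value < alpha
```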

The results of the ANOVA are shown in Table 11. The p-value was 3.3·10−5, which means the hypothesis H03 can be rejected at both the 0.05 and the 0.01 levels. Therefore, the comparison scale had a statistically significant impact on the accuracy of comparisons.

Table 11 ANOVA results for the hypothesis H03

The lowest mean relative error in comparisons occurred in the case of form A, that is, the continuous scale from 1 to infinity. This is a rather expected result, since allowing decision makers to use a continuous scale rather than an integer scale may contribute to more accurate judgments in cases when, for example, a decision maker is not sure whether an object M is 2 times or 3 times more important (more preferred, bigger, etc.) than an object N. Then, a value between 2 and 3 can be assigned. The integer 1–9 scale (form D) and the linguistic scale (form E) were found to be almost identically accurate.

5 Discussion

The results of the experiment described in the previous section suggest that the Best–Worst method (BWM) is both path and scale dependent. Reasons behind this outcome might be twofold.

Firstly, one possible cause is human cognitive bias. It is well known that human thinking is susceptible to many systematic errors such as anchoring bias, attentional bias, attribution bias, the framing effect, confirmation bias, recency bias, response bias and many others. Further on, there are many studies on human perception of geometric shapes and their areas, see e.g. Krider et al. (2001) for a review. When comparing areas of geometric objects, the two main factors are shape and size. Researchers found, for example, that triangles are generally perceived as larger than circles and squares (of the same area), and that elongated figures are perceived as larger (Krider et al. 2001). Also, areas of complex shapes were found to be harder to estimate accurately. As can be seen from Fig. 3, in our experiment the area of the most complex shape, the Arrow, was indeed estimated with the greatest error.

Latimer et al. (2000) investigated performance in the perception of simple geometric forms placed at the top and to the right of the visual field rather than top-left, bottom-right or bottom-left, with the result that figures at the top-right were processed faster than others, and this ‘top-right’ bias was statistically significant. Therefore, the placement of the figures to be compared matters. Last but not least, a loss of focus might have occurred among respondents: they might have focused on the first few comparisons more than on the last ones.

Secondly, the path and scale dependence of the BWM might be a consequence of the transformation of respondents’ judgments into objects’ weights via the linear model (Eqs. (10)–(14)).

The design of the conducted experiment does not allow conclusions to be drawn about the cause of the dependence, since its objective was different. However, it is likely that both human bias and the transformation of judgments played a role. It should be noted that if a human cognitive bias is the root of the BWM’s path and scale dependency, then this bias is very likely present also in other pairwise comparison methods of a similar design, such as the analytic hierarchy process (AHP).

It should be noted that the experiment described in the previous sections was based on one criterion only, the area of the figures, while the BWM is generally a multiple-criteria decision aiding method. However, one criterion is sufficient for the investigation of the path and scale dependency of the BWM and enables conclusions to be drawn without the need to tackle the complexity of a multiple-criteria design. Nevertheless, in the multiple-criteria version of the BWM the so-called ‘splitting bias’ (when one criterion is split into two different criteria, a method produces different results) might be present as well, see Hämäläinen and Alaja (2008).

An alternative approach to examining the path and scale dependency of the BWM can be based on numerical simulations, which might be an interesting topic for future research, see e.g. Lahtinen et al. (2020).

6 Conclusions

The aim of this paper was to examine the path and scale dependence of the (linearized) Best–Worst Method (BWM) and its accuracy in pairwise comparisons. For this purpose, an experiment with over 800 respondents was carried out. The respondents’ task consisted of pairwise comparisons of the area (size) of six geometric figures, where the order of pairwise comparisons and the scale for comparisons differed across questionnaire forms.

The results of the experiment showed that the BWM is both path and scale dependent at the 0.001 significance level. Therefore, apart from the BWM’s obvious advantages, a decision maker should be aware that the method, similarly to many other pairwise comparisons methods, also has its limitations (drawbacks).

Additionally, it was found that the most accurate estimations, on average, were obtained via the continuous scale [1,∞), while the answers of respondents who used Saaty’s integer and linguistic nine-point scales were slightly less precise. In addition, the mean relative error of estimations was below 18% for all geometric figures and all questionnaire forms, which can be considered a very favorable outcome demonstrating the strength of the pairwise comparison approach.

The design of the experiment did not allow the cause of the path and scale dependency of the BWM to be determined; hence, it can be a subject of further research. Also, future research can focus on the path and scale dependency of other pairwise comparison methods, such as the analytic hierarchy process (AHP), AHP-Express, or the base criterion method (BCM).