On the quality evaluation of scientific entities in Poland supported by consistency-driven pairwise comparisons method
Abstract
Comparison, rating, and ranking of alternative solutions under multiple criteria have long been a focus of operations research and optimization theory. Numerous approaches exist for solving the multicriteria ranking problem in practice. Recent interest in this domain was prompted by the parametric evaluation of research entities in Poland. The principal methodology was based on pairwise comparisons. For each single comparison, four criteria were used. One of the controversial points of the adopted approach was that the weights of these criteria were arbitrary. The main focus of this study is to put forward a theoretically justified way of extracting weights from the opinions of domain experts. The theoretical basis of the procedure is presented together with a survey and its experimental results. The two resulting sets of weights and the computed inconsistency indicators are compared and discussed.
Keywords
Pairwise comparisons, Inconsistency analysis, Expert opinion, Academic entity quality, Performance evaluation
Introduction and problem statement
The question of how to measure the performance of scientific entities is one of the most basic in the scientific community. The answer to this question is primarily related to: ’how should research funds be distributed among different research units?’ (addressed in Wang et al. 2011; Geuna and Martin 2003), or ’what should be the policy of the state in the promotion of science?’ (see Geuna et al. 1999) to name a few. Due to the many factors that can affect the final assessment, the problem of finding clear and widely acceptable performance indicators is not easy. Numerous legal environments and various scientific practices in different countries add to the problem complexity.
Table 1 Comparison criteria for scientific entities
Code  Criterion name 

\(c_{1}\)  Scientific and/or creative achievements 
\(c_{2}\)  Scientific potentiality 
\(c_{3}\)  Tangible benefits of the scientific activity 
\(c_{4}\)  Intangible benefits of the scientific activity 
Each criterion is subdivided into many subcriteria that depend on the type of scientific entity. The ranking process seems easy, yet it is not. One important problem is to determine the significance of the criteria \(c_{1},\ldots,c_{4}\). Relating \(c_{i}\) to \(c_{j}\) is both subjective and difficult because of the intangible and abstract nature of the criteria themselves. In the adopted algorithm (“Research entities evaluation: the official procedure” Section), the importance of the criteria must be expressed as real numbers. One way of transforming subjective judgments into numerical values is the pairwise comparisons (PC) method. Therefore, to improve the algorithm proposed by the Ministry of Science and Higher Education (“Research entities evaluation: the official procedure” Section), the authors propose one additional step, in which the weights for the criteria \(c_{1},\ldots,c_{4}\) are explicitly estimated by experts. The experiment conducted by the authors (“An experimental survey procedure” Section), a survey among scientists, provides a sample of what the weights of the criteria \(c_{1},\ldots,c_{4}\) might look like when determined by the PC method.
Preliminaries of the PC method
As indicated in “Introduction and problem statement” Section, the evaluation of research units and induction of the final linear ordering is based on four different criteria. These predefined criteria are shown in Table 1. Precise interpretation of these criteria is provided by the Ministry of Science and Higher Education (2012); here the intuitive understanding of them is sufficient.
Note that, in view of Table 1, quality evaluation of research units is not only a Multicriteria Decision Problem (or, more precisely, a Multicriteria Ranking Problem), but all the criteria are in fact of a qualitative nature. Hence, the first step in the procedure consists of defining the transformation of nonmeasurable characteristics into single numbers. This is done for each criterion of each research unit. For example, calculating the value of criterion \(c_{1}\) consists in summing up the points assigned to a list of publications from the last four years by the research workers of the unit. The procedure is presented in detail in Ministry of Science and Higher Education (2012). Now, the problem can be approached by pairwise comparisons.
Let us briefly review the roots and ideas of the approach. It is believed that Condorcet was the first researcher to use pairwise comparisons, for improving voting results, in Condorcet (1785). However, it was Fechner who described the PC method in 1860 (reprinted in Fechner 1966), although only from the psychometric perspective. Thurstone not only described the PC method in Thurstone (1994), but also proposed, for the first time, a solution based on statistical analysis. In his seminal work, Saaty (1977) introduced a hierarchy, which is instrumental for practical applications, and an eigenvalue-based inconsistency.
Regretfully, the proposal of Saaty constitutes only a global inconsistency indicator and, as such, cannot localize the most inconsistent elements of the matrix. The first ever localizing inconsistency definition was proposed in Koczkodaj (1993). Both inconsistencies were recently analyzed in Bozóki and Rapcsak (2008).
There are several different ways of deriving weights in the pairwise comparisons method (Crawford 1987; Kułakowski 2013). For the purpose of this paper, the authors adopted the geometric means method, probably the second most popular one. The Monte Carlo study presented in Herman and Koczkodaj (1996) provided evidence that, for small inconsistencies, the geometric means solution (used in this study) and the eigenvector solution (as proposed by Saaty in 1977) are similar enough from the statistical point of view. In fact, the eigenvector and geometric means solutions are identical for fully consistent matrices, and the geometric means solution is slightly better (approx. 7 out of 10 wins) than the principal eigenvector solution.
The procedure usually begins (after an appropriate feasibility study and data gathering, which are not addressed here) with a listing of all possible criteria. In our case, the four criteria mentioned in “Preliminaries of the PC method” Section are used.
We say that \(M=[m_{ij}]\in R^{n\times n}\) is a pairwise comparisons (PC) matrix if \(m_{ij}>0\) for all \(i,j=1,\ldots,n\). A PC matrix M describing the relationship between n given alternative items is called reciprocal if \(m_{ij}=\frac{1}{m_{ji}}\) for every \(i,j=1,\ldots,n\) (then automatically \(m_{ii}=1\) for every \(i=1,\ldots,n\)). A PC matrix M is called consistent (or transitive) if \(m_{ij}\cdot m_{jk}=m_{ik}\) for every \(i,j,k=1,\ldots,n\). Note that while every consistent matrix is reciprocal, the converse is false in general. Consistent matrices correspond to the ideal situation in which there are exact values \(s_{1},\ldots,s_{n}\) for the compared entities. The elements of matrix M defined as quotients \(m_{ij}=s_{i}/s_{j}\) form a consistent matrix. The vector \(\mathbf{s}=[s_{1},\ldots,s_{n}]\) is unique up to a multiplicative constant.
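The definitions of reciprocity and consistency can be checked mechanically. The following minimal Python sketch is an illustration only: the function names, the tolerance and the two sample matrices are assumptions introduced here and are not part of the original study.

```python
import itertools

def is_reciprocal(m, tol=1e-9):
    """True if m[i][j] * m[j][i] == 1 for every pair (i, j)."""
    n = len(m)
    return all(abs(m[i][j] * m[j][i] - 1.0) <= tol
               for i in range(n) for j in range(n))

def is_consistent(m, tol=1e-9):
    """True if m[i][j] * m[j][k] == m[i][k] for every triple (i, j, k)."""
    n = len(m)
    return all(abs(m[i][j] * m[j][k] - m[i][k]) <= tol
               for i, j, k in itertools.product(range(n), repeat=3))

def from_scores(s):
    """Build the consistent matrix of quotients m_ij = s_i / s_j."""
    return [[si / sj for sj in s] for si in s]

# Hypothetical data: a matrix generated by exact scores is consistent ...
consistent = from_scores([2.0, 1.0, 4.0])
# ... while this reciprocal matrix is not, since 2 * 2 != 2.
reciprocal_only = [[1.0, 2.0, 2.0],
                   [0.5, 1.0, 2.0],
                   [0.5, 0.5, 1.0]]
```

The second matrix illustrates the remark that reciprocity does not imply consistency.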
In the formulas above, \(S_{i}\) represents the rank of the ith alternative before normalization, and \(\sigma^{-1}\) is the normalization coefficient chosen so that the values \(s_{i}=\sigma^{-1}S_{i}\), for \(i=1,\ldots,n\), sum up to one.
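A minimal Python sketch of the geometric means method may help to fix ideas: each unnormalized rank is taken as the geometric mean of a row, and the ranks are then rescaled to sum up to one. The function name and the sample matrix are assumptions made for illustration.

```python
import math

def geometric_mean_weights(m):
    """Normalized geometric means of the rows of a PC matrix."""
    n = len(m)
    # Unnormalized rank of each alternative: geometric mean of its row.
    raw = [math.prod(row) ** (1.0 / n) for row in m]
    sigma = sum(raw)  # normalization constant
    return [r / sigma for r in raw]

# Hypothetical consistent matrix generated by the scores (4, 2, 1).
m = [[1.0, 2.0, 4.0],
     [0.5, 1.0, 2.0],
     [0.25, 0.5, 1.0]]
weights = geometric_mean_weights(m)  # close to [4/7, 2/7, 1/7]
```

For a consistent matrix the method recovers the generating scores up to normalization, which agrees with the uniqueness remark made earlier.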
The final assessment \(\mathbf{s}\) is formed as the normalized geometric means of the rows of \(\widehat{M}\) and is \(\mathbf{s}=[0.201, 0.327, 0.184, 0.288]^{T}\). Hence, the winner is the second scientist with the rank 0.327, followed by scientists number four, one and three, respectively.
Data inconsistency and how to deal with it
Observe that the matrix \(M^{\star}\) given by (5) is not consistent. For example, \(m_{1,2}^{\star}\cdot m_{2,3}^{\star}\neq m_{1,3}^{\star}\), since \(0.4\cdot1.1=0.44\neq0.5\). The question arises: what can one do about that?
where \(\lambda_{max}\) is the principal eigenvalue of M. It is commonly assumed that the matrix M is sufficiently consistent if \(Ic(M)\leq0.1\) (Saaty 1977). In such a case, the results calculated using, e.g., the geometric means method are considered to be reliable.
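As an illustration, the eigenvalue-based index can be sketched in Python. The sketch assumes the standard form of Saaty's consistency index, \(Ic(M)=(\lambda_{max}-n)/(n-1)\), and estimates \(\lambda_{max}\) by plain power iteration; the function names and the sample matrix are hypothetical.

```python
def principal_eigenvalue(m, iterations=1000, tol=1e-12):
    """Estimate the principal eigenvalue of a positive matrix by power iteration."""
    n = len(m)
    v = [1.0 / n] * n  # start vector, normalized to sum to one
    lam = 0.0
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        new_lam = sum(w)  # because sum(v) == 1, sum(M v) estimates lambda
        v = [x / new_lam for x in w]
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam

def saaty_index(m):
    """Assumed standard form: Ic(M) = (lambda_max - n) / (n - 1)."""
    n = len(m)
    return (principal_eigenvalue(m) - n) / (n - 1)

# Hypothetical reciprocal but inconsistent matrix (2 * 2 != 2).
sample = [[1.0, 2.0, 2.0],
          [0.5, 1.0, 2.0],
          [0.5, 0.5, 1.0]]
ic = saaty_index(sample)
```

For this sample the index stays below the 0.1 threshold quoted above, so the matrix would be treated as sufficiently consistent.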
where \(i,j,k=1,\ldots,n\) and \(i\neq j\wedge j\neq k\wedge i\neq k\). For sufficiently consistent matrices, it should not be too high.
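The triad-based index can be sketched as follows. The code assumes the usual form of Koczkodaj's index, namely the maximum over all triads of \(\min\{|1-m_{ik}/(m_{ij}m_{jk})|,\,|1-m_{ij}m_{jk}/m_{ik}|\}\); the function name and the sample matrix are hypothetical.

```python
import itertools

def koczkodaj_index(m):
    """Maximum triad inconsistency over all triples of distinct indices."""
    n = len(m)
    worst = 0.0
    for i, j, k in itertools.permutations(range(n), 3):
        direct = m[i][k]
        indirect = m[i][j] * m[j][k]
        triad = min(abs(1.0 - direct / indirect),
                    abs(1.0 - indirect / direct))
        worst = max(worst, triad)
    return worst

# Hypothetical reciprocal but inconsistent matrix (2 * 2 != 2).
sample = [[1.0, 2.0, 2.0],
          [0.5, 1.0, 2.0],
          [0.5, 0.5, 1.0]]
k_index = koczkodaj_index(sample)  # 0.5 for this matrix
```

Because the maximum is attained on a specific triad, the same loop can also report which triad is the most inconsistent one, which is the localizing property mentioned earlier.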
When a matrix M is inconsistent (especially when the inconsistency is high), we must compute a consistent \(n\times n\) PC matrix C which differs from the matrix M “as little as possible”. This is a relatively simple and natural way of dealing with the problem. Note that the approximation is really reduced to a problem of norm selection and distance minimization. For the Euclidean norm, the vector of geometric means (equal to the principal eigenvector for the transitive matrix) is the one which generates it.
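A short Python sketch of such an approximation is given below. It assumes that the distance is the Euclidean norm applied to the logarithms of the matrix entries (the logarithmic least squares setting of Crawford 1987), in which the geometric means vector is known to be the minimizer. All names and data are illustrative assumptions.

```python
import math

def geometric_mean_weights(m):
    """Normalized geometric means of the rows of a PC matrix."""
    n = len(m)
    raw = [math.prod(row) ** (1.0 / n) for row in m]
    total = sum(raw)
    return [r / total for r in raw]

def matrix_from_weights(w):
    """Consistent matrix generated by a weight vector: c_ij = w_i / w_j."""
    return [[wi / wj for wj in w] for wi in w]

def log_distance(a, b):
    """Euclidean distance between the element-wise logarithms of a and b."""
    n = len(a)
    return math.sqrt(sum((math.log(a[i][j]) - math.log(b[i][j])) ** 2
                         for i in range(n) for j in range(n)))

# Hypothetical inconsistent matrix and its consistent approximation.
m = [[1.0, 2.0, 2.0],
     [0.5, 1.0, 2.0],
     [0.5, 0.5, 1.0]]
c = matrix_from_weights(geometric_mean_weights(m))

# Any other weight vector should generate a matrix that is no closer to m.
other = matrix_from_weights([0.5, 0.3, 0.2])
```

Comparing `log_distance(m, c)` with `log_distance(m, other)` gives a quick numerical check of the minimizing property for one alternative candidate; it is not a proof.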
Many approximation solutions have been proposed in the past, starting with Jensen (1984). More recently, Bozóki et al. (2010) and others (Anholcer et al. 2010; Grzybowski 2012) proposed a practical optimization. No study has ever provided an analytic proof of the substantial superiority of any approximation method over another. Strong statistical evidence (based on 1,000,000 randomly generated matrices) suggests that both solutions (geometric means and the principal eigenvector) are reasonable and do not differ much for “not-so-inconsistent” (NSI) matrices, as demonstrated in Herman and Koczkodaj (1996).
A further investigation of the selection of the norm (or distance) is beyond the scope of this study. In fact, it may require many years of research before any conclusions can be drawn, and pairwise comparisons may well be helpful in it. Unfortunately, not much can be analytically proven for nontransitive matrices. In data processing, this is well expressed by the popular computer concept GIGO (Garbage In, Garbage Out). GIGO summarizes what has been known for a long time: getting good results from “dirty data” is unrealistic and certainly cannot be guaranteed.
Research entities evaluation: the official procedure
The evaluation procedure officially adopted in Poland for the assessment of research units consists of six steps. Some of them are more or less informal and based largely on the work of experts, whilst the others are precisely defined with extensive use of mathematical formulas. In particular, the final results of the algorithm depend strongly on the subjectively defined weights \(W_{1},\ldots,W_{4}\) describing the importance of each of the criteria \(c_{1},\ldots,c_{4}\), as presented in “Preliminaries of the PC method” Section.
Note that due to diversification of the research activities in different areas of science, all the units are divided into relatively small groups of similar entities (e.g. Faculties of Electrical Engineering). Hence, all the 963 units were divided into similarity groups (GWO) of a limited number of units (for example, around 50 in a typical GWO). The procedure was performed independently for each group.
Assessment procedure
 1.
At the beginning experts proposed weights \(W_{1},\ldots,W_{4}\) for each group of mutually comparable entities.
 2.
Then, each scientific entity X is assigned numerical values with respect to the four criteria defined in Table 1. As a result, a vector of four values \(O_{1}(X),\ldots,O_{4}(X)\), defining how good unit X is with respect to \(c_{1},\ldots,c_{4}\), is prepared.
 3.
The experts proposed two artificial entities \(A_{1}\) and \(A_{2}\), which are used as reference units in order to assign every real research unit an appropriate funding level. \(A_{1}\) and \(A_{2}\) become part of the ranked group.
 4.
All the entities are mutually compared within their GWO of comparable entities with respect to all four criteria (Table 1). The result of a single comparison of \(X,Y\in U\), where U is the GWO for X and Y, with respect to the ith criterion is given as:
$$ P_{i}(X,Y)=sgn(O_{i}(X)-O_{i}(Y))\cdot \left\{ \begin{array}{ll} 0 & \hbox {if} \quad \Delta O<D\\ \frac{\Delta O-D}{G-D} & \hbox {if} \quad D\leq\Delta O<G\\ 1 & \hbox {if} \quad G\leq\Delta O\\ \end{array}\right. $$(9)
where
$$ \Delta O=\left|O_{i}(X)-O_{i}(Y)\right| $$(10)
$$ D=\max\left\{ \frac{\min\left\{ O_{i}(X),O_{i}(Y)\right\} }{10},\frac{\sum_{Z\in U}O_{i}(Z)}{10\cdot card(U)}\right\} $$(11)
$$ G=\max\left\{ \frac{3\cdot\min\left\{ O_{i}(X),O_{i}(Y)\right\} }{10},3\cdot D\right\} $$(12)
 5.
In the ranking procedure currently adopted by the Ministry of Science and Higher Education (2012), the value V(X, Y) is computed according to the formula:
$$ V(X,Y)=\underset{i=1,\ldots,4}{\sum}W_{i}P_{i}(X,Y) $$(13)
where V(X, Y) is the total comparison score of the scientific unit X versus Y, \(W_{i}\) is the rank (importance) of the ith criterion, and \(P_{i}(X,Y)\) is the result of the pairwise comparisons between X and Y with respect to the ith criterion.
 6.
The final rank of the scientific entity \(X\in U\) is computed as:
$$ R(X)=\frac{1}{card(U)-1}\left(\underset{Y\in U\backslash\{X\}}{\sum}V(X,Y)\right) $$(14)
where \(U=\{X_{1},\ldots,X_{card(U)}\}\) is the set of the scientific and the two reference (artificial) units to be assessed.
Numerical example
Table 2 Group of four mutually comparable scientific entities
Id.  Entity name  \(O_{1}\)  \(O_{2}\)  \(O_{3}\)  \(O_{4}\) 

1  \(X_{1}\)  51.97  525  12.64  84.5 
2  \(X_{2}\)  84.07  127  1.02  20 
3  \(X_{3}\)  41.11  583  4.22  88 
4  \(X_{4}\)  33.79  246  7.21  60 
5  \(A_{1}\)  39.07  455.40  12.12  59.76 
6  \(A_{2}\)  18.99  221.37  5.89  29.05 
The weights originally proposed for this comparison group in the official algorithm are \(W_{1}=0.65\), \(W_{2}=0.1\), \(W_{3}=0.15\) and \(W_{4}=0.1\).
Repeating the procedure for every pair yields \(V(X_{1},X_{2})=-0.3\), \(V(X_{1},X_{3})=0.607\), \(V(X_{1},X_{4})=1\), \(V(X_{1},A_{1})=0.736\) and \(V(X_{1},A_{2})=1\). Thus, the final score for \(X_{1}\) is \(R(X_{1})=\frac{1}{5}(-0.3+0.607+1+0.736+1)=0.609\). After R has been calculated for all other entities, the algorithm stops. The obtained ranks are as follows: \(R(X_{1})=0.609\), \(R(X_{2})=0.318\), \(R(A_{1})=0.046\), \(R(X_{3})=0.028\), \(R(X_{4})=-0.21\) and \(R(A_{2})=-0.791\).
The final results of the evaluation procedure are given by the following linear ordering: \(X_{1}\), \(X_{2}\), \(A_{1}\), \(X_{3}\), \(X_{4}\), \(A_{2}\). Since there are two reference units, apart from the linear ordering the units are assigned to three categories: A for the leading ones (here: \(X_{1}\), \(X_{2}\)), B for the medium class (here: \(X_{3}\), \(X_{4}\)), and C for the ones which must improve (here: empty).
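The numerical example can be reproduced with a short Python sketch of steps 4 to 6 of the official procedure. The data and weights are taken from the example above, the minus signs follow the definition of the absolute difference between two criterion values, and the function and variable names are assumptions introduced only for this illustration.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def pairwise_score(ox, oy, group_values):
    """P_i(X, Y) for one criterion, following Eqs. (9)-(12)."""
    delta = abs(ox - oy)
    d = max(min(ox, oy) / 10, sum(group_values) / (10 * len(group_values)))
    g = max(3 * min(ox, oy) / 10, 3 * d)
    if delta < d:
        magnitude = 0.0          # difference too small to count
    elif delta < g:
        magnitude = (delta - d) / (g - d)
    else:
        magnitude = 1.0          # clear advantage
    return sign(ox - oy) * magnitude

def total_score(x, y, units, weights):
    """V(X, Y): weighted sum of the per-criterion results, Eq. (13)."""
    return sum(
        w * pairwise_score(units[x][i], units[y][i],
                           [values[i] for values in units.values()])
        for i, w in enumerate(weights))

def final_rank(x, units, weights):
    """R(X): mean of V(X, Y) over all other units in the group, Eq. (14)."""
    others = [y for y in units if y != x]
    return sum(total_score(x, y, units, weights) for y in others) / (len(units) - 1)

# Values O_1..O_4 of the four real units and the two reference units.
UNITS = {
    "X1": [51.97, 525.0, 12.64, 84.5],
    "X2": [84.07, 127.0, 1.02, 20.0],
    "X3": [41.11, 583.0, 4.22, 88.0],
    "X4": [33.79, 246.0, 7.21, 60.0],
    "A1": [39.07, 455.40, 12.12, 59.76],
    "A2": [18.99, 221.37, 5.89, 29.05],
}
WEIGHTS = [0.65, 0.1, 0.15, 0.1]

v12 = total_score("X1", "X2", UNITS, WEIGHTS)
r1 = final_rank("X1", UNITS, WEIGHTS)
```

Replacing `WEIGHTS` by another weight vector shows directly how strongly the final ranks depend on the choice of \(W_{1},\ldots,W_{4}\).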
The need for a better method for the weight selection
Needless to say, the weights \(W_{1},\ldots,W_{4}\) (see step 5 of the procedure) have a significant influence on the final results of the ranking process. Their values determine what kind of achievements (and to what extent) are preferred. Hence, the choice of these weights determines the required policy of the development of scientific entities in Poland (the rank position translates into an appropriate funding level).
Recall that these weights were defined by experts in an arbitrary way. Given the significance of their values, we propose to replace this selection procedure by computing the values of the weighting coefficients from their pairwise comparisons.
There are several reasons to consider this approach acceptable. One of them is the intangibility of the achievements assigned to each of the evaluation criteria. Tangible things can be easily measured with reference to some specific unit. Thus, the measure determines the levels of desired features. Intangible factors can be compared in pairs (Saaty 2013) without an a priori measurement. In the domain literature, there is considerable evidence that the pairwise comparisons method works when intangible objects need to be compared (Subramanian and Ramanathan 2012).
The algorithm criteria listed in Table 1 reflect intangible achievements. Experts need a method that would allow them to assess all the objects. The pairwise comparisons method simplifies this by reducing the compared objects to only two at a time.
Note that the weight tuning mechanism is also used in AHP (Analytic Hierarchy Process), another decision making scheme based on comparing objects in pairs (Saaty 1977). In the context of the AHP method, such weights are often referred to as the criteria with respect to the goal evaluation. Of course, AHP uses the weights in a slightly different way. However, the regularity that the higher the weight of a criterion, the greater its impact on the final result, is preserved.
Selection of such important factors as weights in the scientific entity evaluation procedure should be based on transparent, well justified mechanisms. The results should gain a widespread acceptance among the members of the evaluated units. Here again, the pairwise comparisons method can be helpful.
As shown in the experiment (see “An experimental survey procedure” Section), any number of experts from different research centers may take part in the evaluation process. The pairwise comparisons method also addresses the inconsistency problem. It allows the experts to measure (Saaty 1977), to localize (Koczkodaj 1993) and to reduce (Koczkodaj and Szarek 2010) the inconsistency of the results of comparisons in pairs.
An experimental survey procedure
As with any anticipated change, the proposed CERU ^{5} conceptual model of the assessment process was vigorously debated in the scientific community (see Kistryn 2013). In particular, the weights \(W_{1},\ldots,W_{4}\) corresponding to the importance of the criteria \(c_{1},\ldots,c_{4}\) (Table 1) were the subject of debate and criticism since they were established in an arbitrary way.

the PC method is the core of the experiment; so the values are better justified,

any arbitrarily large number of experts can express their preferences,

the expert judgment inconsistency should be evaluated and kept at the lowest possible level.
Table 3 Comparison scale: the values 1, 2 and 3 are assigned to the appropriate definitions of intensity or importance. Intermediate judgments are also possible, e.g. the value 1.4 corresponds to the situation in which one criterion is slightly more preferred than the other
Value  Definition of intensity or importance  Explanation 

1  Equal importance  Two criteria equally contribute to the objective 
2  Essential or strong importance  Experience and judgments favor one criterion over another 
3  Absolute importance  The highest affirmation degree of favoring one criterion over another 
1.4  Intermediate judgments  An expert prefers slightly one criterion over another 
The choice of scale is also a challenging problem. It has been extensively discussed in the literature (Fülöp et al. 2010; Dong et al. 2008; Ji and Jiang 2003; Salo and Hämäläinen 1997; Triantaphyllou et al. 1994). There is no “one fits all” scale, although some studies argue that a certain scale should give more reliable results than another. In Fülöp et al. (2010), a small scale from 1 to 3 (Table 3) is shown to have the best mathematical properties (related to convexity) for the PC method. It was adopted by the authors for this study. After all, practically all modern languages have only three levels of gradation in their grammar (e.g., good > better > the best).
Despite the scale recommendation, the Internet survey application allowed respondents to set any value of the ratio \(m_{ij}\) between \(\frac{1}{99}\) and 99. Suggesting the scale while allowing an almost free choice of ratio is an attempt to find a compromise between the desire to give an intuitive interpretation to some numerical values (the scale) and allowing the experts the greatest possible precision in expressing their beliefs. Moreover, thanks to the scale, all the experts share the same correspondence between the numerical values and the intuitive descriptions of importance. This helps to minimize the risk of a situation in which two experts sharing the belief that \(c_{i}\) is absolutely more important than \(c_{j}\) assign two essentially different (although greater than one) values of \(m_{ij}\). The scale also allows all such cases to be identified and the identified outliers to be excluded from the ranking. The candidates for outliers are experts whose answers are significantly off the scale. Usually their responses also have a large inconsistency.
The meaning of the adopted scale is quite intuitive. For example, if an expert sets \(W_{i}/W_{j}\) to 1, this means that the criteria i and j are of equal importance. On the other hand, if, for instance, \(2<W_{i}/W_{j}<3\), then, according to the adopted textual interpretation (Table 3), the ith criterion was recognized as essentially more important than the jth one.
In the ideal case, the equality \({W_{i}/W_{j}}\cdot{W_{j}/W_{k}}={W_{i}/W_{k}}\) should always hold. However, because each of the three ratios is determined independently, in practice this is often not the case. Hence, very often there are some triads of ratios which do not meet this equality. This situation is related to the problem of data inconsistency in the PC matrix, which is discussed more thoroughly in “Data inconsistency and how to deal with it” Section.
Survey results
Survey data
The survey involved 37 researchers from 17 Polish and foreign scientific institutions engaged in research in the field of technical and engineering sciences. Most of them are tenured faculty members at Universities in Poland, USA, Canada, and Australia although some of them declared employment in research institutes. The vast majority of respondents declared the position of a full professor or equivalent.^{6} A few persons held the prestigious title of distinguished professor.
To synthesize the final results, the authors used almost all the gathered matrices \(M_{r}\). The only exceptions were five result sets with a very high inconsistency index \(\mathcal{K}(M_{r})\) (over 0.836) and an inconsistency index \(Ic(M_{r})\) higher than 0.1. Although all the rejected cases differ in detail, most of the rejected respondents indicated a very significant importance of the first criterion (scientific and/or creative achievements) over other arbitrarily chosen criteria. Unfortunately, due to the large inconsistency (in the literature, \(Ic(M_{r})\) higher than 0.1 is considered unacceptable; Saaty 2005), their opinions have not been taken into account^{7} in the synthesized matrix \(\widehat{M}\).
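The synthesis formula itself is not reproduced in this section. As a hedged illustration, the following Python sketch assumes that the accepted matrices are combined by the element-wise geometric mean with identical importance for every expert, which is the aggregation justified in Aczél and Saaty (1983). The function name and the two sample matrices are hypothetical.

```python
import math

def aggregate(matrices):
    """Element-wise geometric mean of PC matrices of equal size (assumed rule)."""
    n = len(matrices[0])
    r = len(matrices)
    return [[math.prod(m[i][j] for m in matrices) ** (1.0 / r)
             for j in range(n)]
            for i in range(n)]

# Two hypothetical expert matrices for two criteria.
expert_a = [[1.0, 2.0],
            [0.5, 1.0]]
expert_b = [[1.0, 8.0],
            [0.125, 1.0]]
combined = aggregate([expert_a, expert_b])  # close to [[1, 4], [0.25, 1]]
```

An attractive property of this rule is that the aggregate of reciprocal matrices is again reciprocal, so the combined matrix can be processed in the same way as any individual one.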
Results: different perspectives
Monte Carlo discrepancy validation
As a validation method for the survey data, the authors adopt a ten times repeated two-fold cross-validation procedure (Kohavi 1995). In every repetition, the survey sample is randomly split into two disjoint sets \(S_{1}=\{M_{1},\ldots,M_{16}\}\) and \(S_{2}=\{M_{17},\ldots,M_{32}\}. \) Both groups are used to synthesize matrices \(\widehat{M}_{1}\) and \(\widehat{M}_{2}\), and then two ranking vectors \(a=\left[a_{1},\ldots,a_{4}\right]^{T}\) and \(b=\left[b_{1},\ldots,b_{4}\right]^{T}\) are computed. The vector a is called the reference rank vector, whilst b is called the validation rank vector. For each pair of vectors a and b, the discrepancy vector \(d=\left[\left|a_{1}-b_{1}\right|,\ldots,\left|a_{4}-b_{4}\right|\right]^{T}\) is computed.
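The validation loop can be sketched in Python as follows. The sketch assumes element-wise geometric mean aggregation and geometric means ranking, and it runs on synthetic matrices generated from perturbed hypothetical scores, not on the survey data; all names, the seed and the perturbation are assumptions.

```python
import math
import random

def aggregate(matrices):
    """Element-wise geometric mean of PC matrices (assumed synthesis rule)."""
    n, r = len(matrices[0]), len(matrices)
    return [[math.prod(m[i][j] for m in matrices) ** (1.0 / r)
             for j in range(n)] for i in range(n)]

def geometric_mean_weights(m):
    """Normalized geometric means of the rows of a PC matrix."""
    n = len(m)
    raw = [math.prod(row) ** (1.0 / n) for row in m]
    total = sum(raw)
    return [x / total for x in raw]

def discrepancy_runs(matrices, repetitions=10, seed=0):
    """Repeated random two-fold split; returns (a, b, d) for each repetition."""
    rng = random.Random(seed)
    runs = []
    for _ in range(repetitions):
        shuffled = matrices[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        a = geometric_mean_weights(aggregate(shuffled[:half]))  # reference rank
        b = geometric_mean_weights(aggregate(shuffled[half:]))  # validation rank
        d = [abs(x - y) for x, y in zip(a, b)]                  # discrepancy
        runs.append((a, b, d))
    return runs

def average(vectors):
    """Component-wise mean of a list of equally long vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Synthetic stand-in for 32 expert matrices: perturbed hypothetical scores.
base = [0.4, 0.2, 0.25, 0.15]
experts = []
for r in range(32):
    s = [v * (1.0 + 0.1 * math.sin(r + i)) for i, v in enumerate(base)]
    experts.append([[si / sj for sj in s] for si in s])

runs = discrepancy_runs(experts)
a_avg = average([a for a, _, _ in runs])
d_avg = average([d for _, _, d in runs])
```

The averaged vectors `a_avg` and `d_avg` play the role of the average reference rank and the average discrepancy discussed in the text.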
It is easy to see that the obtained rank result is (on average) similar to the overall result of the survey (Eq. 18). In particular, both vectors \(a^{\rm avg}\) (Eq. 23) and ω (Eq. 18) propose the same order of criteria importance. Their individual numerical values are also close to each other. The average absolute differences between individual values in the vectors \(a^{(i)}\) and \(b^{(i)}\) seem to be reasonably small, since they are almost an order of magnitude less than the values in \(a^{\rm avg}\). They suggest that, regardless of the selection of the group, criterion \(c_{1}\) should be the most important one, since \(a_{1}^{\rm avg}-d_{1}^{\rm avg}>a_{i}^{\rm avg}+d_{i}^{\rm avg}\) for every other criterion \(c_{i}\). Unfortunately, there is no similar guarantee in the case of any other criterion. The values \(a_{1}^{{\rm avg}}\pm d_{1}^{{\rm avg}},\ldots,a_{4}^{{\rm avg}}\pm d_{4}^{{\rm avg}}\) indicate the discrepancy intervals in which the weights of criteria \(c_{1},\ldots,c_{4}\) established by a competing team of experts are expected to be found.
The adopted Monte Carlo discrepancy validation procedure tries to model a realistic situation in which one group of experts provides one rank, whilst the other group (disjoint with the first one) creates another rank. Each group calls into question the results of its opponent. As demonstrated by the tests carried out, when both groups are composed of experts with a similar scientific background, the discrepancies might not be too high.
Further research on the inconsistency of the synthesized PC matrix \(\widehat{M}\) also seems to be interesting. In particular, the relationship between the values of the inconsistency indices \(Ic(\widehat{M})\) and \(\mathcal{K}(\widehat{M})\) and the deviations of the individual expert judgements in matrices \(M_{1},\ldots,M_{r}\) needs a better explanation.
Discussion
The survey concerned the basic scientific units at universities in the field of technical and engineering sciences. Thus, the gathered results do not apply to social sciences or the arts. The surveyed researchers made six comparisons between the four criteria \(c_{1},\ldots,c_{4}\) (Table 1). They could almost freely choose between ratios from \(\frac{1}{99}\) to 99, thus indicating which criterion is more important (and by how much). However, a small scale was recommended following the theory proved in Fülöp et al. (2010).
Comparing the survey results (Fig. 1) with the weights adopted in the official government regulation (Ministry of Science and Higher Education 2012), namely \(c_{1}\): 0.65, \(c_{2}\): 0.1, \(c_{3}\): 0.15, and \(c_{4}\): 0.1, it should be noted that they differ in the intensity of preferences, although they tend to be similar with regard to the order of preferences. In both rankings, the criterion designated as most important is \(c_{1}\) and the second most important criterion is \(c_{3}\). However, the weight of \(c_{1}\) resulting from the survey is almost two times smaller than the one assumed in the regulation. On the other hand, the weight of \(c_{2}\) obtained from the survey is a bit higher than the one adopted in the official document. According to the survey, the criterion \(c_{2}\) is slightly less important than \(c_{3}\) but more important than \(c_{4}\), whilst the regulation assumes that the weights of \(c_{2}\) and \(c_{4}\) are the same. In both these cases the weights obtained from the survey are higher than the ones adopted in the regulation.
The regulation retains the dominant criterion \(c_{1}\), whilst the other criteria are less important. In fact, it is enough for the scientific entity to be strong in \(c_{1}\) to avoid having to worry about the other criteria. The survey participants were in favor of a more balanced model in which \(c_{1}\) is still the most important criterion, but is not predominant. They also appreciate the importance of other criteria, with particular emphasis on \(c_{3}\) (tangible benefits of the scientific activity). Hence, in the model proposed by the surveyed researchers the predominant position of only one criterion \(c_{1}\) has been replaced by the predominant position of the pair \((c_{1},c_{i})\), where \(c_{i}\) is any other criterion out of \(c_{2}\), \(c_{3}\), \(c_{4}\) (please note that the sum of the rank of \(c_{1}\) and the rank of any other criterion is more than 0.5). Therefore, based on the survey results, such a model would be recommended in which the evaluated scientific entity is good in terms of \(c_{1}\) but is also good in terms of at least one other criterion, \(c_{2}\), \(c_{3}\) or \(c_{4}\). Of course, the appropriate selection of weights will not solve all the problems related to the scientific entity evaluation algorithm. In particular, it does not prevent the “displacement” of good results in the most important category \(c_{1}\) by outstanding (in the number but not in the quality nor originality of achievements) results in the less important categories.
The present work tackles many problems and can be a starting point for further research in various areas. In particular, although a new way of deriving the algorithm weights in the official scientific units evaluation procedure is proposed, there are also other highly subjective parts of the algorithm where the PC method might help. One of them is the choice of the so-called reference scientific units by experts.
Conclusions
The identification of major criteria is a key issue for building a conceptual evaluation model. Once it is done, the final weights are computed from the relative pairwise comparisons by synthesizing them. The model demonstrated in this paper, consistent with the one proposed by the Ministry of Science and Higher Education (2012), has been used in Poland for evaluating scientific entities. However, the presented method is flexible and can accommodate all criteria at hand, including both quantitative and qualitative factors. No model is ideal, and models usually undergo evolution as time passes. It is anticipated that CERU will be improving the model used to evaluate academic entities at the national level. Using our approach to compute the weights is a time-consuming but necessary exercise, since the entire country will benefit when the weights are computed (as opposed to being assigned arbitrarily). In particular, the success-index (Franceschini et al. 2012) could improve the performance evaluation methods in the evolved model.
Footnotes
 1.
Following the official regulation, the authors understand that the scientific entity (sometimes also referred to as the scientific unit) means: a research unit within the university such as faculty or department, an independent research institute (national and international), and research units within The Polish Academy of Science.
 2.
 3.
Please compare with Theorem 4 in Saaty (2008), where the importance weight assigned to every judge is identical.
 4.
The data used are real and taken from http://www.nauka.gov.pl. For the purposes of the example, the number of entities in the group SI1EA has been reduced from 44 to 4 arbitrarily selected ones.
 5.
A Polish Committee for Evaluation of Research Units; Polish acronym is KEJN
 6.
In Poland there are professor extraordinarius and professor ordinarius.
 7.
On the other hand, even if two rejected cases were taken into account their impact on the final result would be negligible.
Notes
Acknowledgements
The following faculty members have kindly provided their input and agreed to be listed (according to the Polish tradition with the scientific titles; the country is listed if outside Poland): Prof. dr hab. inż. Ryszard Tadeusiewicz (AGH UST), Prof. dr hab. Stanisław Kistryn (The Jagiellonian University), Prof. dr Bogdan Denny Czejdo, Belk Distinguished Professor (Fayetteville State University, USA), Prof. dr hab. Stan Matwin (IPI PAN, Poland, Dalhousie University, Canada), Prof. Eugene Eberbach (Rensselaer Polytechnic Institute, USA), dr hab. Krzysztof Oprzędkiewicz, Prof. AGH (AGH UST), Prof. dr hab. Maria Mach-Król (University of Economics in Katowice), Prof. dr hab. Halina Kwaśnicka (Wroclaw University of Technology) and Prof. Witold Kwaśnicki (Administration and Economics University of Wroclaw), dr inż. Jarosław Wąs (AGH UST), dr inż. Radosław Klimek (AGH UST), dr inż. Paweł Skrzyński (AGH UST), mgr inż. Krzysztof Kluza (AGH UST), Weronika Adrian (AGH UST). The authors would like to thank all respondents of the Internet survey. The authors would like to thank T. Kakiashvli, MD, Amanda Dion-Groleau (Laurentian University), Grant O. Duncan (student at Laurentian University; Team Lead at Health Sciences North, Sudbury, Ontario, Canada), and Mr Ian Corkill for the editorial improvements. Mr Karol Wójcik (student at AGH UST) has developed and maintained the Internet surveying tool.
References
 Aczél, J., & Saaty, T. L. (1983). Procedures for synthesizing ratio judgements. Journal of Mathematical Psychology 27(1):93–102. doi: 10.1016/0022-2496(83)90028-7.
 Anholcer, M., Babiy, V., Bozóki, S., & Koczkodaj, W. W. (2010). A simplified implementation of the least squares solution for pairwise comparisons matrices. Central European Journal of Operations Research 19(4):439–444.
 Bozóki, S., & Rapcsák, T. (2008). On Saaty’s and Koczkodaj’s inconsistencies of pairwise comparison matrices. Journal of Global Optimization 42(2):157–175.
 Bozóki, S., Fülöp, J., & Rónyai, L. (2010). On optimal completion of incomplete pairwise comparison matrices. Mathematical and Computer Modelling 52(1–2):318–333. doi: 10.1016/j.mcm.2010.02.047. URL http://www.sciencedirect.com/science/article/pii/S0895717710001159.
 Condorcet, M. (1785). Essay on the Application of Analysis to the Probability of Majority Decisions. Paris: Imprimerie Royale.
 Crawford, G. B. (1987). The geometric mean procedure for estimating the scale of a judgement matrix. Mathematical Modelling 9(3–5):327–334. doi: 10.1016/0270-0255(87)90489-1. URL http://www.sciencedirect.com/science/article/pii/0270025587904891.
 Dong, Y., Xu, Y., Li, H., & Dai, M. (2008). A comparative study of the numerical scales and the prioritization methods in AHP. European Journal of Operational Research 186(1):229–242.
 Fechner, G. T. (1966). Elements of psychophysics, vol 1. Holt, Rinehart and Winston, New York.
 Franceschini, F., Maisano, D., & Mastrogiacomo, L. (2012). Evaluating research institutions: The potential of the success-index. Scientometrics 96(1):85–101.
 Fülöp, J., Koczkodaj, W. W., & Szarek, S. J. (2010). A different perspective on a scale for pairwise comparisons. Transactions on Computational Collective Intelligence 1:71–84.
 Geuna, A., & Martin, B. R. (2003). University research evaluation and funding: An international comparison. Minerva 41(4):277–304.
 Geuna, A., & University of Sussex, SPRU: Science and Technology Policy Research (1999). The Changing Rationale for European University Research Funding: Are There Negative Unintended Consequences? Electronic working paper series, University of Sussex, SPRU. URL http://books.google.pl/books?id=lBpuMwEACAAJ.
 Grzybowski, A. Z. (2012). Note on a new optimization based approach for estimating priority weights and related consistency index. Expert Systems with Applications 39(14):11699–11708.
 Herman, M. W., & Koczkodaj, W. W. (1996). A Monte Carlo study of pairwise comparison. Information Processing Letters 57(1):25–29. doi: 10.1016/0020-0190(95)00185-9.
 Jensen, R. E. (1984). An alternative scaling method for priorities in hierarchical structures. Journal of Mathematical Psychology 28(3):317–332. doi: 10.1016/0022-2496(84)90003-8. URL http://www.sciencedirect.com/science/article/pii/0022249684900038.
 Ji, P., & Jiang, R. (2003). Scale transitivity in the AHP. Journal of the Operational Research Society 54(8):896–905. doi: 10.1057/palgrave.jors.2601557.
 Kistryn, S. (2013). Mission of CEAE: how easy is it to evaluate quality? (Misja KEJN – czy łatwo ocenić jakość?). URL http://forumakademickie.pl/fa/2012/05/misjakejnczylatwoocenicjakosc/.
 Koczkodaj, W. W. (1993). A new definition of consistency of pairwise comparisons. Mathematical and Computer Modelling 18(7):79–84. doi: 10.1016/0895-7177(93)90059-8.
 Koczkodaj, W. W., & Szarek, S. J. (2010). On distance-based inconsistency reduction algorithms for pairwise comparisons. Logic Journal of the IGPL 18(6):859–869.
 Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. San Mateo: Morgan Kaufmann. pp 1137–1143.
 Kułakowski, K. (2013). A heuristic rating estimation algorithm for the pairwise comparisons method. Central European Journal of Operations Research, pp 1–17. doi: 10.1007/s10100-013-0311-x.
 Ministry of Science and Higher Education. (2012). Regulation on principles of science financing (Polish: Rozporządzenie Ministra Nauki i Szkolnictwa Wyższego w sprawie kryteriów i trybu przyznawania kategorii naukowej jednostkom naukowym). Dziennik Ustaw Rzeczypospolitej Polskiej 877, URL http://www.bip.nauka.gov.pl/_gAllery/19/31/19319/poz._877.pdf.
 Saaty, T. L. (1977). A scaling method for priorities in hierarchical structures. Journal of Mathematical Psychology 15(3):234–281. doi: 10.1016/0022-2496(77)90033-5. URL http://www.sciencedirect.com/science/article/pii/0022249677900335.
 Saaty, T. L. (2005). The analytic hierarchy and analytic network processes for the measurement of intangible criteria and for decision-making. In: Multiple Criteria Decision Analysis: State of the Art Surveys, International Series in Operations Research and Management Science, vol 78, Springer New York, pp 345–405. doi: 10.1007/0-387-23081-5_9.
 Saaty, T. L. (2008). Relative Measurement and Its Generalization in Decision Making. Why Pairwise Comparisons are Central in Mathematics for the Measurement of Intangible Factors. The Analytic Hierarchy/Network Process. Estadística e Investigación Operativa / Statistics and Operations Research (RACSAM) 102:251–318.
 Saaty, T. L. (2013). On the measurement of intangibles. A principal eigenvector approach to relative measurement derived from paired comparisons. Notices of the American Mathematical Society 60(02):192.
 Salo, A. A., & Hämäläinen, R. P. (1997). On the measurement of preferences in the analytic hierarchy process. Journal of Multi-Criteria Decision Analysis 6(6):309–319. doi: 10.1002/(SICI)1099-1360(199711)6:6<309::AID-MCDA163>3.0.CO;2-2.
 Subramanian, N., & Ramanathan, R. (2012). A review of applications of analytic hierarchy process in operations management. International Journal of Production Economics 138(2):215–241.
 Thurstone, L. L. (1994). A law of comparative judgment, reprint of an original work published in 1927. Psychological Review 101:266–270.
 Triantaphyllou, E., Lootsma, F. A., Pardalos, P. M., & Mann, S. H. (1994). On the evaluation and application of different scales for quantifying pairwise comparisons in fuzzy sets. Journal of Multi-Criteria Decision Analysis 3(3):133–155.
 Wang, X., Liu, D., Ding, K., & Wang, X. (2011). Science funding and research output: A study on 10 countries. Scientometrics 91(2):591–599.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.