Constrained Dual Scaling for Detecting Response Styles in Categorical Data

Dual scaling (DS) is a multivariate exploratory method equivalent to correspondence analysis when analysing contingency tables. However, for the analysis of rating data, different proposals appear in the DS and correspondence analysis literature. It is shown here that a peculiarity of the DS method can be exploited to detect differences in response styles. Response styles occur when respondents use rating scales differently for reasons not related to the questions, often biasing results. A spline-based constrained version of DS is devised which can detect the presence of four prominent types of response styles, and is extended to allow for multiple response styles. An alternating nonnegative least squares algorithm is devised for estimating the parameters. The new method is appraised both by simulation studies and an empirical application.


Introduction
A major issue in questionnaire-based research is the presence of response styles. A response style, sometimes also known as response bias or scale usage heterogeneity, can be described as systematic bias due to a respondent's tendency to respond to survey items regardless of its content (Van Rosmalen, Van Herk, & Groenen, 2010). Paraphrasing, a response style is the manner in which a person uses a rating scale, an example being extreme response style where the respondent, for no substantial reason, prefers to use the endpoints of the Likert scale more often than the intermediate rating categories.
Response styles can invalidate statistical analyses since they are completely confounded with the substantial information contained in the data and hence biases results in nontrivial ways (Baumgartner & Steenkamp, 2001). The problem manifests itself when different respondents resort to different response styles within the same data set. Advanced methods, such as the latent-class multinomial logit model of Van Rosmalen et al. (2010), the multidimensional ordinal IRT model of De Jong and Steenkamp (2010), or the ordinal regression model with heterogeneous thresholds of Johnson (2003), have been developed to deal with the data analysis when response style contamination is relevant. None of these appear to have achieved much popularity in practice.
Existing models often require a substantial investment of resources for its implementation, estimation and/or interpretation. As an alternative, the method presented in this paper results in a data set cleaned of the effects of response styles so that any analyses appropriate for the continuous nature of this cleaned data can be conducted. Furthermore, this method has three additional purposes, namely to (i) determine whether different response styles are present in categorical data; (ii) identify the respondents associated with each response style; and to (iii) classify the identified response styles into four different types. Software which implements 969 the method in the R software environment (R Core Team, 2014) is available from the first author.
The proposed method is a variant of dual scaling (DS) for rating data (Nishisato, 1980a), also referred to as successive categories data in the DS literature. DS is an exploratory multivariate method, akin to correspondence analysis or CA (e.g. Greenacre, 2007). In the special case of rating data, DS however differs from CA in a manner that implicitly caters for response styles by including parameters for the Likert scale categories in an innovative way. These parameters allow for the detection of frequent (or infrequent) usage of certain ratings since the optimal scores assigned by DS to these parameters depend on how often each rating occurs in the data. The new method builds on this aspect of DS by including monotone spline functions to model the response styles and by allowing for multiple response styles through latent classes.
The literature on response styles (also known as scale-usage bias or heterogeneity) can be traced back at least to the work of Cronbach in the 1940s (e.g. Cronbach, 1941Cronbach, , 1942Cronbach, , 1946Cronbach, , 1950. For an overview of the early work, see for example Rorer (1965). A more recent set of references can be found in Baumgartner and Steenkamp (2001). Krosnick (1999) discusses the origins of response styles as a shift in the procedure whereby a response is formulated; this is also known as satisficing in the literature (e.g. Krosnick, 1991). The use of so-called personal equations with double coding, as known in the French school of CA, is a related method of dealing with differences in the interpretation of rating scales at the respondent level (e.g. Benzécri, 1992;Murtagh, 2005).
The next section focuses on a closer discussion of response styles. Section 3 introduces spline functions for modelling response styles, explains the new methodology and details an alternating least squares (ALS) algorithm for solving an extended version of the DS problem. A simulation study is conducted in Section 4 to assess the strengths and weaknesses of the method. Finally, an application (Section 5) is presented.

Overview of Response Styles
It is assumed that the process of formulating a response to a survey item requires the respondent to map a latent opinion, preference or some similar concept to a Likert scale. For example, the respondent may be asked how much she agrees with a certain statement using a scale with categories ranging from "1-Totally Disagree" to "5-Totally Agree." During the cognitive process of formulating the answer, the respondent first forms an opinion about the survey item and subsequently needs to decide how to transform or map this opinion to the presented rating scale (see for example Weijters and Baumgartner (2012)). The mathematical properties of this response mapping from the latent to the Likert scale determines whether a respondent exhibits a response style or not.
Specifically, a response style can be defined as a monotone nonlinear response mapping (Van de Velden, 2007). If this transformation is linear, no response style is present. Consequently, once a method is available to estimate response mappings, the presence of response styles can be assessed by looking at the curvature properties of the estimated mappings. These steps are carried out in subsequent sections. In the case where Likert scales are used these transformations are step functions, but for the moment it is more intuitive to consider continuous transformations.
Four different response styles are considered here, as depicted in Figure 1. This figure shows different possible inverse mappings from the rating supplied by the respondent on the horizontal axis to the respondent's true latent opinion on the vertical axis. The inverse transformations are shown since these must be estimated from the observed data. Examples of (inverse) response style functions mapping the true item content scale (vertical axis) into the observed measurement scale (horizontal axis).
The different styles can be characterized by which parts of the latent opinion scale are stretched and which parts are shrunk. These are shown by the rug plots on the respected axes in Figure 1. For ease of exposition, it is assumed here that the true latent opinion comes from a uniform distribution. The rug on the horizontal axis partitions the axis into intervals of equal length, with each interval receiving a rating on the Likert scale. Here a seven-point scale is employed. The rug on the vertical axis shows the effect that the response style transformation has on the intervals of equal length. Hence these transformations characterize the following four response styles: • Acquiescence (ARS) shrinks the lower part of the latent scale and stretches the upper part indicating that higher ratings are favoured (panel (a)); • Disacquiescence (DRS) in contrast favours lower ratings by stretching and shrinking the lower and upper parts of the latent scale, respectively (panel (b)); • Midpoint responding (MRS) reflects a tendency to frequent the middle categories of the rating scale (panel (c)); and • Extreme responding (ERS) in contrast means that the endpoints of the rating scale is used more often than the middle categories (panel (d)).
A critical concept is that the boundaries dividing the latent preference scale into the different rating categories, that is the tick marks on the vertical axes in Figure 1, determines which response style is present. If these boundaries are equally spaced, no response style is present. Any significant deviations however give a cause to believe that a response style is present.
The methodology outlined in the next section makes use of these boundaries to provide an estimate of the response mappings of groups of individuals.

Methodology
Consider the situation where a set of m objects or survey items are being rated on a q-point Likert scale, enumerated as 1 to q. Due to the ordinality, this is sometimes known as successive categories data (Nishisato, 1980b(Nishisato, , 1994. It is supposed that n individuals are asked to rate the objects according to their preference. Objects may receive equal ratings, and it is assumed that there exists a fixed but unknown preference structure for the set of objects, such as a population mean. Let X denote the n × m data matrix. Note that the method detailed below requires all items to use a common rating scale. The next subsection discusses using DS for analysing successive categories data in general, making use of the method's relationship with correspondence analysis. Monotone quadratic splines for modelling response styles are introduced in Section 3.2. Subsequently the DS method is modified to utilize these splines together with latent classes to allow for multiple response styles. An alternating nonnegative least squares algorithm is described for fitting the model in Section 3.4. Selecting the number of latent response style groups (Section 3.5) and creating a data set purged of the effects of response styles (Section 3.6) are also discussed.

Dual Scaling of Successive Categories Data
Dual scaling is a multivariate exploratory statistical technique which is equivalent to correspondence analysis (CA) when analysing contingency tables (Van de Velden, 2000). For such cases, it is used to visualize departures from the independence assumption in the two-way contingency table in a low dimensional space, akin to principal components analysis (PCA) for continuous data (Nishisato, 1980a;Greenacre, 2007). However, for the successive categories data dealt with here there are important differences.
Both DS and CA deal with non-contingency table data by typically applying the standard procedure to a specific recoding of the data, designed to transform the data into a form that resembles a contingency table (Greenacre, 2007). This recoding requires the original data matrix X to be transformed before analysis, and for successive categories data in particular the recoding schemes differ in an important way. The usual CA method uses a doubling of columns (that is, adding an additional column to X for each object) to construct scales with "positive" and "negative" poles before applying ordinary CA (see Greenacre, 2007). However, Nishisato (1980b) proposes the following alternative method. This involves augmenting rating scale category thresholds or boundaries to the data, which increases the number of columns from m to m + q − 1, and then converting this to rank-orders. Although Nishisato's original DS formulation focuses on a so-called dominance matrix (see Nishisato, 1980a), it has been shown that DS applied to these rank-orders are equivalent to doubling the rows (instead of the columns) of the matrix of rankings before applying CA ( Van de Velden, 2000;Torres & Greenacre, 2002).
The method is perhaps best illustrated by an example. Consider transforming the following data matrix X, where three objects A, B and C are rated by n = 4 respondents on a 5-point Likert scale (thus, q = 5). The first step requires augmenting 4 (= q − 1) columns to X, one column for each of the boundaries between the pairs of adjacent ratings. Let the boundaries be called b 1 , . . . , b 4 , where b 1 falls between ratings 1 and 2, and so forth up to b 4 which falls between categories 4 and 5. It suffices to assign scores midway between the rating categories to each boundary, to arrive at the augmented data matrix: 3 1 1.5 2.5 3.5 4.5 2 2 5 1.5 2.5 3.5 4.5 3 2 2 1.5 2.5 3.5 4.5 1 5 4 1.5 2.5 3.5 4.5 Secondly, each row is converted to rankings, starting with a lowest rank of 0 and a highest rank of 6 (= m + q − 2) in this case. For ties the total ranking assigned to the tied objects is distributed equally. This yields the following result for the example: 3 0 1 2 4 6 1.5 1.5 6 0 3 4 5 4 1.5 1.5 0 3 5 6 0 6 4 1 2 3 5 ⎞ ⎟ ⎟ ⎠ . (2) Note that in general T has n rows and m + q − 1 columns. DS also requires construction of the matrix S that would have resulted if q was the lowest and not the highest rating of the Likert scale. This is easily achieved as Using the CA formulation of DS of Van de Velden (2000), a row-doubled ratings matrix F r : 2n × (m + q − 1) is constructed as This matrix is subjected to CA, which assigns optimal scores in the vectors a and b to the rows and columns of F r , respectively. Since the aim is to assign to the boundaries ordered scores which are sensitive to rating scale use, a one-dimensional solution is used. This assignment is achieved by minimizing a least squares criterion L(a, b) through the singular value decomposition (Van de Velden, Groenen, & Poblome, 2009). In the present context L is given by where c is a proportionality constant, 1 denotes vectors of ones of the appropriate lengths and 1 2 (m + q − 2)11 centres the rankings in F r . For identifiability, a constraint such as a = 1 is imposed. The method is discussed in more detail in Section 3.3. Note that an important consequence of the data recoding scheme is that the DS procedure provides coordinates for the boundaries. The effect of the boundaries is to retain the information on how different the original ratings assigned to the objects were before the rankings were constructed. The coding scheme also imposes ordinality on the object and the boundary scores in b by constructing rankings.
The optimal scores assigned to the boundaries can be used to detect response styles since they estimate the thresholds of the response mapping of the group of respondents, as was discussed in relation to Figure 1. Intuitively optimal scores assigned to the boundaries work as follows. If a specific rating category j is used very often, the boundaries b j−1 and b j will often receive rankings which differ substantially since the category is often filled. Consequently, the optimal scores assigned will differ significantly, indicating that respondents use the category very often. The same reasoning illustrates that when rating j is used very infrequently, the optimal scores for b j−1 and b j will be very similar. Therefore, when a group of respondents have the same response mapping, the method will be able to tell which type that mapping is.
In Section 3.3, latent classes will be introduced for the boundary scores which allows for multiple response styles. First, however, using monotone quadratic splines with the dual scaling method is discussed.

Modelling Response Styles by Monotone Quadratic Splines
From Figure 1, it is evident that the four response styles considered can be completely described in terms of its curvature properties. By dividing the horizontal axes into two equal lower and upper parts, the four response styles are characterized by either concavity or convexity in the lower and upper parts of its domain. This is summarized in Table 1. For inferential and response style classification purposes, it will prove useful to parameterize the response style transformations considered here. Furthermore, using smooth functions will improve model parsimony and the stability of parameter estimation, as well as facilitate the process of purging the response styles from the data by interpolation (see Section 3.6). The family of monotone quadratic splines with a single interior knot is ideal for this purpose as it combines two quadratic polynomial functions in the adjacent intervals of the domain, subject to continuity and differentiability restrictions at the interior knot. These splines are either concave, convex or linear in the lower and upper halves of the domain and therefore reproduce all the curves described in Figure 1 and Table 1.
The splines have three non-constant basis functions (the so-called I-spline basis) derived by appropriately integrating the basis functions of the M-spline basis (see Ramsay, 1988). A quadratic monotone spline with a single interior knot t ∈ [L , U ] and intercept μ is of the form In the proposed model, t = L + 0.5(U − L) is chosen to lie halfway between the lower and upper limits L and U , respectively. Monotonicity requires that α i ≥ 0 for i = 1, 2, 3. The basis functions M 1 , M 2 and M 3 are constructed to ensure continuity and first-order differentiability at t, and their formulae are as follows (Ramsay, 1988): Hence (6) is simply a linear combination of these three piecewise quadratic functions with an intercept. The spline functions are built into the column scores b in (5) by using the (q − 1) × 4 design matrix M to collect the basis functions corresponding to μ, α 1 , α 2 and α 3 , respectively. The basis 974 PSYCHOMETRIKA Figure 2.
The three I-spline basis functions for quadratic monotone splines with a single interior knot t.
functions are evaluated at the midpoints between rating categories, for example at 1.5, 2.5 up to 6.5 for a 7-point Likert scale. Hence b can be written as with b 1 the m-vector of unrestricted object scores and b 2 the (q − 1)-vector of spline-restricted boundary scores. The spline parameters are collected in α = (μ, α 1 , α 2 , α 3 ) . The basis functions M 1 , M 2 and M 3 in (7), as depicted in Figure 2, are piecewise quadratic, with only two of them nonconstant in each of the intervals [L , t) and [t, U ]. This is convenient because it means the second derivative of f , and hence the curvature, depends only on two parameters in each interval. Rescaling without loss of generality so that L = 0 and U = 1, the curvature of f (not necessarily defined at t = 1/2) is given by The function f (x) is either convex, concave or linear in a given interval depending on whether the second derivative (9) is positive, negative or zero, respectively, which does not depend on x. In fact, assuming that α 1 and α 3 are larger than zero, the curvature can be measured solely in terms of the ratios α 2 /α 1 and α 2 /α 3 , referred to henceforth as the curvature parameters. For example, the requirement for convexity in both the lower and upper domain is When one or both of α 1 and α 3 are zero, one or both of these curvature parameters may be undefined. This can cause problems for its graphical representation, some of which will be shown below. In such cases, a continuity adjustment through the addition of a small positive constant to both the numerator and denominator in (10) can be useful.
It is possible to rewrite Table 1 wholly in terms of the curvature parameters, but more importantly using the curvature parameters it is possible to visualize the curvature of an estimated response style in a single plot. Figure 3 illustrates the situation by plotting α 2 /α 3 against α 2 /α 1 , as well as incorporating the response style classification regions derived from Table 1. When both curvature parameters equal two, no response style is present. Due to the fact that both curvature parameters has the range [0, ∞), a more symmetric plot is arrived at by using the base-2 logarithmic transform and centring-this is illustrated in Section 5.

Dual Scaling Method for Multiple Response Styles
To allow for multiple response styles, suppose that each of the n individuals belongs to one of K response style groups, the exact membership being unknown. Let the n × K matrix G contain as columns the group indicator vectors {g k } K k=1 , each indicating which individuals belong to that specific group. The column scores {b k } K k=1 are of the same form as b in Equation (8), but are now group-specific by replacing b 2 with b 2k = Mα k . This allows for the different groups to have different response mappings by letting the spline parameters α k = (μ k , α 1k , α 2k , α 3k ) vary between groups. The object scores b 1 and the row scores a remain fixed across all response style groups.
The group membership G needs to be estimated, together with the 2n-vector a of optimal scores for the individuals and the column score vectors b k of length (m + q − 1) contained in the K columns of B. It is required for monotonicity that α ik ≥ 0 for all i and k. The loss function in Equation (5) must be adjusted to allow for the multiple response styles as well as for the spline restrictions. This constrained DS method for the detection of response styles can be formulated as min a,B,G L(a, B, G) 976 PSYCHOMETRIKA The adjusted loss function (compare Equation (5)) is Again, c is a proportionality constant, and the diagonal matrices D g k are constructed as In this context, diag(x) denotes the diagonal matrix with x on the main diagonal. Through using the {D g k } K k=1 in (12), individuals are associated with the corresponding b k for their group. As K increases, the number of parameters in the model increases and consequently the loss function L decreases as well. Therefore, when considering how the value of L changes for different values of K in a scree plot, it is convenient to standardize these values to the unit interval [0, 1].
An algorithm for minimising L is discussed in the next section.

An Alternating Nonnegative Least Squares Algorithm
Solving the optimization problem in (11) requires finding a, B, and G under the appropriate restrictions. The approach discussed here alternates over two steps: 1. The algorithm combines ALS and nonnegative least squares (NNLS; Lawson and Hanson, 1974) to approximate the optimal a and B for a given group membership matrix G. This involves fixing the value of a, estimating the optimal B with NNLS, and then updating a by ordinary least squares (OLS) based on the estimate of B. This ALS process is repeated for a given G until numerical convergence is observed. 2. For fixed a and B, G is updated by a K -means type algorithm given the values determined for a and B. This step simply allocates each individual sequentially to the group which minimises the loss function.
The algorithm alternates between steps one and two until the loss function L changes by less than a small positive constant. Note that starting values for both a and G are required. For the a vector standard normal random numbers are simulated, while random assignment to K groups is used for G. Block-relaxation algorithms such as this is guaranteed to converge monotonically, albeit to a local minimum; therefore multiple random starts are required (De Leeuw, 1994). The related issues of local optima in K -means clustering and categorical principal components analysis are discussed in Hand and Krzanowski (2005) and Kooij (2007, Chapter 2), respectively. In Appendix 2 an overview of these local optima is given in the context of the empirical example (Section 5). The optimization process is described in more detail in Algorithm 1, with an exposition of its derivation deferred to Appendix 1. The formulation is for a single starting configuration of G, and needs to be repeated for multiple such configurations. Parameters that need to be specified include n a , the number of (random) starts used for a, the maximum number of iterations maxit a and maxit G for the ALS and K -means phases, respectively, and also the numerical tolerances 1 > 0 and 2 > 0 for these two steps. Note that the spline restrictions are sufficient as normalization constraints in the ALS part of the algorithm, and hence the vector a is only standardized to a 2 = 2n after convergence.
To update G, the algorithm cycles through all respondents in turn. For the current respondent i, it calculates for each class what the loss function would be if respondent i were assigned to that class, given the current classification of all other respondents. This respondent is then moved to the class with minimum loss (or stays in the same class if this is already the best choice). The algorithm then proceeds to the next respondent i + 1, and starts again with respondent 1 once the last respondent is reached. Once a complete pass over all respondents are made where no change in classification occurs, the updating of G terminates and the algorithm returns to the ALS updating step.

Selecting the Number of Response Style Groups
To select the number of groups K , a scree plot of the loss function for different values of K can be used. The aim is to choose the smallest K such that larger values do not significantly reduce the loss. This method was introduced by Cattell (1966) and has been widely adopted. The DS method also separates individuals based on the shape of the response transformations and rating frequencies in the groups. This supplementary information can be helpful for choosing K in cases where the scree plot is not conclusive. This will be illustrated in the empirical application of Section 5.

Purging Response Styles
Once the estimatesâ,B andĜ have been obtained, these can be used to create a version of the original data X which is purged of response styles. All that is needed is to use the splines estimated for each response style group to assign optimal scores to the rating scale. Then these scores are substituted in X by replacing every rating with the appropriate optimal score.
Determining the optimal scores of the ratings proceeds by evaluating the splines as before, but now at the ratings themselves and not at the boundaries. This simply requires constructing a design matrix from the spline basis functions evaluated at the rating categories 1 to q, where for categories 1 and q, respectively, L and U are used in the notation of Section 3.2. As before, a single interior knot t at the middle of the domain [L , U ] of the splines are assumed. Let this matrix be M * . The optimal scores are then simply determined as M * α k . In Section 4.3 a simulation experiment is conducted to assess how accurately this method can reproduce a known underlying correlation structure from contaminated data.

Simulation Model
The simulated data was generated in a three-step procedure. First, the true underlying preference structure for the m objects is obtained by simulating m random numbers from a U (0, 1)distribution. These are gathered into the m-vector μ. Second, individual preferences are generated by simulating n times from each of m truncated normal distributions respectively centred at the elements of μ. The individual preferences are given by δ i = μ + ε i , with ε i , i = 1, . . . , n, representing the individuals deviation from the mean.
Truncation is done at 0 and 1 so that response styles can be defined on the closed interval [0, 1]. Note that the use of truncation avoids overflow problems at the lower and upper ends of the response style mapping, and hence improves on the original approach of Van de Velden (2007). The truncated normal draws are done independently and with error variance σ 2 , which is an important parameter because it determines how pronounced the multi-modality of the mixture of truncated normals over [0,1] is. An increase in the value of σ implies that it easier to detect response styles as the actual preference structure plays less of a role in forming the ratings.
The resultant true preferences are randomly divided into different response style groups. Finally, these data are discretized to m categorical variables with q-point Likert-scales, according to the cut points on [0, 1] implied by the chosen K response styles. These response styles are parameterized to come from the family of monotone quadratic splines outlined in Section 3.2.
In the simulations, choices must be made regarding the following: the number of objects m, the number of rating categories q, the underlying standard deviation σ , the number of response styles K , as well as their shapes defined by α k , k = 1, . . . , K , the sample size n and how this is divided among the K groups, namely n k , k = 1, . . . , K .

Assessing Classification Performance
The first simulation study assesses the classification accuracy of the DS method. It is assumed in this experiment that the number of groups K is known beforehand. For each of the experimental conditions, 50 simulated data sets were constructed and the DS method applied. For each data set estimation was based on 15 random starts for G, and for each of these starts the ALS procedure was initialised from 50 different random configurations for the row scores a.
The 108 experimental conditions consisted of the following. The number of objects m was varied over 10, 20 and 30 items. The rating scales employed were either q = 5 or 7-point scales. Sample sizes of n = 200, 1000 and 5000, respectively, were used. The number of groups K were either 3 or 5. For each of these K , it was assumed that one of the groups has a linear response mapping (that is, a group with no response style). The additional K − 1 groups exhibited response styles through nonlinear mappings. For K = 3, these additional groups were acquiescence and extreme responding, since Baumgartner and Steenkamp (2001) found that these are most prevalent in survey data. For K = 5, groups for disacquiescence and midpoint responding were also added. The corresponding spline functions used to simulate from are shown in Figure 4. The sample of n respondents was assigned to the groups by allocating either 20, 50 or 80 % of respondents equally among the K − 1 response style groups. These percentages represent the amount of contamination in the simulated data. The remaining percentage of respondents was assigned to the group exhibiting no response style. The latent standard deviation σ was fixed at 0.1 for all experiments.
To assess the classification performance of the method, the adjusted Rand index as well as the percentage correctly classified (the so-called hit rate) were computed. The adjusted Rand index (ARI) of Hubert and Arabie (1985) assesses the similarity between two partitions, adjusted for chance correspondences between these partitions. The upper limit of the ARI is one, and indicates perfect agreement. An ARI of zero indicates that the method does not improve on random assignment, with all positive values indicating an improvement. Negative ARI values are also possible, and indicate poorer performance than random assignment. The ARI is in general lower than the hit rate, and can be considered as a more objective measure of performance.
For each of the 108 experimental conditions, Tables 2 and 3 show the average values over the 50 simulated data sets. It is apparent that the sample size n does not have a large influence on the ARI and hit rate. The number of groups K is very important for performance when the contamination percentage is low (20 %). This is because for K = 5 groups the 20 % of contaminated data points must be divided into four groups instead of 2 when K = 3, which results in groups with very low proportions n k /n of the total sample. The low performance here is somewhat compensated for by using more items, such as m = 30, but for K = 5 groups even more items are needed. In general, using more items increases the classification accuracy. Using a larger number of rating categories q also increases performance, but mostly so with fewer groups (K = 3). The method improves on random assignment-especially in cases with higher response style prevalence and 20 or more items the improvement is substantial.  Table   3.
Average hit rates for 50 simulations at the different parameter settings.

Recovering Underlying Structure through Data Cleaning
The simulation model of Section 4.1 assumes that, given the expected value of the object scores m, the objects are independently distributed as truncated normal distributions. Although the true correlation matrix between the objects thus is the identity matrix I, the observed correlations after the response style contamination is often inflated. To show improvement, the cleaned data derived as in Section 3.6 should have correlations resembling independence more closely. A visual example is given in Figure 5 for simulated data (m = 20, K = 3 similar to the conditions used in Tables 2 and 3), where the colours indicate the magnitude of the Pearson correlations. It is evident that the response styles artificially inflate the correlations. When q = 7, the cleaned data to some extent succeeds in removing the spurious correlations, but when q = 5 the situation is not much improved.
The conditions under which the cleaned data can be expected to provide a better representation of the underlying correlation matrix was studied further through simulations. For the different values of K , n, q, and the proportion of response style contamination used in Section 4.2, 50 simulated data sets were constructed and cleaned through the DS method. Here m = 20 was fixed for simplicity. For each of these data sets, the root mean square error (RMSE) between I and the empirical Pearson correlation matrix for the contaminated data was calculated, where for commensurable matrices V and W. Similarly, the RMSE comparing I with the empirical Pearson correlations of the cleaned data can be computed. A reduction in the RMSE when using the (a) (b) (c) Figure 6.
An example of the correlation structure imposed by the Clayton copula's, in terms of Kendall's τ . cleaned data as opposed to the contaminated data indicates that the cleaned data has a correlation structure which matches the true correlation structure more closely. A two-sample Wilcoxon test, also known as the Mann-Whitney test, (e.g. Rice, 2007) was used to test the null hypothesis that the RMSE is equal for the contaminated and cleaned data against the one-sided alternative that the RMSE for the contaminated data is greater than that of the cleaned data. The results are quite clear: when q = 7 the null hypothesis is always rejected ( p < 0.001) in favour of the alternative, whilst when q = 5 the null hypothesis cannot be rejected even once (all p > 0.2). It can therefore be deduced that when a sufficient number of rating categories q are used, the correlation structure of the cleaned data is more representative of the true underlying structure of the data.
A related question concerns the performance of the method in the presence of a nontrivial correlation structure. To impose such a structure whilst retaining truncated normal marginal distributions for the objects, a copula is used (note that the truncated multivariate normal distribution does not guarantee truncated normal marginals). A copula is a multivariate distribution function C(u 1 , u 2 , . . . , u m ) with uniform marginals (Hofert & Mächler, 2011). According to Sklar's theorem (Sklar, 1959;Hofert & Mächler, 2011) a multivariate distribution function F with marginals {F j } m j=1 can be constructed as The marginal truncated normal distributions can be achieved by the inverse probability integral transform. The dependence structure between the variables is solely determined by the copula.
Here two independent Clayton copula (Clayton, 1978) functions will be used to impose a correlation structure in terms of Kendall's τ , a well-known measure of rank correlation (see Kendall, 1938;Hofert & Mächler, 2011). The structure induced here for m = 20 is as follows: the first ten objects are correlated with τ = 0.2, independent of the other ten objects which are correlated with τ = 0.35. These τ values amount to Pearson correlations of approximately ρ = 0.3 and ρ = 0.5, respectively (an approximate relationship is ρ ≈ sin(τ π/2)-see Kendall and Gibbons (1990)). It is also possible to introduce negative correlations by using 1 − U instead of U in the inverse probability integral transform. In the application here these reversals are made randomly with differing probability γ . The theoretical, observed and cleaned correlations given by Kendall's τ for one such copula is illustrated in Figure 6, with m = 20 and q = 7. The difference in RMSE can again be used to evaluate the effect of the data cleaning on the correlation structure, now using Kendall's τ since the Clayton copula's use this measure directly. A simulation study was conducted for m = 20 objects with the other parameters varying as before. For each combination of the parameters, the RMSE was calculated for 50 randomly generated data sets according to the copula model described above. Then for each data set the constrained DS model was fit as before, and a cleaned data set constructed. The difference in the RMSE for the contaminated data as compared to the cleaned data was recorded. Table 4 presents the average reduction in RMSE as a result of cleaning the data with the DS procedure. As before the two-sample Wilcoxon test was performed. Significant improvements were found in all cases except those printed in italic in Table 4. It is apparent that the cleaned data improves the RMSE in all cases, except where both q and K are small and the proportion of contamination is moderate (50 %) to large (80 %). Except for these circumstances, the constrained DS method improves the estimation of the true correlation structure by removing the response styles effects.

Recovering the Parameters in Principal Components Analysis
It is possible to examine how well the method can recover parameters after the contaminated data have been cleaned of response styles. For simplicity, PCA (e.g. Johnson & Wichern, 2002) was used as analysis method, a well-known multivariate dimension reduction technique that seeks to summarize the majority of the variation in the data by a few uncorrelated linear combinations of the original variables (the so-called principal components). Subsequent principal components each account for as much variation in the data as possible, subject to being uncorrelated with the previous components. PCA relies on the eigendecomposition of the covariance (or correlation) matrix, where the eigenvalue-eigenvector pairs give the variance accounted for and the linear combination (also known as the principal component loadings), respectively, for each component.
The following procedure was used to compare the PCA conducted on the true correlation matrix to those conducted on the correlation matrices of the cleaned and contaminated data, respectively. First, a matrix of standard normal random numbers of dimension m × r is simulated, with r denoting the required rank of the PCA solution. The rows are then standardized to length one; denote this matrix by L. The simulated correlation matrix is then R = LL , with the corresponding covariance matrix assumed to be = σ 2 R. Here σ 2 is the same error variance as assumed in Section 4.1. Since the decomposition R = LL is not unique, the eigendecomposition of R is used to re-express R as R = L r L r , where L r is constructed from the first r eigenvectors and singular values of R.
Second, a population mean vector μ for the m items is simulated as uniform random numbers. The true underlying data for the respective respondents are then simulated from the multivariate normal distribution with mean vector μ and covariance matrix . The resultant matrix represents the uncontaminated data. Subsequently, response styles are added to arrive at the contaminated data. The same response styles as in Section 4.2 were used, the only difference being that the range [L , U ] of the splines was set to be the 1st and 99th percentiles of the sampled values, respectively. Any spillovers outside the range of the splines are then added to the lowest or highest rating category. The interior knot t was fixed at the mean of the sampled values.
Finally, the constrained DS method was applied to the contaminated data, assuming that the correct number of response styles K are known and using 15 and 50 random starts for G and a, respectively. Based on this, a cleaned data set was constructed, from which the cleaned empirical correlation matrix,R c is obtained. Similarly, letR o be the empirical correlation matrix of the observed (i.e. the contaminated data). To compare the PCA solutions on these correlation matrices to that of R, the decompositionsR c ≈ L c L c andR o ≈ L o L o are constructed as before assuming that the researcher is able to identify the correct rank r of R. The RMSE between L r and L c is then compared to that between L r and L o to determine whether the PCA structure of the cleaned data reflect the actual structure better or worse than the contaminated data.
For this simulation study, it was assumed that all groups are of equal size. The total sample size was varied over n = 200, 1000 and 5000 respondents as before, with either K = 3 or 5 Table   4.
Average proportional improvement in the RMSE of the cleaned over the contaminated data.

986
PSYCHOMETRIKA response styles added. Again, either q = 5 or 7 response categories were studied, with m = 10, 20 or 30 items. The rank of was either r = 2, 3 or 4. For each combination of these factors, 100 simulated data sets were analysed. The results are shown in Table 5, which displays the average relative improvement in the RMSE of the cleaned over the contaminated data. It is evident that the PCA structure is better reflected by the cleaned data when q = 7. From the table it can therefore be concluded that rating scales of more than five categories are ideal for the method. For rating scales with q = 5, marginal improvements are seen only for small numbers of items. It is reassuring that the method does not yield significantly worse result for less refined rating scales such as q = 5. The improvement of the method is greatest for small values of m. The number of segments K does not influence performance. Finally, the method performs best for low values of r , which corresponds to simpler underlying structures.

Application
To illustrate the method in an empirical application, consider data obtained from an anonymous multinational food and beverage conglomerate regarding an investigation of product perceptions for 20 similar products. These include in-house products as well as those of competitors. Data were collected from n = 268 panellists, who scored each product on seven different sensory attributes using a 9-point Likert scale. Each product is rated on all seven attributes (or, equivalently, items), so that there are 140 items collected in a data matrix with 268 rows and m = 140 columns. The Likert scale ranges from 1 ("low") to 9 ("high"), and hence q = 9. Since these products are generally liked by consumers, acquiescence can be expected. The data set is available in coded form as part of the cds package (Schoonees, 2015) for the statistical computing environment R (R Core Team, 2014). This can be obtained online from the Comprehensive R Archive Network. The package contains the software used for all computations in the present paper.
The first step is to select K by inspecting the loss function through a scree plot. Consideration is also given to the curvature properties of the splines as well as how well the method separates groups of panellists who exhibit different distributions of rating scale use. It is expected that once spurious clusters are added at least two of the estimated response curves will be very similar, and/or that two groups will on aggregate use the rating scale in a very similar fashion. For each of K = 1, 2, . . . , 8 groups, the algorithm was run from 50 different random starts for the grouping matrix G, where appropriate. Also, 50 random starts for the ALS part of the algorithm was used. Appendix 2 gives insight into the effect of local optima for these data. Figure 7 shows the resulting (rescaled) scree plot. There does not seem to be a clear "elbow" in the plot, although it is apparent that K = 3, 4 and 5 are the options requiring closer scrutiny. As K increases beyond 5 not much improvement in the loss function is observed.
The response mappings for the solutions K = 1, . . . , 8 are displayed in Figure 8. In these plots the horizontal axis contains the original rating scale, while the vertical axis denotes the optimal scores assigned to the Likert scale. The area of the bubbles superimposed on the transformation plots indicate how often each rating category is used, aiding in the interpretation. A first observation is that (strictly, almost) all the detected response mappings have the characteristic convex shape of acquiescence. This means that all panellists have a tendency to use positive ratings frequently. The groups differ with respect to the intensity of the acquiescence.
Furthermore, the range of optimal scores that is assigned to each group, namely 3 i=1 α ik in terms of the spline parameters set out in Sections 3.2 and 3.3, depends on the within-group variability of rating scale use. Groups where individual panellists' rating scale use show more variability from the group's aggregate rating scale use are assigned optimal scores with a wider range. Hence the method treats such groups, i.e. groups containing more individualistic respondents, as more informative as opposed to groups with more uniform response behaviour. Table   5.
Average proportional improvement in the RMSE when comparing the principal component loadings between the cleaned and contaminated data. Scree plot for the sensory data.
A closer look at the distribution of the rating scale use in the identified groups reveal that all groups in the solutions K = 3, 4 and 5 show visually different distributions, except group I and group III when K = 5. The relative frequencies with which each rating is used in each of the groups when K = 5 are shown in the barplots in Figure 9. It is obvious that groups I and III have very similar aggregate behaviour when K = 5 . This is however not immediately apparent from the spline functions displayed in Figure 8, which assign different optimal scores to these groups.
A more formal comparison can also be made by using the Kullback-Leibler divergence (KL; e.g. Lehmann & Casella, 1998) between the distributions of different groups. This is also known as entropy distance and is often employed in the construction of classification trees (e.g. Breiman, Friedman, Stone, & Olshen, 1984). It is an asymmetric measure of the dissimilarity between two density functions, the reference density f and another density g, which is defined as E f [log( f (X )/g(X ))]. When f = g, the entropy is zero; otherwise it is positive. For discrete distributions the integral is replaced by a summation. In the present context, letf 1 , . . . ,f q and g 1 , . . . ,ĝ q denote the observed proportion of all answers in two different groups that use ratings 1, . . . , q, respectively. The observed KL divergence between these groups, with respect tof , is then q h=1f h log(f h /ĝ h ). Assessing the pairwise KL divergence for all pairs of groups (and using both f and g as reference) show that indeed the abovementioned two groups diverge the least among all pairs when K = 5-see Table 6. Since the method is designed to detect groups with different aggregate rating scale use it can be concluded that the addition of a fifth group is spurious and therefore K = 4 is selected. The findings of Figure 9 are therefore supported by this analysis.
Consider the results for K = 4 groups. These four groups consist of 67, 71, 61 and 69 panellists, respectively. The rating scale usage of these groups are displayed in Figure 10, panels (a)-(d). Figure 11 displays the optimal scores assigned to the ratings in the different groups as well as their curvature chart. The curvature chart includes an approximate 95 % confidence ellipse constructed for the parameter estimates of 5000 data sets simulated under the assumption that no response styles exist. Any group falling outside this band therefore has a significantly nonlinear response mapping and hence a response style.
Group I represents acquiescence as mainly ratings 6-9 are used by panellists. There is a slight boundary effect, as also with the other groups, in that category 9 is used less often than category 8. Because the ratings 6-9 are frequently used, the optimal scores assigned to these are close to zero. The estimated response mappings for K = 1 (top left) to 8 (bottom right) groups, respectively. The area of the bubbles are proportional to how often that particular rating is used. The group sizes are also shown in a legend. Groups are labelled sequentially; the legend should be read vertically and then horizontally.  The most meaningful optimal scores are assigned to the lower categories since when these are used it contains more information for this group of panellists. Overall the information provided by this group is low since the range of optimal scores assigned is very narrow. This is because the group members display low variability with respect to their rating scale use. This is evident from Figure 10(e), which plots the frequency with which each rating is used per individual. Group II represents a more extreme acquiescence where categories 7-9 are often used. The range of assigned optimal scores, and hence information, is similarly narrow, but shifted further to the left (a) (b) Figure 11.
a Optimal scores assigned to the K = 4 response style groups, from rating 1 (left) to rating 9 (right). b Curvature plot similar to Figure 3 for the four groups, with the axes now transformed to obtain a more symmetrical plot. The ellipse in the centre is an approximate 95 % confidence ellipse for no response style.
since the upper categories are used even more frequently. Since the response mapping is concave in the lower part of the domain there is a slight deviation from acquiescence towards an extreme response style. Groups III and IV both exhibit a mix of acquiescence and midpoint responding. This is evident from the relative frequencies in Figure 10 and the curvature chart in Figure 11(b). In these groups generally ratings 4-8 are preferred. Based on the range of optimal scores assigned to them these Optimal scores for each of the seven questions, separated by product and with similar items depicted by the same colours.
consist of the panellists providing the most information. Especially Group III is endowed with the most meaningful spread of optimal scores, and can be seen in Figure 10(g) to exhibit the most within-group variation.
Finally, consider the optimal scores assigned to the items as displayed in Figure 12. It is evident that Product R, and to a lesser extent Products N, D, E and F, received the lowest ratings. In contrast, Product P was the best performing one. By using a cleaned data set constructed by replacing the ratings by optimal scores further analyses can be conducted which are less influenced by the presence of the response styles.

Conclusions
A method that relies on the properties of DS for successive category data to detect response styles in categorical data was presented. It combines newly suggested spline models for four main types of response styles with the original DS method to construct optimal scores for the boundaries between rating categories. These optimal scores are sensitive to the presence of response styles. The method was adapted to allow for multiple response style groups by utilizing a k-means type procedure, which is combined with a constrained ALS algorithm using NNLS to fit the model.
Both the ability of the method to detect response styles and the improvement in correlation structure that results from a cleaned data set where ratings are replaced by optimal scores were studied. It was found that using 30 or more items and a rating scale of 7 or more categories yields great improvements in the classification of individuals to different response style groups. When fewer rating categories are used other factors become important, such as the extent to which response styles are present in the data. Also, when using a 7-point scale or more, the resulting cleaned data provide a more accurate description of the true substantial content in the data, after accounting for different response styles. The use of the method to identify respondents who provide similar amounts of information in their responses to a survey was illustrated on an empirical data set.
The number of response style groups to retain was selected on the grounds of a scree plot of the loss function, combined with the distribution of rating scale use in the different response style groups. It remains to be seen whether a more formal selection method can be derived. Other grounds for further research include alternatives for or additional restriction to the spline functions, 992 PSYCHOMETRIKA and whether more freedom is needed by allowing for differences between the m object scores in different groups. (x 1 , x 2 ) − (y 1 , y 2 ) 2 = x 1 − y 1 2 + x 2 − y 2 2 , it follows that Therefore b 1 can be updated by minimizing the first summation in Equation (19) by OLS independently of {α k } K k=1 . Since α ik ≥ 0 for all i and k, the latter vectors are updated for each k by using NNLS to minimize each of the individual elements of the second summation.

Appendix 2
Here a short exposition is given of the spread of local optima for the empirical example. Specifically, the variability of the loss function for the 50 random starts of G is shown in Figure 13. The curves are ordered from K = 2 at the top to K = 8 at the bottom. It is evident that only a single random start typically produces the best result. In general, the local optima is less stable for larger values of K , as can be expected. It is evident from this example that attention must be paid to the number of random starts used in empirical applications of such algorithms. These results suggest that the "best of 20 random starts" rule often favoured by practitioners of K -means clustering may not suffice (Hand and Krzanowski, 2005); a pragmatic approach is required.