Factor analysis is a multivariate statistical technique developed to test hypotheses regarding the correspondence between scores on observed variables (surface attributes), or indicators, and the hypothetical constructs (internal attributes), or latent factors, presumed to affect such scores (Kline 2013). The foundation of factor analysis is the assumption that the internal attributes exist. The internal attributes are hypothetical constructs that can be used to understand and account for observed phenomena. They are more fundamental than surface attributes and cannot be measured directly; however, their effects are reflected in the measures of the surface attributes. The basic principle of factor analysis is that the internal attributes influence the surface attributes in a systematic manner; thus, measurements obtained from indicators are, at least in part, the result of the linear influence of the underlying latent factors (Tucker and MacCallum 2016).
Factor analysis has three major applications. First, it can be applied to reduce a large number of indicators to a smaller set. Second, it can be used to establish the underlying dimensions between the indicators and the latent factors, thus generating or refining theory. Finally, factor analysis provides construct validity evidence for self-report scales (Thompson 2004; Tabachnick and Fidell 2001; Taherdoost et al. 2014). There are two discrete categories of factor analysis techniques: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). EFA estimates unrestricted measurement models, whereas CFA analyses restricted measurement models (Kline 2013). Thus, for CFA the indicator-factor correspondence needs to be specified, whereas for EFA there are no specific expectations regarding the number or nature of the underlying factors. The EFA and CFA techniques are further described in the subsections below.
Exploratory factor analysis
Exploratory factor analysis allows researchers to explore the main dimensions to generate a theory or model from a relatively large set of indicators (Thompson 2004; Pett et al. 2003; Taherdoost et al. 2014). EFA is particularly suitable for scale development and is applied when the theoretical basis for specifying the number and patterns of common latent factors is unavailable (Taherdoost et al. 2014). The ultimate goal of EFA is to determine the number of latent factors required to explain the correlations between the indicators, thus establishing the theory. EFA is based on the common factor model, which postulates that each indicator in a set of indicators is a linear function of one or more common factors and a unique factor (Thurstone 1947). The common factors are the unobservable latent factors that influence more than one indicator in a set and are presumed to account for the correlations among the indicators. The unique factors are the latent variables that are assumed to influence only one indicator in a set and do not account for the correlations among the indicators. The objective of the common factor model is to understand the structure of correlations among the indicators by estimating the relationship patterns between indicators and latent factors, indexed by so-called factor loadings (Fabrigar et al. 1999). The goals of EFA for the current study were twofold: (1) probe the validity of the factor structure obtained from Hinterleitner et al. (2011b), and (2) explore the measured affective “loadings,” i.e. valence, arousal and dominance, on the obtained factors.
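In scalar form, the common factor model expresses each indicator \(x_j\) as a weighted linear combination of \(k\) common factors \(\xi _1, \ldots , \xi _k\) plus a unique factor \(\delta _j\):
$$\begin{aligned} x_j = \lambda _{j1} \xi _1 + \lambda _{j2} \xi _2 + \cdots + \lambda _{jk} \xi _k + \delta _j, \end{aligned}$$
where the weights \(\lambda _{j1}, \ldots , \lambda _{jk}\) are the factor loadings of indicator \(x_j\) on the respective factors.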
The EFA approach is sequential and linear and involves many options; therefore, the development of an analysis protocol is imperative. There are several methodological issues associated with the EFA procedure, one of them being the indicator selection process. Indicator selection is a critical step as it determines the quality of the factor analysis (Fabrigar et al. 1999). Therefore, for the current study, indicators were selected based on P.85 recommendations (ITU-T 2016) and previous research (Hinterleitner et al. 2011a, b), which helped identify important hypothetical constructs, or latent factors, associated with synthesized speech QoE.
A second methodological issue relates to the sufficiency of available data for EFA. The first consideration towards establishing the sufficiency of data is sample size. Various recommendations and opinions exist regarding the optimum sample size for EFA. For example, Comrey and Lee (2013) suggested that a sample size of 50 is very poor, 100 is poor, 200 is fair, 300 is good, 500 is very good and 1000 is excellent. Moreover, MacCallum et al. (1999) illustrated that with communalities greater than 0.6 and with each latent factor defined by several indicators (Henson and Roberts 2006), the sample size can be relatively small. Other studies, in turn, have suggested that the nature of the data is what should determine the adequacy of the sample size (Fabrigar et al. 1999; MacCallum et al. 1999). Another recommendation towards establishing sample size adequacy is based on the sample-to-variable ratio, denoted N:p, where N refers to the sample size and p to the number of indicators. Rules of thumb for N:p values have ranged from 3:1 to 20:1 in the literature (e.g., see Costello and Osborne 2005).
An additional consideration towards establishing data sufficiency is the factorability of the correlation matrix. A factorable matrix contains several sizeable correlations; therefore, the correlation matrix should be inspected for correlations above 0.30 for factor analysis to be meaningful (Tabachnick and Fidell 2001). Finally, the so-called Kaiser–Meyer–Olkin (KMO) measure (Kaiser 1970) and Bartlett’s test of sphericity (Bartlett 1950) have been proposed as measures of sampling adequacy (Taherdoost et al. 2014). The KMO measure is indicative of the proportion of variance among the items that is common, thus suggesting an underlying latent factor. The KMO measure varies between 0 and 1, and values above 0.5 are typically considered adequate for EFA (Kim and Mueller 1978). Bartlett’s test of sphericity, on the other hand, tests the null hypothesis that the correlation matrix is an identity matrix, i.e., that all variables are uncorrelated (Hair et al. 2009). If the significance value is below an alpha level of 0.05, this null hypothesis is rejected, suggesting that the correlation matrix is not an identity matrix and that the items are sufficiently related for factor analysis. In the current EFA study, the sample size (N) was 264, as 6 subjects scored 44 speech stimuli, and the number of indicators (p) was 11, leading to an N:p ratio of 24:1. The KMO measure and Bartlett’s test of sphericity are also used herein to establish sample adequacy.
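As an illustrative sketch, these adequacy checks can be computed in R with the psych package; the data frame `ratings` used below (one row per rated stimulus, one column per indicator) is a placeholder for the actual study data:

```r
library(psych)

# 'ratings' stands in for a hypothetical 264 x 11 data frame of scores
R <- cor(ratings)

# Factorability: inspect the matrix for correlations above 0.30
round(R, 2)

# KMO measure of sampling adequacy (values above 0.5 considered adequate)
KMO(R)

# Bartlett's test of sphericity against the identity-matrix null;
# p < 0.05 indicates the indicators are sufficiently correlated for EFA
cortest.bartlett(R, n = nrow(ratings))
```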
A third methodological issue in performing EFA relates to the factor extraction method. Several factor extraction methods exist, such as principal component analysis (PCA), principal axis factoring (PAF), and maximum likelihood (ML) (Costello and Osborne 2005; Hair et al. 2009). The PCA-based method computes factors without any regard to the underlying latent factors, whereas the PAF-based method is used for the determination of the underlying latent factors related to the indicators (Taherdoost et al. 2014; Fabrigar et al. 1999). The ML-based method, in turn, is more suitable when the data are normally distributed and allows the computation of various goodness-of-fit measures for the model (Fabrigar et al. 1999). The PCA- and PAF-based methods are the most commonly used for EFA (Taherdoost et al. 2014). In the present study, PAF-based factor extraction was used as it does not require the data to be normally distributed and is less likely to produce improper solutions than ML-based methods (Fabrigar et al. 1999).
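A minimal sketch of PAF extraction in R, again using the psych package (the number of factors here is purely illustrative; factor retention is discussed next):

```r
library(psych)

# Principal axis factoring without rotation; 'ratings' and nfactors = 3
# are placeholders, not the actual study data or factor count
efa_unrotated <- fa(ratings, nfactors = 3, fm = "pa", rotate = "none")
print(efa_unrotated$loadings)
```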
A fourth methodological issue involves choosing the factor retention method. The number of factors to be retained is an important consideration, as under- or over-extraction of factors can result in substantial errors, thus affecting the efficiency and meaning of EFA (Taherdoost et al. 2014). There are various criteria for factor retention, such as Kaiser’s criterion and the scree test. Kaiser’s criterion recommends retaining all the latent factors that have eigenvalues greater than one, as this is the average size of the eigenvalues in the full decomposition (Kaiser 1960). The scree test, in turn, recommends inspecting the graphical representation of the eigenvalues for discontinuities, as the number of data points above the discontinuity represents the major factors (Hair et al. 2009). In this study, the number of factors to retain was determined using a combination of both Kaiser’s criterion and the scree test.
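Both criteria can be sketched in a few lines of base R, continuing with the hypothetical `ratings` data frame:

```r
# Eigenvalues of the indicator correlation matrix
ev <- eigen(cor(ratings))$values

# Kaiser's criterion: retain factors with eigenvalues greater than one
n_kaiser <- sum(ev > 1)

# Scree test: look for the discontinuity ("elbow") in the eigenvalue plot
plot(ev, type = "b", xlab = "Factor number", ylab = "Eigenvalue")
abline(h = 1, lty = 2)  # Kaiser threshold shown for reference
```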
Another EFA-related methodological issue involves the selection of the rotation method. The rotation of factors helps to produce simplified and interpretable results by maximizing high factor loadings and minimizing low factor loadings. There are two categories of rotation techniques, namely orthogonal and oblique rotation. Orthogonal rotation produces factors that are uncorrelated with each other, whereas oblique rotation results in factors that correlate with each other, thus leading to the production of correlated construct structures (Costello and Osborne 2005). Various methods exist for orthogonal and oblique rotation, such as varimax, quartimax and equamax for orthogonal rotation, and quartimin and promax for oblique rotation (Mulaik 2009). In the current study, we used the promax oblique rotation method for EFA as it produces simplified factor structures while minimizing the cross-loadings (Hinterleitner et al. 2011b). The promax rotation begins with a varimax rotation, followed by raising the pattern coefficients to a higher power \(\kappa \) (kappa), which forces near-zero coefficients to approach zero faster (Mulaik 2009). The \(\kappa \) value usually ranges between 1 and 4, and for the current study we used a \(\kappa \) value of 4, as in Hinterleitner et al. (2011b).
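In R, base stats::promax() follows exactly this two-step procedure (it applies a varimax rotation internally before raising the loadings to the power given by its argument m, i.e., \(\kappa \)); a sketch on the unrotated PAF solution from above:

```r
# Promax rotation with kappa = 4 (the 'm' argument) applied to the
# unrotated principal-axis loadings
rotated <- promax(loadings(efa_unrotated), m = 4)

# Display the rotated pattern matrix, suppressing loadings below the
# 0.5 interpretation threshold used in this study
print(rotated$loadings, cutoff = 0.5)
```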
Lastly, the final issue relates to the interpretation of the produced factor structure and the naming of the constructs based on the factor loadings. This final step reflects the theoretical and conceptual intent and allows for better model interpretation (Hair et al. 2009). In order to meaningfully interpret a factor, at least two to three indicators must load onto it. The theoretical and conceptual interpretation of the factors computed in our study was motivated by previous research reported in Hinterleitner et al. (2011a, b, 2012), as these studies were very closely related to the objectives of our study. The interpretation of the factors involves exploring the indicator-factor relationships by investigating the factor loadings. Hair et al. (2009) define a practically significant cut-off threshold of 0.5 for a factor loading, and indicators that load at 0.5 or higher on two or more factors are considered cross-loaders. Therefore, in this study a threshold of 0.5 was used for factor loadings to interpret indicator-factor relationships. Moreover, towards establishing a more reliable factor structure, EFA was also performed on random subsamples of the data extending from N = 165 to N = 264 with increments of 2. This exploratory analysis allowed us to vary the sample-to-variable ratio from 15:1 to 24:1, thus further validating the data sufficiency hypothesis. The key steps involved in performing EFA are summarised in Fig. 1.
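The subsample analysis could be sketched as follows (the seed, factor count and salience summary are illustrative choices, not those of the original analysis):

```r
library(psych)

set.seed(1)  # illustrative seed
p <- 11      # number of indicators

# Sample sizes from N = 165 upward in steps of 2 (N:p from 15:1 to ~24:1)
for (n in seq(165, 264, by = 2)) {
  sub <- ratings[sample(nrow(ratings), n), ]
  fit <- fa(sub, nfactors = 3, fm = "pa", rotate = "promax")
  # count indicators with a salient (>= 0.5) loading on some factor
  salient <- sum(apply(abs(fit$loadings), 1, max) >= 0.5)
  cat(sprintf("N = %3d (N:p = %4.1f:1): %d of %d indicators salient\n",
              n, n / p, salient, p))
}
```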
Confirmatory factor analysis
The EFA forms the conceptual and theoretical foundation for factor models describing ‘indicator-latent factor’ relationships. Confirmatory factor analysis (CFA), in turn, explicitly and directly tests the ‘fit’ of the factor model developed using EFA (Thompson 2004). CFA requires researchers to have specific expectations regarding the number of factors, the indicator-latent factor relationships, and the correlations between the latent factors; thus, an established theory is needed for CFA. CFA allows for the direct testing of theory and the quantification of the degree of model fit.
Formulation
The CFA model can be expressed as follows (Anderson and Gerbing 1988; Vandenberg and Lance 2000):
$$\begin{aligned} x = \tau + \Lambda \xi + \delta , \end{aligned}$$
(1)
where x is a vector of ‘n’ indicators, \(\tau \) is a vector of ‘n’ intercepts, \(\xi \) is a vector of ‘i’ latent factors such that \(i < n\), \(\Lambda \) is a \(n \times i\) matrix of factor loadings that relate the indicators to the latent factors, and \(\delta \) is a vector of ‘n’ variables that represent random errors of measurement and the measurement specificity of the indicators. In most CFA applications, the intercepts are assumed to be zero and are not estimated (Vandenberg and Lance 2000). The model also assumes that \(E(\xi \delta ') = 0\) and that the variance-covariance matrix of x, denoted as \(\Sigma \), is given by:
$$\begin{aligned} \Sigma = \Lambda \Phi \Lambda ' + \Theta , \end{aligned}$$
(2)
where \(\Phi \) is the \(i \times i\) covariance matrix of the latent factors \(\xi \) and \(\Theta \) is the diagonal \(n \times n\) covariance matrix of \(\delta \).
Methodology
CFA is mainly concerned with modelling the latent factors that account for the commonality among a set of indicators. The commonality between measures of a construct can be depicted using path diagrams (Hoyle 2000). In path diagrams, the measured variables, or indicators, are represented using rectangles and the unmeasured variables using ellipses. As such, latent factors are represented using large ellipses and the unobserved measurement errors that affect the indicators using smaller ellipses. Causal relationships are indicated using single-headed arrows, whereas double-headed curved arrows are used to represent variances and covariances. Two different models exist—principal factor (reflective) and composite latent variable (formative)—to describe the causal relationships between latent factors, indicators and errors of measurement (Jarvis et al. 2003). The reflective model expects the latent factors to cause changes in the indicators, whereas in the formative model the indicators are expected to effect changes in the latent factors (Jarvis et al. 2003). The decision rules, or guiding principles, to choose the appropriate model are listed in Jarvis et al. (2003). Based on such rules, the reflective model was found to be better suited for the current study. Therefore, the indicators were expected to be caused by two unmeasured influences: (1) a causal relationship they share with other indicators (i.e., the latent factor), and (2) a causal influence unique to each indicator that is quantified using the errors of measurement (Hoyle 2000).
There is a variety of statistical packages available for implementing CFA, such as MPlus (Byrne 2013a), AMOS (Byrne 2013b), and lavaan (Rosseel 2012). For the current study, we implemented CFA using the lavaan (latent variable analysis) package for R. The lavaan package allows the specification of the CFA model (as represented in the path diagram) through its model syntax. The model syntax is a description of the model that needs to be estimated. The lavaan package also provides estimates of various goodness-of-fit measures for the developed model, as detailed next.
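A minimal sketch of such a specification is shown below; the factor and indicator names are placeholders and do not reflect the factor structure obtained in this study:

```r
library(lavaan)

# Hypothetical two-factor reflective model; '=~' reads "is measured by"
model <- '
  f1 =~ ind1 + ind2 + ind3
  f2 =~ ind4 + ind5 + ind6
'

# Fit the model to the (placeholder) indicator data and inspect the
# estimates together with the goodness-of-fit measures
fit <- cfa(model, data = ratings)
summary(fit, fit.measures = TRUE, standardized = TRUE)
```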
Goodness-of-fit metrics
The factor model is considered acceptable if the covariance structure implied by the model matches the covariance structure of the sampled data (Cheung and Rensvold 2002). The acceptability of the model is reflected in its goodness-of-fit (GOF) indices. The most common GOF index is the \(\chi ^2\) metric, which is derived from the fitting function quantifying the discrepancy between the observed and the implied covariance matrices. The \(\chi ^2\) test evaluates the null hypothesis that \(\chi ^2\) equals 0, which would indicate the best possible fit (Cheung and Rensvold 2002). The \(\chi ^2\) test, however, is greatly affected by sample size (Cheung and Rensvold 2002). Therefore, other GOF indices have been proposed, such as the comparative fit index (CFI), normed fit index (NFI), non-normed fit index (NNFI), incremental fit index (IFI), relative non-centrality index (RNI), goodness-of-fit index (GFI), and standardized root mean square residual (SRMR) (Jackson et al. 2009; Cheung and Rensvold 2002). The CFI, NFI, NNFI, IFI and RNI indices compare the performance of the model with a baseline (or null) model that assumes zero correlation between all the indicators. The GFI, on the other hand, does not compare the model to a baseline model and is computed based on the amount of variance explained by the model. Finally, the SRMR index is estimated by computing the mean absolute value of the covariance residuals. Typically, values ≥0.90 are considered adequate for the CFI, NFI, NNFI, IFI, RNI and GFI indices (Bagozzi and Yi 1988; Bentler and Bonett 1980), whereas a value of SRMR ≤0.08 (Vandenberg and Lance 2000) reflects an adequate model fit. Here, a combination of these indices is used for model validation.
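All of the indices above can be extracted from a fitted lavaan model; a brief sketch using the hypothetical fit object from the previous subsection:

```r
# Request the chi-square statistic and the fit indices discussed above
fitMeasures(fit, c("chisq", "df", "pvalue",
                   "cfi", "nfi", "nnfi", "ifi", "rni", "gfi", "srmr"))
```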
Measurement and structural invariance
The CFA forms part of the larger family of structural equation modelling (SEM) methods. The SEM methods are a broad class of statistical models that consist of two parts: the measurement model and the structural model (Jackson et al. 2009; Beaujean 2014). The measurement model reflects the relationships between the latent factors and the indicators, whereas the structural model describes the relationships of the latent factors to each other (Jarvis et al. 2003). Towards establishing the reliability and validity of the measurement and structural models, it is important to establish the between-group invariance (or equivalence) of the models (Vandenberg and Lance 2000). The measurement and structural invariance of the model help verify: (1) the conceptual equivalence of the latent factors across groups, and (2) the equivalence of the associations between indicators and factors, and between factors, across groups. The invariance of the models is demonstrated by testing a number of hypotheses regarding measurement and structural invariance (Vandenberg and Lance 2000).
The first hypothesis tests for the equivalence of the pattern of zero and non-zero coefficients in the matrix of factor loadings (\(\Lambda \) in Eq. 1) (Oort 2005). The hypothesis is tested by estimating the same model for each group simultaneously while allowing the estimated parameters to differ, and equivalence is evaluated through a \(\chi ^2\) test. A p value ≤0.05 rejects the hypothesis that the models are equivalent, whereas a p value greater than 0.05 supports configural invariance (Beaujean 2014).
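In lavaan, the configural model can be estimated by supplying a grouping variable; the variable name `voice` (e.g., natural vs. synthesised) is hypothetical:

```r
# Same model in both groups, all parameters free to differ across groups
fit_config <- cfa(model, data = ratings, group = "voice")
```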
The second hypothesis tests for the equivalence of the unstandardized factor loadings across groups (Sass 2011) by constraining the loadings to be equal between groups, and is referred to as metric or weak invariance. A further test evaluates the equivalence of the unstandardized intercepts or thresholds across groups by constraining the intercepts to be equal between groups, and is called scalar or strong invariance. Yet another test evaluates the equivalence of the residuals across groups by constraining the error variances to be equal between groups, and is known as uniqueness or strict invariance (Beaujean 2014). Combined, the configural, metric, scalar and strict invariances evaluate the measurement invariance of the model, as these steps are mainly concerned with the indicator-latent variable relationships (Beaujean 2014). Structural invariance testing, on the other hand, evaluates the properties of the latent variables and thus involves constraining the variances, covariances and means of the latent factors in a stepwise manner.
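This stepwise procedure maps directly onto lavaan's group.equal argument; a sketch of the full sequence, compared through \(\chi ^2\) difference tests:

```r
# Measurement invariance: increasingly constrained nested models
fit_metric <- cfa(model, data = ratings, group = "voice",
                  group.equal = "loadings")
fit_scalar <- cfa(model, data = ratings, group = "voice",
                  group.equal = c("loadings", "intercepts"))
fit_strict <- cfa(model, data = ratings, group = "voice",
                  group.equal = c("loadings", "intercepts", "residuals"))

# Structural invariance: additionally constrain latent variances,
# covariances and means
fit_struct <- cfa(model, data = ratings, group = "voice",
                  group.equal = c("loadings", "intercepts", "residuals",
                                  "lv.variances", "lv.covariances", "means"))

# Chi-square difference tests between the nested models
lavTestLRT(fit_config, fit_metric, fit_scalar, fit_strict, fit_struct)
```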
If a given level of invariance is untenable for all variables, a follow-up analysis is needed to determine which indicators contribute to the model misfit. This follow-up analysis involves invariance testing while leaving the non-invariant indicators in the model but not constraining them to be invariant across the groups. The resulting model is said to have partial invariance, whereby invariance holds for most of the parameter estimates, with the exception of a few parameters within the invariance model (Beaujean 2014). The non-invariant indicators are identified using their modification indices. The modification index estimates the amount of overall decrease in the \(\chi ^2\) value if a previously constrained parameter were freely estimated (Kline 2013). The modification index is interpreted as a \(\chi ^2\) statistic with a single degree of freedom (Kline 2013).
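In lavaan, modification indices are available through modindices(), and partial invariance can be specified via the group.partial argument; the freed loading below ('f1 =~ ind3') is purely illustrative:

```r
# Inspect the largest modification indices of the constrained model
modindices(fit_metric, sort. = TRUE, maximum.number = 10)

# Partial metric invariance: keep the loadings constrained except for
# the (hypothetical) non-invariant indicator ind3
fit_partial <- cfa(model, data = ratings, group = "voice",
                   group.equal = "loadings",
                   group.partial = "f1 =~ ind3")
```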
Measurement and structural invariance of the model can be interpreted using response shift theory (Oort 2005; Sass 2011; de Beurs et al. 2015). Response shift is defined as: “a change in the meaning of one’s self-evaluation of a target construct as a result of (a) a change in the respondent’s internal standards of measurement (i.e., scale recalibration); (b) a change in the respondent’s values (i.e., the importance of component domains constituting the target construct through reprioritization) or (c) a redefinition of the target construct (i.e., reconceptualization)” (Schwartz and Sprangers 1999). The concepts represented by the factors are reflected in the patterns of zero and non-zero factor loadings in the \(\Lambda \) matrix (Oort 2005). Therefore, according to response shift theory, configural non-invariance, which leads to unequal factor loading patterns across groups, occurs due to reconceptualization. Reconceptualization reflects a change in the meaning of the indicators, thus leading to a change in the conceptual representation of the latent factors (Barclay and Tate 2014). Furthermore, metric non-invariance occurs due to reprioritization, which involves an indicator becoming more or less indicative of a concept (Oort 2005). The graphical representation of reprioritization is shown in Fig. 2, indicating an underestimation of the indicator values for the group with lower loading values, regardless of the value of the latent construct/factor (Wicherts and Dolan 2010). For example, let us assume that one of the latent factors for the present study is listening pleasure, and that the indicator showing reprioritization across natural and synthesised voices is acceptance, with \(\lambda _{nat} > \lambda _{tts}\); in this case, it can be said that the acceptance of natural voices will be higher compared to synthesised voices, irrespective of the listening pleasure they offer. Scalar and strict non-invariance, in turn, represent uniform and non-uniform recalibration, respectively (Oort 2005). The recalibration process indicates a change in the internal standards of the participants; if the change affects all response options in the same direction and to the same extent, then it leads to uniform recalibration (Oort 2005). The graphical representation of uniform recalibration is shown in Fig. 3, indicating an underestimation of the indicator values for the group with lower intercept values, regardless of the value of the latent construct/factor (Wicherts and Dolan 2010). Moreover, a non-invariant factor variance model suggests true changes in the variances of the factor, whereas a non-invariant factor covariance model indicates higher-level reconceptualization or reprioritization (Oort 2005). Finally, a non-invariant factor means model reflects true changes in the factor means across groups (Oort 2005). The key steps involved in performing CFA followed by measurement and structural invariance tests are summarised in Fig. 4.