Validation of the German group development (GD) questionnaire

Group Development (GD) is an important variable when researching and evaluating what makes teams successful. We analyzed the psychometric properties of the originally Spanish GD questionnaire with German participants. 501 team members and 104 team leaders, 18 to 65 years old, from a German research organization answered an online survey composed of the GD questionnaire and items related to other group processes of democracy, mutual trust, team spirit, and interest in the team’s tasks. Results confirmed the unidimensional factor structure of the translated Spanish version for the German GD construct. Internal consistency, convergent and discriminant validity were good. The German GD correlated as expected to other constructs, and it showed concurrent validity with respect to the team members’ motivation and interest in team tasks (r = .79, p < .01). We recommend using the GD in German samples to measure team processes that are highly relevant for team effectiveness.

Organizations rely increasingly on teams. Thus, assessing group development is a relevant aspect of human resource development. Group development models (Tuckman and Jensen 1977; Wheelan and Hochberger 1996) concentrate on developmental stages or a group's life cycle. Since they assume linear development (Miller 2003), they fail to explain what occurs after a team has reached the final stage. We expect this to cause ceiling effects or to eliminate variance in teams that were not set up recently. An alternative is to measure a continuous variable; researchers have often chosen group cohesion to assess a group's capacity for creating emergence or working as a team. However, this approach has been criticized (Hogg 1993), mostly because of the differing operationalizations of cohesion.
As valid alternative instruments for the continuous measurement of a group's development were missing, Meneses et al. (2008) created the group development (GD) questionnaire. The GD instrument consists of 8 items that form a single factor. It assesses group development as the degree to which a group shows the following characteristics: 1) interpersonal relationships among its members; 2) identification of the members with their group; 3) coordination of behaviors, resources and technologies; and 4) members being dedicated to achieving group goals (Navarro et al. 2015). The final GD questionnaire resulted from validation studies in Spain, Brazil, and Venezuela (Navarro et al. 2015).
To create the GD, Meneses et al. (2008) analyzed the literature and identified two main approaches: on the one hand, the stage approach of a developing group (Wheelan and Hochberger 1996) and, on the other hand, three families of continuous group measurements: groupness (Arrow et al. 2000), entitativity (Hamilton and Sherman 1996), and groupality (Roca Cortés 2001).
Among the latter three construct families, they found similarities in 8 dimensions (interrelationship, shared goals, identification with the group, group coordination, shared results, task interdependence, social value of the task, and orientation to group goals). Based on these dimensions, the GD was developed, eventually resulting in a measurement representing the aforementioned 4 characteristics of well-developed groups.
To date, most research on groups has been based on IPO (Input-Process-Output) models (Marks et al. 2001). Despite the criticism regarding IPO frameworks (Ilgen et al. 2005), it may be relevant for future users of the GD to know whether it represents a group process or rather an emergent state. Based on the instrument itself, as well as on the theoretical ingredients used to create the GD, we argue that it measures a group process.
The revised concept of groupness, as presented by McGrath (1984), was originally based on group fuzziness, thus relating to the work group's characteristics (e.g., number of members) and relationship patterns. Groupness is a "fundamental process for the existence of a group and one that explains the extent to which a set of people can be characterized by specific variables that enable it to be perceived as a group or an aggregate" (Meneses et al. 2008, p. 495). The criteria proposed by Arrow et al. (2000) also resemble processes or activities (e.g., coordination of behaviors) rather than cognitive or emotional states.
Entitativity refers "specifically to the degree to which a group really exists, that is, the extent to which it exists as an entity" (Meneses et al. 2008, p. 498). Entitativity mainly represents perceived unity, from the perspective of insiders or outsiders of the respective group. Hamilton et al. (2013) operationalized entitativity through organization and structure among the members, which resembles properties (i.e., emergent states) of the group rather than processes; nonetheless, processes are directly linked to these properties (leadership, performance). Lickel et al. (2000) proposed 5 variables: (a) interaction, (b) common goals, (c) common results, (d) similarity among members of the group, and (e) importance of the group for its members. These variables represent structure, process, and results, which seems to intermingle states and processes.
The concept of groupality originates from Soviet psychology and was adapted by Roca Cortés (2001) in the form of "characteristic attributes of a group that are present in personal interaction and that change both in quality and intensity as a function of time and group activity" (Meneses et al. 2008, p. 503). Roca Cortés (2001) proposed the following 5 dimensions: (a) social value of activity content, (b) communication and interpersonal relationships, (c) group goals, (d) leadership and management, and (e) group organization and group influence over its members. This concept is focused on the interaction between members and a measurement of processes rather than emergent states.
With the 4 characteristics mentioned above, the GD has the greatest overlap with groupality, and thus should be considered a team process. Each of these characteristics resembles a group process: interpersonal relationships among members and coordination of behaviors, resources, and technologies clearly relate to processes between members. Identification of the members with the group, and the members' orientation toward achieving group goals, represent processes at the individual level. Navarro et al. (2015) found the GD to correlate to measures of entitativity (Carpenter and Radhakrishnan 2002) at r = .77 (p < .01) and with group identification (Hogg et al. 1990) at r = .75 (p < .01). Navarro et al. (2016) found that GD scores were in line with Wheelan's GDQ phases (Wheelan and Hochberger 1996): the correlation between the global GDQ score and the GD was r = .74 (p < .01), and the correlation between GD score and the GDQ Phase IV score was r = .79 (p < .01). At GDQ phase IV, the most progressive phase that still comes with a measurement, the group "gets, gives, and uses feedback about its effectiveness and productivity"; it "acts on its decisions"; and it "encourages high performance and quality work" (Wheelan and Hochberger 1996, p. 157). Furthermore, Navarro et al. (2016) reported that the GD correlates with measures of the satisfaction of needs among the group members, r = .67 (p < .01).
The GD predicts the performance of a group, as measured by a questionnaire based on the criteria by Hackman (1987), and by the indicators absenteeism and order and hygiene (Navarro et al. 2015). Navarro et al. (2016) evaluated the incremental validity of the GD instrument with respect to the traditional stage-based model of group development as considered by Wheelan and Hochberger's (1996) questionnaire. Compared with the GDQ, the GD explained additional variance in a group's self-rating of effectiveness, which was based on Hackman's (1987) criteria (Navarro et al. 2016). Leuteritz et al. (2017) found that the GD mediated the relationship between transformational leadership and team effectiveness.
To make the instrument available for use in a German-speaking context, and to assess the construct's intercultural relevance, we validated the GD in a German sample. We deemed Germany a good choice because it is Europe's biggest economy.
As a criterion of concurrent validity, we chose the variable motivation and interest from the Team Climate for Learning (TCL) instrument by Brodbeck et al. (2010). Construct and items share a low level of resemblance with the GD questionnaire. Nevertheless, we assumed that members of a well-developed group would show high engagement in the team's tasks. Regarding convergent validity, we expected a moderate relationship between the TCL dimension mutual trust and the GD. We supposed mutual trust to be linked to the GD characteristic of good interpersonal relationships between team members. The TCL dimension democracy refers to the absence of dominance by one particular member or leader. Democracy does not have any conceptual overlap with the aspects measured by the GD. Nevertheless, dominant leadership may hamper cooperation (Brodbeck et al. 2010). Consequently, we included democracy as a criterion of discriminant validity.
We also expected the dimension solidarity (German: Zusammenhalt) in the Questionnaire on Working in Teams (German: Fragebogen zur Arbeit im Team; F-A-T, Kauffeld and Frieling 2001) to moderately correlate to the GD, since it reflects good personal interrelationships within the team. The F-A-T operationalizes social reflexivity, which corresponds to the salience of the group (Kauffeld and Frieling 2001); the GD's precursor constructs of groupness and entitativity also refer to the group's salience. Based on Navarro et al. (2015), we furthermore expected a German measure of group potency (Moser et al. 2005) to correlate moderately to the GD scores.

Participants
Team members and their leaders from a German research organization answered an online survey (Table 1). Members and leaders received different questionnaires. Including the leader, each team had at least 4 members (mean team size was 7.9 members). Leaders completed only the GD questionnaire, whereas members also answered items representing group potency, team spirit, democracy, mutual trust, as well as motivation and interest. The category other jobs included: administration, IT services, public relations, mechanical work, and facility services.

Group Development
The GD questionnaire consists of 8 items (Table 2) that come with a 5-point Likert scale and compose a single factor. Cronbach's α ranged between .70 and .85 in previous validation studies (Navarro et al. 2015). An example item is "We share tools, resources and information." To assess convergent validity, we examined the GD's correlations to similar constructs, using the following validated instruments. Table 3 and Table 4 show the internal consistency coefficients and correlation coefficients from our sample.
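Scoring such a scale can be sketched in a few lines of Python. The sketch assumes that the item marked "(inverse)" in Table 2 (item 5) is recoded as 6 minus the response on the 5-point scale, and that the scale score is the item mean; both the recoding rule and the choice of the mean over the sum are assumptions for illustration, and the function name is hypothetical.

```python
REVERSED_ITEMS = {5}  # 1-based item numbers; item 5 is marked "(inverse)" in Table 2


def gd_score(responses, scale_max=5):
    """Return the mean GD score for one respondent (assumed scoring rule).

    responses: the 8 item answers in questionnaire order, each from 1 to scale_max.
    """
    if len(responses) != 8:
        raise ValueError("the GD questionnaire has 8 items")
    recoded = [
        (scale_max + 1 - answer) if number in REVERSED_ITEMS else answer
        for number, answer in enumerate(responses, start=1)
    ]
    return sum(recoded) / len(recoded)


# Hypothetical respondent: item 5 is answered with 2, which is recoded to 4.
print(gd_score([4, 4, 4, 4, 2, 4, 4, 4]))
```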

Solidarity
We administered all 3 items measuring the factor solidarity (German: Zusammenhalt) in the F-A-T (Kauffeld and Frieling 2001).

Democracy, Mutual Trust, and Interest
We used 3 subscales from the Team Climate for Learning questionnaire (TCL) by Brodbeck et al. (2010): democracy (2 items) for discriminant validity, mutual trust (4 items) for convergent validity, and the subscale motivation and interest (4 items) as external criterion of concurrent validity. We maintained the original answer formats in all questionnaires named above.

Table 2 (partial). Items of the GD questionnaire, English / German / Spanish, as far as recoverable
Members feel committed to the achievement of the group objectives. / Los miembros se sienten comprometidos en la consecución de las metas del grupo.
5. There is a low interrelation among all members (inverse). / Hay una baja interrelación entre todos los miembros.
6. We share the same work values. / teilen wir in Bezug auf die Arbeit dieselben Werte. / Compartimos los mismos valores de trabajo.
7. We share tools, resources and information.
8. An essential task is to take care of our own development as a group. / Una tarea fundamental es cuidar de nuestro propio desarrollo como grupo.

Procedure
Following the guidelines of the International Test Commission (ITC 2017), we applied a back-translation method to account for possible cultural or linguistic differences. Tyupa (2011) proposed a framework for back-translation, which we followed as far as applicable, resulting in these steps:
4. Back-translation of the harmonized translation to Spanish by translator D (female Spanish language teacher; native speaker raised bilingually in German and Spanish; no access to source text).
5. Review of the back-translation by translator C, in cooperation with B.
6. Adaptation of the German translation by translators A and C, as the back-translation had revealed discrepancies and possible shifts in meaning or context.
7. Renewed back-translation of the new German items by translator D.
8. Review of the second back-translation by translators A and C, who now agreed that sufficient convergence had been achieved.
Since the English version has not been validated yet, it was not used. As in the original, a 5-point Likert scale is used.
We obtained permission for collecting the data from the organization's human resources officer and from the directors of the involved departments. We selected the participants from the organization charts of the involved departments. Each participant received an individual access code through encrypted e-mail. As an incentive, we provided a lottery. The survey was hosted on a European server. Only one researcher (JPL) had access to the tables that connected the participants' codes to real names; the data files themselves were anonymized. This procedure was applied to both samples: members and leaders.
We used the data provided by the team members for all analyses and performed complementary factor analysis and reliability assessment in the leader data set to broaden the evidence. We ran the Confirmatory Factor Analysis (CFA) in AMOS version 22. To evaluate the Structural Equation Models (SEM), we chose χ², the χ²/df ratio, the Root Mean Square Error of Approximation (RMSEA), and the Tucker-Lewis Index (TLI) as reference statistics. Following Kenny (2016), we preferred the more conservative TLI to the CFI (Comparative Fit Index). We refrained from using the Normed Fit Index (NFI), as it does not penalize model complexity (Kenny 2016), and from using the Goodness-of-Fit Index (GFI) and the Adjusted Goodness-of-Fit Index (AGFI), following Sharma et al. (2005). Nevertheless, to allow for a comparison with the results from the Brazilian sample (Navarro et al. 2015), we report NFI, GFI and AGFI further down. We relied on Pearson correlations to evaluate the GD's relationships to other instruments.
To check whether it was adequate to aggregate the data at team level, we calculated r_wg(j), ICC(1), and ICC(2), following LeBreton and Senter (2008). We required r_wg(j) to be above .70, ICC(1) above .10, and ICC(2) above .30. According to LeBreton and Senter (2008), r_wg(j) > .70 indicates "strong agreement"; however, this interpretation depends on choosing an adequate distribution for the null hypothesis, as Biemann

Confirmatory Factor Analysis
In the CFA, we accepted TLI at 0.95 or greater, RMSEA at 0.06 or lower (Hu and Bentler 1999), and a χ²/df ratio of 5 or lower (Schumacker and Lomax 2004). By these criteria, the results confirmed the unidimensional structure of the GD questionnaire in both samples. NFI, GFI and AGFI were similar to those found by Navarro et al. (2015) in Brazil (Table 5).
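The acceptance rule can be illustrated with a minimal Python sketch that computes RMSEA and TLI from their common textbook formulas and applies the cut-offs of Hu and Bentler (1999) and Schumacker and Lomax (2004). The function names and all numeric inputs are hypothetical and are not the values behind Table 5; software packages may also differ in detail (e.g., N versus N - 1 in the RMSEA denominator).

```python
import math


def rmsea(chi2, df, n):
    """RMSEA using the N - 1 denominator (an assumption; some software uses N)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis Index from the target model and the independence (baseline) model."""
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    return (ratio_base - ratio_model) / (ratio_base - 1.0)


def acceptable_fit(chi2, df, n, chi2_base, df_base):
    """Apply the cut-offs TLI >= .95, RMSEA <= .06, and chi2/df <= 5."""
    return (
        tli(chi2, df, chi2_base, df_base) >= 0.95
        and rmsea(chi2, df, n) <= 0.06
        and chi2 / df <= 5.0
    )


# Hypothetical values for a one-factor model with 8 items
# (36 observed moments - 16 free parameters = 20 df; baseline model: 28 df).
example = dict(chi2=40.0, df=20, n=500, chi2_base=1500.0, df_base=28)
print(round(rmsea(40.0, 20, 500), 3))
print(round(tli(40.0, 20, 1500.0, 28), 3))
print(acceptable_fit(**example))
```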
Since the constructs selected for checking convergent validity showed high correlations among each other and with the GD (Table 3 and Table 4), we checked whether it was appropriate to measure these variables as separate constructs, or whether all items rather represented one single common construct. A Common Latent Factor model did not converge due to a Heywood case. Thus, we conducted the following three additional CFAs with the members sample, including the items of the constructs GD, TCL mutual trust, TCL motivation, SABKWSE, and F-A-T solidarity to check the measurement model (MM): 1) The plain measurement model (MM 1 in Table 4) in which each item was only related to its respective construct. 2) A model (MM 2 in Table 4) which treated all available items of the constructs GD, TCL mutual trust, TCL motivation, SABKWSE, and F-A-T Solidarity as items of one single construct: general satisfaction with the team (GST). 3) For comparison, we ran MM 1 and added an exogenous variable (GST) that related to the endogenous constructs GD, TCL mutual trust, TCL motivation, SABKWSE, and F-A-T Solidarity (CFA 3 in Table 5).
MM 1 (no common factor) showed the best fit, MM 3 had a slightly worse fit, and MM 2 (only one construct) showed much lower fit indices, even in the indices penalizing model complexity. Consequently, it was adequate to treat the measurements as separate constructs, even though it is plausible that a general factor had a significant impact on all of them.
To evaluate the GD's convergent validity at item level, we calculated Composite Reliability (CR) and Average Variance Extracted (AVE). Convergent validity, acceptable with CR > .7 and AVE > .5 (Fornell and Larcker 1981), was partially confirmed, as AVE missed the quality criterion by .02, while CR met the defined quality standard (Table 6).
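For readers who want to reproduce such indices, the sketch below computes CR and AVE from standardized factor loadings using the usual Fornell and Larcker (1981) formulas, assuming uncorrelated errors and error variances of 1 minus the squared loading. The loadings are hypothetical and were chosen only to show how CR can pass its threshold while AVE falls slightly short; they are not the loadings behind Table 6.

```python
def composite_reliability(loadings):
    """CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    total = sum(loadings)
    error = sum(1.0 - value ** 2 for value in loadings)
    return total ** 2 / (total ** 2 + error)


def average_variance_extracted(loadings):
    """AVE = mean of the squared standardized loadings."""
    return sum(value ** 2 for value in loadings) / len(loadings)


# Hypothetical standardized loadings for an 8-item, single-factor scale.
loadings = [0.75, 0.72, 0.70, 0.68, 0.66, 0.71, 0.69, 0.64]
cr = composite_reliability(loadings)
ave = average_variance_extracted(loadings)
print(f"CR = {cr:.2f}, AVE = {ave:.2f}")
print("CR > .7:", cr > 0.7, "| AVE > .5:", ave > 0.5)
```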
In both samples, Cronbach's α of the complete 8-item scale was greater than .80 (Table 6) and thus, the internal consistency was acceptable.
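Cronbach's α follows directly from the item variances and the variance of the total score. The following minimal sketch uses only the Python standard library; the response matrix is a tiny hypothetical example with 3 items, not data from this study, and the function assumes complete responses.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows is a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)


# Hypothetical answers of 5 respondents to 3 items on a 5-point scale.
answers = [
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 3],
]
alpha = cronbach_alpha(answers)
print(f"alpha = {alpha:.2f}")
```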
If the difference in CFI was .01 or smaller (Cheung and Rensvold 2002), we regarded models as invariant. In the sample of team members, the GD showed invariance between the group of researchers and the other job types, as well as between male and female participants: ΔCFI was .002 in both invariance tests. The leaders sample was too small for testing invariance. Although we found configural invariance between the Brazilian sample (Navarro et al. 2015) and our German sample, metric invariance was not confirmed with ΔCFI at .016.
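The ΔCFI decision rule of Cheung and Rensvold (2002) can be written down compactly. In this sketch, the CFI formula is the common textbook version, the function names are hypothetical, and the CFI levels in the usage example are invented; only the two differences (.002 and .016) correspond to values reported above.

```python
def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative Fit Index from the target model and the baseline model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base


def is_invariant(cfi_less_constrained, cfi_more_constrained, threshold=0.01):
    """Regard models as invariant if CFI drops by no more than the threshold."""
    delta = round(cfi_less_constrained - cfi_more_constrained, 3)
    return delta <= threshold


# Hypothetical CFI levels; the differences mirror the reported values.
print(is_invariant(0.962, 0.960))  # difference of .002
print(is_invariant(0.962, 0.946))  # difference of .016
```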

Measurement Level
In the validation study by Navarro et al. (2016), all variables (including GD) were successfully aggregated at group level. We calculated r_wg(j), ICC(1), and ICC(2) for all 133 teams in which we had measurements from at least two team members available. The mean number of answered questionnaires per team was k = 3.06. The results are shown in Table 7. For GD, r_wg(j) was above .70 in all 133 teams. In 56 teams, we had a group score based on at least two team members, and data submitted by the leader as well.
In this sample, the Pearson correlation between the mean scores produced by the team members and the leader's score was r = .25 (p = .07). Apparently, the leaders rated GD in their teams more positively: while the range of scores was [2.38, 4.50] for team members, it was [3.13, 5.00] for team leaders.
Note. DE = Germany; BR = Brazil. χ² is the Chi-Square statistic (CMIN in Amos 22), and df is the respective number of degrees of freedom. p (χ²) is the significance level of the χ² statistic, named P in Amos 22. TLI = Tucker-Lewis Index; RMSEA = Root Mean Square Error of Approximation; NFI = Normed Fit Index; GFI = Goodness-of-Fit Index; AGFI = Adjusted Goodness-of-Fit Index. n.a. = not available.
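The agreement and reliability indices used for the aggregation decision can be sketched as follows. The sketch assumes a uniform null distribution for r_wg(j) (expected variance (A² - 1)/12 for A response options), the one-way ANOVA formulas for ICC(1) and ICC(2), and the mean team size as k for unequal teams; these are common simplifications, not necessarily the exact procedure applied here, and all data are hypothetical.

```python
from statistics import mean, variance


def rwg_j(team_rows, n_options=5):
    """r_wg(j) for one team; team_rows holds one list of item scores per member."""
    j = len(team_rows[0])
    mean_item_var = mean(variance([row[i] for row in team_rows]) for i in range(j))
    ratio = mean_item_var / ((n_options ** 2 - 1) / 12.0)  # uniform null distribution
    ratio = min(ratio, 1.0)  # common convention: truncate so that r_wg(j) >= 0
    return j * (1 - ratio) / (j * (1 - ratio) + ratio)


def icc1_icc2(team_scores):
    """ICC(1) and ICC(2) from a one-way ANOVA; team_scores maps team -> member scores."""
    groups = list(team_scores.values())
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    msb = ss_between / (len(groups) - 1)
    msw = ss_within / (n_total - len(groups))
    k = n_total / len(groups)  # mean team size; a simplification for unequal teams
    return (msb - msw) / (msb + (k - 1) * msw), (msb - msw) / msb


# Hypothetical team of 3 members answering 4 items on a 5-point scale.
team_rows = [[4, 4, 5, 4], [4, 5, 4, 4], [5, 4, 4, 4]]
print(round(rwg_j(team_rows), 3))

# Hypothetical scale scores of 3 teams with 3 members each.
team_scores = {"A": [4.0, 4.2, 4.4], "B": [3.0, 3.2, 3.4], "C": [3.6, 3.8, 4.0]}
icc1, icc2 = icc1_icc2(team_scores)
print(round(icc1, 3), round(icc2, 3))
```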

Convergent, Discriminant, and Concurrent Validity
As an additional evaluation of the GD scale's validity, we correlated the GD's sum-score to the scores resulting from the other instruments. We did this both at individual and at team level, since the aggregation of the data at team level had proven adequate, and since this allowed spotting possible inconsistencies across these levels.
The high correlation of the GD with the TCL dimension mutual trust demonstrates that it covers high-quality interpersonal relationships and proactive engagement towards team purposes. The GD's correlations with the SABKWSE items, reflecting group potency, and with the F-A-T factor solidarity were lower, yet relevant. The TCL factor democracy was the criterion variable that correlated lowest with the GD; democracy reflects a specific team process not directly covered by the GD instrument (Table 3). A very similar pattern was found when repeating the correlation analysis at team level (Table 4).
Concurrent validity of the GD was confirmed, showing high correlation (p < .05) with the criterion motivation and interest (Table 3, Table 4).
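Running the same correlation at both levels amounts to correlating the raw member scores once and the team means once. The following sketch shows this with a hand-written Pearson coefficient; the records, the field names, and the helper functions are hypothetical and serve only to illustrate the two-level procedure.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def team_means(records, key):
    """Aggregate one variable to team level (team means, ordered by team label)."""
    by_team = defaultdict(list)
    for record in records:
        by_team[record["team"]].append(record[key])
    return [mean(by_team[team]) for team in sorted(by_team)]


# Hypothetical member-level scores for GD and the criterion variable.
records = [
    {"team": "A", "gd": 4.1, "interest": 4.3},
    {"team": "A", "gd": 3.9, "interest": 4.0},
    {"team": "B", "gd": 3.0, "interest": 3.2},
    {"team": "B", "gd": 3.4, "interest": 3.1},
    {"team": "C", "gd": 3.6, "interest": 3.9},
    {"team": "C", "gd": 3.8, "interest": 3.5},
]
r_individual = pearson([r["gd"] for r in records], [r["interest"] for r in records])
r_team = pearson(team_means(records, "gd"), team_means(records, "interest"))
print(round(r_individual, 2), round(r_team, 2))
```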

Discussion
The GD questionnaire is an attractive option for researchers who want to measure a group's capacity to work together. It represents a psychosocial process that shows strong relationships with other theoretically related variables both in previous studies (entitativity, group identification) and in this study (solidarity, mutual trust, motivation and interest in the group's tasks, group potency). It also relates to relevant outcome variables (team effectiveness, order and hygiene). The GD does not suffer from the limitations of other instruments, since it measures a continuous variable and not developmental group stages, and since it covers more key facets of well-developed groups than, for example, group cohesiveness. Results also show evidence for the intercultural applicability of the GD.

Main Findings
The results confirmed the construct validity of the German version of the GD. Its unidimensional structure was reproduced in the CFA. The GD scores among team members allowed for aggregating scores at group level. The GD also showed convergent and discriminant validity, correlating as expected with similar group constructs, both at an individual and a group level. The high correlation of the GD with the motivation of team members to engage in the team's tasks confirmed the concurrent validity. Despite the high correlations among other group constructs, additional CFAs confirmed the measurement model and showed that the items used represented distinct variables. The instrument showed measurement invariance regarding gender and job type. The good internal consistency is in line with results from other countries. Except for a marginally insufficient AVE, the instrument met all defined quality criteria.
The results show that the construct GD is furthermore applicable across different cultures. The characteristics of well-developed groups form a single measurement dimension in Spanish, Brazilian, and German samples. This indicates a broad applicability of the GD and may be relevant for developing theoretical models in teamwork investigation.
Since interrater agreement varied considerably between teams, we concluded that the GD is primarily an individual-level measurement, even though aggregation at group level is possible if teams with low r_wg(j) scores are eliminated from the sample.

Limitations
We used two samples from only one research & development organization. To verify that our results are generalizable, we recommend gathering more data from diverse organizations and sectors in Germany. The overrepresentation of male employees and of researchers was not critical, since the model was invariant across gender and job types. Nonetheless, other not-represented grouping variables could have a yet undetected influence on the measurement model. Furthermore, single-source bias may have inflated the correlation between the GD and its criterion variable of concurrent validity. Future research should include external data as validation criteria of successful teamwork, such as financial results or efficiency measures based on stakeholders' opinions. AVE was probably low because the scale reflects four semantically different aspects of well-developed groups. Although these aspects form a single factor, the items were not formulated to be parallel. In our view, this does not jeopardize the GD's validity, nor its applicability.
Notes. N = 133 teams (408 participants); mean number of answered questionnaires per team k = 3.06.

Implications for Future Research and Practice
The GD has a solid theoretical basis and refers to the relevant streams of research on group development and on the question of what defines a team or workgroup. It continues the research line on groupality and takes aspects from the construct families of entitativity and groupness into account. It has also shown a stable factorial structure across different cultures and good reliability. We recommend using the GD to analyze the processes that make team members work together effectively. It can serve as a process variable in IPO models, or as a mediator in the more recently recommended IMOI (Input-Mediator-Output-Input) models. We recommend researching the factors that have a positive impact on GD, such as factors related to the organization, the task, the working environment, or the leadership. Regarding the GD itself, we recommend translating the instrument into more languages and continuing the validation, since this would increase its applicability in cross-cultural research. The GD can help further research on the construct of group development, since it integrates the main shared properties of the three research branches on groupness, entitativity, and groupality in a single construct. In doing so, the GD does not mingle states and processes, as the construct of entitativity does (as defined by Lickel); instead, it provides a defined and unidimensional assessment of group development as a process. Additionally, the GD overcomes the limitations of classic stage models of group development: in a sample such as the one used in our study, which is mostly composed of teams that have a history of working together on projects, models such as those presented by Tuckman and Jensen, or Wheelan and Hochberger, would be expected to differentiate rather poorly among the teams.
Since the GD measures, as a continuous variable, the extent to which the social element of a group of people working together has actually developed, we also captured variance among teams that were not recently formed. Moreover, previous research, cited above, has already shown that, compared to Wheelan and Hochberger's (1996) GDQ, the GD explains additional variance of group effectiveness, and that the GD is a significant predictor of the ultimate managerial goals of group performance and group effectiveness. In summary, the GD is particularly helpful for predicting group performance and for use when most teams are expected to be at a similar stage already (e.g., performing).
To practitioners, we recommend using the GD for evaluating work teams, both because the instrument is quick and easy to use and because it meets high standards regarding validity and reliability. Being based on several key criteria of group functioning, it gives practitioners more clues on what to do when low scores are measured, compared to stage-based instruments and measurements of group cohesion. The GD is a good predictor of team effectiveness and thus of high relevance to HR experts and management.

Summary
Summarizing, we recommend the German version of the GD questionnaire and propose its translation to more languages.

Data Policy
The datasets generated during and analyzed during the current study are available from the corresponding author on reasonable request.