1 Introduction

Self-concepts of ability are of great importance to students because they predict key outcomes such as achievement (e.g., Wolff et al., 2021), interest (e.g., Marsh et al., 2005), and career decisions (e.g., Jansen et al., 2021). Among the multiple factors involved in students’ self-concept formation, recent research has emphasized the significance of three comparisons, each the subject of a specific comparison theory: social comparisons (Festinger, 1954), where students compare their achievement with that of others, dimensional comparisons (Möller & Marsh, 2013), where students compare their achievement in different domains, and temporal comparisons (Albert, 1977), where students compare their achievement across time. Social, dimensional, and temporal comparisons have in common that they relate a target (i.e., the achievement of the person, in the domain, or at the time point of interest) to a standard (i.e., the achievement of the person, in the domain, or at the time point of reference). Moreover, all three can be directed downward (i.e., the target is superior to the standard), laterally (i.e., the target is similar to the standard), or upward (i.e., the target is inferior to the standard).

To date, several studies have suggested that social, dimensional, and temporal downward comparisons usually increase students’ self-concepts, whereas social, dimensional, and temporal upward comparisons decrease students’ self-concepts (e.g., Wolff & Möller, 2022). However, most of these studies are non-experimental and thus do not allow causal conclusions about the impact of the comparisons. Specifically, only three experimental studies so far have examined the joint effects of social, dimensional, and temporal comparisons on students’ assessments of their own abilities (Wolff et al., 2018b; Zell & Strickhouser, 2020), and only one of them found a significant effect of temporal comparisons (Wolff et al., 2018b).

One explanation for this inconsistency could be the fact that temporal comparisons in the experimental studies referred to achievement changes within a period of less than one hour, whereas temporal comparisons in the non-experimental studies referred to periods of months or even years. Therefore, in this research I conducted a longitudinal experiment in which students assessed their self-concepts after having received manipulated feedback on their achievement on two ability tests completed at two different measurement points at a distance of two weeks. Moreover, I investigated the joint effects of social, dimensional, and temporal comparisons on external ratings of ability. For this purpose, I asked students to assess a peer’s ability from both the peer’s perspective (“students’ inferred self-concepts of a peer”) and from their own perspective (“students’ own assessments of a peer’s ability”).

Overall, this study is of great relevance in both theoretical and practical respects. From a theoretical perspective, the investigation of the joint effects of social, dimensional, and temporal comparisons is particularly important with respect to the development of a comprehensive comparison theory of academic self-concept formation. As noted above, social, dimensional, and temporal comparisons each originate in a specific comparison theory. However, given the possible joint influences of these comparisons on students’ self-concepts, it would be desirable to integrate them into one comparison theory at some point. This comparison theory should also address the question of the extent to which the effects of social, dimensional, and temporal comparisons can be generalized to different kinds of ability assessment, and offer explanations for possible differences between comparison effects on self- and external assessments. To this end, it is beneficial to find out whether comparison effects on external ratings of ability depend on whether these ratings are made from the other person’s perspective (i.e., are inferences about others’ self-concepts) or from one’s own perspective (i.e., are one’s own assessments of others’ abilities), because such knowledge could explain possible differences between comparison effects on self- and external assessments of ability. For example, it would be possible that comparison effects on self- and external assessments differ because certain comparison information has different relevance in various kinds of ability ratings (e.g., as individuals also make comparisons to serve certain motivations, such as the desire to feel good; Wolff et al., 2018a). However, it is also conceivable that differences in comparison effects between self- and external assessments result from deficient abilities to put oneself in another person’s shoes and to intuit this person’s cognitive processes (e.g., as individuals conduct comparisons partly on an unconscious level; Wolff et al., 2020a).

In practical terms, this study is particularly relevant with regard to the role of temporal comparisons in the process of students’ self-concept formation. Given that comparisons influence students’ self-concepts, the question arises to what extent these comparisons could be used, for example, by teachers or parents as a means of promoting students’ self-concepts. In this regard, temporal comparisons are particularly promising, as in many cases it should be relatively easy for teachers to point their students toward achievement improvements (or at least competence gains); for instance, by applying an individualized teacher frame of reference (e.g., Helm et al., 2023). In contrast, it may be more difficult for teachers to counteract undesirable influences from social and dimensional comparisons, because teachers have limited influence on achievement differences in the classroom, which are particularly predictive of social comparison effects (e.g., Marsh et al., 2014), and because dimensional comparisons are a “double-edged sword” (Möller & Marsh, 2013, p. 546) in that positive effects of dimensional comparisons on self-concept in one domain are accompanied by negative effects on self-concept in the other domain. Beyond that, studying the influence of comparisons on different kinds of ability assessments is of high practical importance. For optimal promotion of students’ academic self-concepts, teachers should be able to replicate the cognitive processes their students go through when assessing their self-concepts. However, if teachers have limited or no ability to do this, it would be advisable to educate them about the influences of comparisons on different kinds of ability assessment.

2 Literature overview

2.1 Comparison effects on students’ self-concepts

In recent years, the joint effects of social, dimensional, and temporal comparisons on students’ self-concepts of ability have been investigated in several non-experimental studies. In particular, researchers have conducted 12 studies testing the 2I/E model (Wolff et al., 2019; see Wolff & Möller, 2022, for a meta-analysis). In this model, students’ math and verbal self-concepts are regressed on their math and verbal achievement levels and achievement changes during a specific period, and the effects of achievement levels and changes on self-concepts are interpreted as social, dimensional, and temporal comparison effects, respectively. In addition, some non-experimental studies have regressed students’ math and verbal self-concepts on their direct evaluations of their math or verbal achievements in comparison to their classmates, their achievement in the other domain, and their prior achievement in the same domain (Müller-Kalthoff et al., 2017a, Study 3; Wolff et al., 2018b, Study 2). Overall, these studies suggest that social, dimensional, and temporal comparisons are involved in the formation of students’ domain-specific self-concepts. Furthermore, they indicate that social comparisons have the strongest effect, followed by dimensional and then temporal comparisons. According to Wolff et al. (2018b), the relative strength of comparison effects might reflect the importance of the different types of comparison in modern societies and their salience in students’ everyday lives.
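The regression structure of the 2I/E model described above can be written schematically for the math self-concept. The equation below is a simplified sketch: the symbols (MSC for math self-concept, MA and VA for math and verbal achievement levels, ΔMA for the change in math achievement over the period considered) are illustrative notation introduced here, and any further predictors used in the cited studies are omitted.

```latex
\mathrm{MSC} = \beta_0
  + \beta_{\mathrm{soc}}\,\mathrm{MA}
  + \beta_{\mathrm{dim}}\,\mathrm{VA}
  + \beta_{\mathrm{temp}}\,\Delta\mathrm{MA}
  + \varepsilon
```

In this notation, the path from the same-domain achievement level is read as the social comparison effect, the path from the other-domain achievement level as the dimensional comparison effect, and the path from the same-domain achievement change as the temporal comparison effect. The verbal self-concept would be modeled analogously, with the roles of the two domains exchanged.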

However, to interpret the comparison effects in a causal sense, it is necessary to replicate the findings from non-experimental studies using experimental designs. To date, researchers have conducted several experiments examining the joint effects of either social and dimensional comparisons (Möller & Köller, 2001; Pohlmann & Möller, 2006; Strickhouser & Zell, 2015) or social and temporal comparisons (Van Yperen & Leander, 2014; Zell & Alicke, 2009, 2010) on students’ self-concepts. In sum, these studies found support for the joint effects of both comparisons examined. Moreover, the social comparison effects were usually substantially stronger, which led Van Yperen and Leander (2014, p. 676) to speak about “the overpowering effect of social comparison information” (TOESCI).

Considering that the experimental studies examining two comparison types found significant effects of both comparisons on students’ domain-specific self-concepts, it would be plausible that experimental studies examining all three comparison types together would similarly find significant effects of all three comparisons. However, this was shown to be the case in only one of three studies (Wolff et al., 2018b, Study 1). In this study, students rated their ability to solve figure analogy tasks after they received manipulated feedback concerning their achievement in a figure analogies test. According to this feedback, their achievement in the figure analogies test was either better or worse than the mean achievement in a reference group in this test, than their achievement in a word analogies test, and in the second part of the figure analogies test compared to the first part. Consistent with the findings from most non-experimental studies, the social comparison feedback showed the strongest effect on students’ self-concept, followed by dimensional comparison feedback, then temporal comparison feedback.

Building on Wolff et al.’s (2018b) findings, Zell and Strickhouser (2020) examined the joint effects of social, dimensional, and temporal comparisons on students’ self-perceived ability to solve verbal reasoning tasks in two experimental studies. In addition to a verbal reasoning test, the students also worked on a quantitative reasoning test. Study 1 was conceptually similar to Wolff et al.’s (2018b) experiment. In Study 2, the students received achievement feedback that triggered an upward or downward comparison with respect to only one comparison type, while lateral comparisons were triggered with respect to the other two comparison types. Interestingly, in both studies the only significant comparison effects were social and dimensional, while the temporal comparison effects were nonsignificant. Moreover, in both studies the social comparison effects were significantly stronger than the other comparison effects.

One explanation for the contradictory findings between Wolff et al.’s (2018b) and Zell and Strickhouser’s (2020) experiments could be that the researchers investigated comparison effects on self-concepts in different domains. However, this argument becomes less plausible considering findings from non-experimental studies revealing significant effects of all three comparison types on students’ math and verbal self-concepts. Therefore, a more reasonable explanation for the lack of temporal comparison effects in Zell and Strickhouser’s (2020) studies could be the fact that achievement feedback in these studies referred to achievement changes over a period of a few minutes. Although this limitation also applies to Wolff et al.’s (2018b) experiment, the explanation is strengthened by the fact that Zell and Alicke (2009, 2010) found significant social and temporal comparison effects in two studies in which students completed a bogus social sensitivity test over the course of several weeks before their social sensitivity was assessed.

In summary, Zell and Strickhouser’s (2020) findings call into question the role of temporal comparisons in the process of students’ self-concept formation. Accordingly, there is a need for an experimental study that examines the joint effects of social, dimensional, and temporal comparisons based on achievement feedback that addresses more than one measurement point. Ideally, such a study should examine comparison effects on self-concepts in different domains to uncover potential dependencies on the domain under investigation.

2.2 Comparison effects on external ratings of ability

The findings on the influence of comparisons on ability assessments become even more ambiguous when a distinction is made between self- and external ratings. In the only non-experimental study that has examined the effects of social, dimensional, and temporal comparisons on self- and external ratings, Wolff et al. (2020b) found significant effects of all three types of comparison on students’ math and verbal self-concepts and parents’ assessments of their children’s math and verbal abilities. For both kinds of assessment the social comparison effects were stronger than the dimensional and temporal comparison effects. However, the effects of dimensional and temporal comparisons were significantly weaker for the external assessments.

Unlike Wolff et al. (2020b), most non-experimental studies examining the effects of social and dimensional comparisons on math and verbal ability ratings by others (Dai, 2002; Helm et al., 2018; Lösch et al., 2017; Marsh et al., 1984; Pohlmann et al., 2004; Van Zanden et al., 2017) have found no significant dimensional contrast effects (i.e., the higher ability ratings in one domain the lower the achievement in the other) or have even found dimensional assimilation effects (i.e., the higher ability ratings in one domain the higher the achievement in the other). However, it is possible that the direction and strength of dimensional comparison effects depend on the kind of ability rating. For example, Van Zanden et al. (2017) found significant dimensional contrast effects when parents were asked to infer their children’s self-concepts, whereas the dimensional comparison effects were nonsignificant when the parents rated their children’s abilities from their own perspective.

In experimental studies, dimensional comparison effects on ability ratings by others have so far been investigated for inferred self-concepts only. In particular, Müller-Kalthoff et al. (2017a, Studies 1–2) examined dimensional comparison effects along with social and temporal comparison effects in two vignette studies. It is interesting that, while the social comparison effects were the strongest, the temporal comparison effects exceeded the dimensional comparison effects in both studies. This finding is consistent with the findings of Zell and Alicke (2010), who found a stronger influence of temporal upward comparisons when students assessed the social sensitivity of a peer instead of their own. However, it contradicts the findings of Zell and Alicke (2009), who found no significant temporal comparison effect when students’ social sensitivity was assessed by their peers.

To sum up, the findings on the impact of comparisons on external ability assessments are rather inconsistent. In particular, there is a lack of experimental studies examining the joint effects of social, dimensional, and temporal comparisons on one’s own self-concepts, one’s inferred self-concepts of another person, and one’s own ratings of another person’s abilities.

3 The present research

This study is the first to investigate the joint effects of social, dimensional, and temporal comparisons on students’ own self-concepts, their inferred self-concepts of a peer, and their own assessments of a peer’s abilities in the figural and verbal domains, using a longitudinal experimental design. In particular, this approach aimed to generate substantive knowledge about the impact of temporal comparisons in students’ self-concept formation and about possible differences between comparison effects on different kinds of ability assessment.

On the basis of previous findings, and given the longitudinal design of this study, I expected that social comparisons (Hypothesis 1.1), dimensional comparisons (Hypothesis 2.1), and temporal comparisons (Hypothesis 3.1) would show significant effects on students’ own self-concepts, indicating higher self-concepts following downward and lower self-concepts following upward comparisons. I also expected that social comparisons (Hypothesis 1.2), dimensional comparisons (Hypothesis 2.2), and temporal comparisons (Hypothesis 3.2) would show significant effects on students’ inferred self-concepts of a peer, indicating higher inferred self-concepts following downward and lower inferred self-concepts following upward comparisons. However, for students’ own assessments of a peer’s abilities, I only predicted significant social comparison effects (Hypothesis 1.3), indicating higher ability assessments following downward and lower ability assessments following upward comparisons, while examining the effects of dimensional and temporal comparisons exploratorily.

In line with TOESCI, I further expected that the social comparison effects would be significantly stronger than the dimensional comparison effects on students’ own self-concepts (Hypothesis 4.1), on their inferred self-concepts of a peer (Hypothesis 4.2), and on their own assessments of a peer’s abilities (Hypothesis 4.3). Similarly, I assumed that the social comparison effects would be significantly stronger than the temporal comparison effects on students’ own self-concepts (Hypothesis 5.1), on their inferred self-concepts of a peer (Hypothesis 5.2), and on their own assessments of a peer’s abilities (Hypothesis 5.3).

4 Method

4.1 Overview

The study design was similar to that of Zell and Strickhouser (2020, Study 2). At two measurement points (T1 and T2), approximately two weeks apart, students worked on a figural ability test (FAT) and a verbal ability test (VAT). At T2, they received manipulated achievement feedback that triggered two lateral comparisons and one downward, lateral, or upward comparison with regard to their achievement in at least one test (see Table 1). On the basis of this feedback, the students rated their ability to solve FAT and VAT tasks (i.e., their FAT and VAT self-concepts). Similarly, the students also received feedback about the achievement of a (fictitious) peer who had supposedly worked on the two ability tests, and then rated this peer’s ability to solve FAT and VAT tasks from the peer’s perspective (inferred self-concepts of the peer) and from their own perspective (own assessments of the peer’s abilities). Data collection took place online in small groups and lasted almost 2 h per measurement point.

Table 1 Feedback conditions

To examine the effects of social, dimensional, and temporal comparisons, I compared the ability ratings between the two experimental conditions triggering downward versus upward comparisons with regard to the respective comparison type, along with lateral comparisons for the other two comparison types. This approach was similar to that of all three previous experimental studies that investigated the joint effects of social, dimensional, and temporal comparisons, in which the researchers also examined comparison effects by comparing experimental conditions triggering upward versus downward comparisons (Wolff et al., 2018b, Study 1; Zell & Strickhouser, 2020, Studies 1 and 2). Thus, my approach allowed a direct comparison of this study’s findings with those from the previous studies. In addition, I preferred contrasting upward with downward conditions over the alternative of contrasting upward with lateral and downward with lateral conditions, because the former was likely to have higher power (students’ ability ratings should differ more strongly after upward versus downward comparisons than after upward versus lateral or downward versus lateral comparisons). Nevertheless, for additional analyses, the design of this study also included a condition triggering lateral comparisons for all three comparison types (Condition 1 in Table 1).

4.2 Sample

Participants were N = 433 students from 63 German universities and universities of applied sciences. Of these, n = 411 students (94.9%) participated at both measurement points. The mean age in this sample was 23.3 years (SD = 4.71). Most of the students were female (76.9%), German (96.1%), and psychology students (68.6%). Other majors that were named by at least five students (i.e., more than 1%) were sociology (17.5%; 15.6% in combination with psychology), management/business administration (10.5%; 7.5% in combination with psychology), German studies (6.1%; 0.2% in combination with psychology), English studies (4.9%; 1.2% in combination with psychology), education sciences/pedagogy (2.9%), mathematics (2.9%), biology (2.4%; 0.2% in combination with psychology), bio-geosciences (1.2%), computational visualistics (1.2%), philosophy (1.2%), and sport sciences (1.2%). A series of t-tests showed that psychology students and non-psychology students differed neither in their FAT or VAT achievements nor in their FAT or VAT self-concepts (all t < 1, all p ≥ .34). Of the students who participated at both measurement points, between n = 113 and n = 152 students were further excluded from the analyses because of manipulation check failures or technical problems with data storage (see Sect. 4.3.4 for the exclusion counts per analysis and Table S1 in the Supplemental Material for detailed subsample descriptions). Still, these sample sizes allowed for the detection of small-to-medium effects of f² > .04 with a power of > 95%.

Students were recruited mainly via online forum postings and announcements in lectures. As a thank-you for their participation, they had the chance to win vouchers with a total value of about $3,000. Moreover, psychology students received course credits. Participation was voluntary. Participants were randomly assigned to the experimental conditions, and data were collected until each condition included at least 40 participants.

4.3 Materials and variables

4.3.1 Ability tests

The ability tests were composed of the three figural (FAT) and the three verbal (VAT) subtests of the German Intelligence Structure Test 2000 R (I-S-T 2000 R; Liepmann et al., 2007). Each of these subtests consisted of 20 items and was available in two parallel versions for use at T1 (Version A) and T2 (Version C). Two subtests had already been used in previous experiments to examine the influence of comparisons on domain-specific self-concepts (Möller & Köller, 2001; Wolff et al., 2018b). In the present study, I included additional subtests to increase the credibility of the manipulated feedback. The correlations between students’ raw scores in the different figural and verbal subtests at one measurement point ranged from r = .33 to r = .51; those between their raw scores in the same subtests at the different measurement points ranged from r = .45 to r = .67.

4.3.2 Independent variables

The students received two sets of manipulated achievement feedback: one on their own achievement and one on a peer’s achievement. Both sets of feedback were randomly assigned and independent of each other. As shown in Table 1, each set of achievement feedback related to one of the nine experimental conditions and was given in the form of percentile ranks. These percentile ranks indicated what percentage of students in a reference group performed worse than the students themselves, or the peer, respectively. The percentile ranks were varied systematically in relation to Condition 1 (three lateral comparisons): To trigger social downward comparisons, all percentile ranks were increased by 20%, whereas all percentile ranks were decreased by 20% to trigger social upward comparisons. To trigger dimensional downward comparisons, both percentile ranks in the standard domain were decreased by 20%, whereas both percentile ranks in the standard domain were increased by 20% to trigger dimensional upward comparisons. To trigger temporal downward comparisons, both percentile ranks at T1 were decreased by 10% and both percentile ranks at T2 were increased by 10%, whereas both percentile ranks at T1 were increased by 10% and both percentile ranks at T2 were decreased by 10% to trigger temporal upward comparisons. Thus, each upward and downward comparison referred to an achievement difference of 20% (between the student’s/peer’s achievement and the average achievement in the reference group, between the achievement in the FAT and the achievement in the VAT, or between the achievement at T1 and the achievement at T2). This achievement difference was consistent with Zell and Strickhouser (2020, Study 2). However, to increase the credibility of the manipulated feedback, all percentile ranks were also varied randomly by up to 2%. Figure 1 shows an example of the feedback a student could have received during the study.
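The manipulation rules described above can be illustrated with a short sketch. It is not the software used in the study: the function name, the baseline rank of 50 for Condition 1, the reading of "20%" as 20 percentile-rank points, and the use of integer random offsets are all assumptions made for illustration (the actual feedback values are those in Table 1).

```python
import random

BASE_RANK = 50  # ASSUMPTION: baseline for Condition 1; the real values are in Table 1
TESTS = ("FAT", "VAT")
TIMES = ("T1", "T2")


def feedback_ranks(comparison=None, direction=None, target="FAT",
                   base=BASE_RANK, jitter=0, rng=None):
    """Return hypothetical percentile ranks keyed by (test, time point).

    comparison: None (all lateral), "social", "dimensional", or "temporal"
    direction:  "down" (target superior) or "up" (target inferior)
    target:     domain being rated; the other domain is the standard
    jitter:     maximum random offset per rank (the text states up to 2)
    """
    ranks = {(test, time): base for test in TESTS for time in TIMES}
    if comparison is not None:
        if direction not in ("down", "up"):
            raise ValueError("direction must be 'down' or 'up'")
        sign = 1 if direction == "down" else -1
        # ASSUMPTION: "20%" / "10%" are treated as percentile-rank points.
        if comparison == "social":
            for key in ranks:                      # shift all four ranks
                ranks[key] += 20 * sign
        elif comparison == "dimensional":
            standard = "VAT" if target == "FAT" else "FAT"
            for time in TIMES:                     # shift the standard domain only
                ranks[(standard, time)] -= 20 * sign
        elif comparison == "temporal":
            for test in TESTS:                     # T1 and T2 move in opposite directions
                ranks[(test, "T1")] -= 10 * sign
                ranks[(test, "T2")] += 10 * sign
        else:
            raise ValueError("unknown comparison type")
    if jitter:
        rng = rng or random.Random()
        for key in ranks:                          # ASSUMPTION: integer offsets
            ranks[key] += rng.randint(-jitter, jitter)
    return ranks
```

For example, `feedback_ranks("temporal", "down")` yields ranks that are 20 points higher at T2 than at T1 in both tests, which matches the 20% achievement difference described for every upward and downward comparison.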

Fig. 1

Example of achievement feedback. The feedback on the student’s own achievement (above) and the achievement of the peer (below) was presented sequentially and in random order. The order of feedback about achievement in the FAT and in the VAT also varied. For each set of feedback, one of the nine experimental conditions was randomly selected. Moreover, each percentile rank was randomly varied by up to 2%. In the example shown, the feedback on the student’s own achievement was based on Condition 2. The feedback on the achievement of the peer was based on Condition 4. This feedback has been freely translated from German into English.

4.3.3 Dependent variables

Students’ domain-specific own self-concepts, inferred self-concepts of a peer, and own assessments of a peer’s abilities were measured using six analogous items that had been used successfully in previous self-concept research (e.g., Wolff et al., 2018b). These items were adapted slightly for the three different kinds of ability assessment. For example, one item read “Tasks like those in the figural/verbal ability test are easy for me” (students’ own self-concept), “My fellow student thinks: Tasks like those in the figural/verbal ability test are easy for me” (students’ inferred self-concept of a peer), or “Tasks like those in the figural/verbal ability test are easy for my fellow student” (students’ own assessment of a peer’s ability), respectively. Students responded to each item on a Likert scale ranging from 1 = strongly disagree to 6 = strongly agree. Reliabilities of all scales were very high (all .90 ≤ α ≤ .94). A complete list of all items and reliabilities can be found in Table S2 in the Supplemental Material.

4.3.4 Manipulation checks

As a manipulation check, students were asked to rate their own and the peer’s achievement in the FAT and VAT in comparison to the reference group, to the other test, and at T2 compared to T1, on a Likert scale ranging from 1 = much worse to 9 = much better (see Table S3 in the Supplemental Material for exact item formulations). Students were excluded from the respective analyses if they did not tick values above 5 when downward comparisons were induced, below 5 when upward comparisons were induced, or between 3 and 7 when lateral comparisons were induced. Overall, this led to the exclusion of n = 152 students (37.0%) in the analyses examining the comparison effects on students’ FAT self-concept, n = 136 students (33.1%) in the analyses examining the comparison effects on students’ VAT self-concept, n = 113 students (27.5%) in the analyses examining the comparison effects on students’ inferred FAT self-concept of a peer and students’ own assessment of a peer’s ability to solve FAT tasks, and n = 117 students (28.5%) in the analyses examining the comparison effects on students’ inferred VAT self-concept of a peer and students’ own assessment of a peer’s ability to solve VAT tasks.
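The exclusion rule described above can be expressed as a small decision function. This is a minimal sketch under stated assumptions: the function names are hypothetical, the lateral range of 3 to 7 is treated as inclusive, and a student is treated as included in an analysis only if all three ratings for the respective test pass.

```python
def passes_manipulation_check(rating, direction):
    """Check one rating (1 = much worse ... 9 = much better) against the
    comparison direction that the feedback was meant to induce."""
    if not 1 <= rating <= 9:
        raise ValueError("rating must lie on the 1-9 scale")
    if direction == "down":       # downward comparison: target rated as better
        return rating > 5
    if direction == "up":         # upward comparison: target rated as worse
        return rating < 5
    if direction == "lateral":    # ASSUMPTION: bounds 3 and 7 are inclusive
        return 3 <= rating <= 7
    raise ValueError("direction must be 'down', 'up', or 'lateral'")


def included_in_analysis(ratings, induced):
    """ASSUMPTION: inclusion requires that the social, dimensional, and
    temporal ratings all match the induced directions."""
    return all(passes_manipulation_check(ratings[kind], induced[kind])
               for kind in induced)


# Hypothetical student in a condition inducing a social downward comparison.
induced = {"social": "down", "dimensional": "lateral", "temporal": "lateral"}
ratings = {"social": 7, "dimensional": 5, "temporal": 4}
print(included_in_analysis(ratings, induced))
```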

Table 2 presents how many students responded to the single items used for the manipulation check in a way that was consistent with the experimental manipulation. As shown, students responded to the items more consistently with the manipulation if the items referred to achievement ratings in temporal comparison rather than social comparison (Δ = 4.0%) or dimensional comparison (Δ = 10.8%), and to achievement ratings in social comparison rather than dimensional comparison (Δ = 6.8%). Furthermore, the students responded to the items more consistently with the manipulation if the items referred to their peer’s achievement rather than their own achievement (Δ = 4.4%). The domain to which the items referred was not substantially related to the frequency of responses consistent with the manipulation (Δ = 0.8%).

Table 2 Percentage of students responding to single items used for the manipulation check consistently with the experimental manipulation

Although the students did not know their actual achievement at the time of the manipulation check (see Sect. 4.4), it is conceivable that they developed an impression of their actual achievement as they worked through the ability tests, which could have affected their achievement ratings in addition to the manipulated achievement feedback. Therefore, I conducted additional analyses investigating whether students’ ratings of their own achievement were more likely to fail the manipulation check, the more that students’ purported achievement differed from their actual achievement. However, there was little evidence, at best, for this assumption.

On the one hand, I found no significant relations between a “discrepancy variable”, which I calculated by summing the absolute values of the four differences between the corresponding purported and actual percentile ranks in the two ability tests at the two measurement points (the actual percentile ranks being calculated using the age-specific norms from the manual of the I-S-T 2000 R; Liepmann et al., 2007), and the two variables that indicated whether the students were included (coded as 1) or not included (coded as 0) in the analyses examining comparison effects on their FAT self-concept (r = –.06, p = .20) and VAT self-concept (r = –.07, p = .17). On the other hand, I mostly found no significant relations between the variables indicating whether students’ achievement ratings in social, dimensional, and temporal comparisons were consistent or inconsistent with the experimental manipulation and relevant achievement variables (see Table 3). Only students in the experimental conditions that triggered social downward comparisons (i.e., Conditions 2 and 5 or 7) were more likely to rate their achievement in one test in social comparison consistently with the manipulation (i.e., to tick values above 5) the better they actually performed in the respective test, and students in the experimental condition that triggered temporal downward comparisons (i.e., Condition 8) were more likely to rate their achievement in one test in temporal comparison consistently with the manipulation (i.e., to tick values above 5) the more they actually improved (or the less they actually worsened) in the respective test (however, for the FAT, these relations were only statistically significant at the 10% significance level).
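The computation of the discrepancy variable and its correlation with the inclusion indicator can be sketched as follows. The function names and all numbers are hypothetical and serve only to illustrate the calculation; they are not data from this study, and the analyses reported here were not necessarily run this way.

```python
from math import sqrt

KEYS = [("FAT", "T1"), ("FAT", "T2"), ("VAT", "T1"), ("VAT", "T2")]


def discrepancy(purported, actual):
    """Sum of the absolute differences between purported and actual
    percentile ranks over both tests and both measurement points."""
    return sum(abs(purported[k] - actual[k]) for k in KEYS)


def pearson_r(x, y):
    """Pearson correlation; with a 0/1 inclusion indicator as one variable
    this equals the point-biserial correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical example for one student.
purported = {("FAT", "T1"): 50, ("FAT", "T2"): 50,
             ("VAT", "T1"): 30, ("VAT", "T2"): 30}
actual = {("FAT", "T1"): 62, ("FAT", "T2"): 45,
          ("VAT", "T1"): 30, ("VAT", "T2"): 41}
print(discrepancy(purported, actual))  # 12 + 5 + 0 + 11 = 28
```

Across students, `pearson_r` would be applied to the list of discrepancy values and the list of 0/1 inclusion codes.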

Table 3 Correlations between successful manipulation and students’ actual achievement differences

4.4 Procedure

The study was conducted online via web conferencing in BigBlueButton between November 2020 and July 2022. Students participated in small groups. Each session lasted almost two hours and was moderated by an experimenter who was available to answer students’ questions at any time. When registering for the study, students selected two participation dates about two weeks apart. They were then sent a link to the BigBlueButton room. Data were collected via LimeSurvey questionnaires. The links to these questionnaires were sent within the BigBlueButton sessions. Individual identification numbers were used to merge the data from both measurement points.

At the beginning of the first session, students were told as a cover story that the goal of the study was to test digital versions of figural and verbal ability tests that had been used for several years in paper-and-pencil versions. They were also informed that the figural and verbal tests existed in two parallel forms that were identical in terms of task type and difficulty. In each session, they would complete one parallel form of each test. At the end of the second session, they would receive feedback on their achievement in both tests from both measurement points. This feedback would be generated automatically so that neither the examiner nor the other participants would know the individual results. Moreover, the students would receive feedback about the achievement of another (anonymous) student of the group.

Following this introduction, the students were asked to answer some demographic questions. Subsequently, the experimenter presented them with sample items from the FAT and VAT subtests that were solved as a group practice exercise. After all questions about the subtests were clarified, the students were asked to rate their perceived ability to solve FAT and VAT tasks, based on the sample items presented. Then, the students worked on the six subtests of the first parallel version. In line with the test manual, they had between 6 and 10 min to complete each subtest. A countdown onscreen showed the time remaining. Before the start of each subtest, the respective instructions were repeated within a 30-s break. After having completed all subtests, the students worked on some items that were not relevant for the present study.

At the beginning of the second session, the different task types of the FAT and VAT were briefly reviewed. Subsequently, the students worked on the six subtests of the second parallel version (similar to T1). After that, the experimenter prepared the students for the achievement feedback by explaining how to interpret a percentile rank and interpreting some arbitrary percentile ranks together with the students as a group. As soon as all students had understood how to interpret a percentile rank, they successively received the manipulated feedback on their own achievement and that of the peer. Following each feedback occasion, they were asked to assess their own or the peer’s ability to solve FAT and VAT tasks (the peer’s ability from their own and the peer’s perspective) and to rate their own or the peer’s achievement on the two ability tests, respectively. During these rating processes the relevant percentile ranks were shown. After all students had assessed both their own and the peer’s abilities and achievements, they were informed about the actual purpose of the study. They were also told that the received achievement feedback had been manipulated, and finally received their true results in the ability tests.

The order in which students worked on the FAT and VAT and the test-specific items, as well as the order in which the students received feedback about their own and the peer’s achievement, varied randomly between the groups. Furthermore, it was randomized whether the students should assess the peer’s abilities first from their own or from the peer’s perspective. All items of one scale were also presented in randomized order.

4.5 Statistical analyses

I tested my hypotheses using regression analyses in Mplus 7.4 (Muthén & Muthén, 2015). For model estimation, I used the robust maximum likelihood estimator (MLR). Overall, I specified six models, in which students’ ability assessments (at T2) were regressed on eight dummy variables, indicating the experimental Conditions 2–9 (i.e., Condition 1 served as the reference condition). Figures 2, 3, 4 and 5 illustrate these different models. As shown, Model 1a and Model 1b included students’ own FAT self-concept as criterion. These models only differed from each other in that in Model 1b I also controlled for students’ FAT self-concept at T1. Similarly, Model 2a and Model 2b included students’ own VAT self-concept as criterion and they differed from each other only in that in Model 2b students’ VAT self-concept at T1 was also controlled for. Model 3 included two criteria: students’ inferred FAT self-concept of a peer and students’ own assessment of a peer’s ability to solve FAT tasks. Thus, this model allowed me to test for significant differences in the strength of the same comparison effects on the two different kinds of ability ratings, made by the same participants based on the same information about the peer’s achievement. Similarly, Model 4 included both students’ inferred VAT self-concept of a peer and students’ own assessment of a peer’s ability to solve VAT tasks as criteria.
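The dummy coding described above can be sketched as follows. The sketch relies on a general property of least-squares regression: when the only predictors are dummy variables for all conditions except the reference, each unstandardized coefficient equals the mean of that condition minus the mean of the reference condition. This applies to the models without a covariate (Models 1a, 2a, 3, and 4), not to Models 1b and 2b. The function names and ratings are hypothetical, and the sketch does not reproduce the Mplus estimation or its robust standard errors.

```python
from statistics import mean


def dummy_code(condition, n_conditions=9, reference=1):
    """Return 0/1 indicators for every condition except the reference."""
    return [1 if condition == c else 0
            for c in range(1, n_conditions + 1) if c != reference]


def dummy_coefficients(ratings_by_condition, reference=1):
    """Coefficients of a dummy-only regression: condition mean minus reference mean."""
    ref_mean = mean(ratings_by_condition[reference])
    return {c: mean(values) - ref_mean
            for c, values in ratings_by_condition.items() if c != reference}


# Made-up ability ratings for three of the nine conditions.
ratings = {
    1: [4.0, 5.0, 6.0],   # reference condition (lateral comparisons only)
    2: [6.0, 6.0, 7.5],   # hypothetical downward-comparison condition
    3: [3.0, 4.0, 3.5],   # hypothetical upward-comparison condition
}

print(dummy_code(3))                # [0, 1, 0, 0, 0, 0, 0, 0]
coefficients = dummy_coefficients(ratings)
print(coefficients)                 # {2: 1.5, 3: -1.5}
```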

Fig. 2

FAT = Figural Ability Test. VAT = Verbal Ability Test. Students’ FAT self-concept used as criterion was assessed at the second measurement point (T2). Students’ FAT self-concept used as predictor in Model 1b was assessed at the first measurement point (T1). The “ + ” after the hypotheses indicates that the respective effects are assumed to be positive

Model 1a and Model 1b: prediction of students’ own FAT self-concept before (Model 1a) and after (Model 1b) controlling for prior FAT self-concept.

Fig. 3

FAT = Figural Ability Test. VAT = Verbal Ability Test. Students’ VAT self-concept used as criterion was assessed at the second measurement point (T2). Students’ VAT self-concept used as predictor in Model 2b was assessed at the first measurement point (T1). The “ + ” after the hypotheses indicates that the respective effects are assumed to be positive

Model 2a and Model 2b: prediction of students’ own VAT self-concept before (Model 2a) and after (Model 2b) controlling for prior VAT self-concept.

Fig. 4

FAT = Figural Ability Test. VAT = Verbal Ability Test. The criteria (i.e., students’ inferred FAT self-concept of a peer and students’ assessment of a peer’s ability to solve FAT tasks) were assessed at the second measurement point (T2). The “ + ” after the hypotheses indicates that the respective effects are assumed to be positive

Model 3: prediction of students’ inferred FAT self-concept of a peer and students’ own assessment of a peer’s ability to solve FAT tasks.

Fig. 5

FAT = Figural Ability Test. VAT = Verbal Ability Test. The criteria (i.e., students’ inferred VAT self-concept of a peer and students’ assessment of a peer’s ability to solve VAT tasks) were assessed at the second measurement point (T2). The “ + ” after the hypotheses indicates that the respective effects are assumed to be positive

Model 4: prediction of students’ inferred VAT self-concept of a peer and students’ own assessment of a peer’s ability to solve VAT tasks.

To calculate the comparison effects as defined in Sect. 4.1, I used the Model Constraint option implemented in Mplus to subtract the effects of the conditions that triggered downward and upward comparisons of the same comparison type (see Figs. 2, 3, 4 and 5). Thus, a significant positive difference between the two effects would indicate higher ability ratings after downward comparisons than after upward comparisons, and thus a significant comparison effect, as predicted in Hypotheses 1–3. Nonetheless, to further understand the emergence of potential comparison effects, I also looked at the effects of the dummy variables, which indicated differences in the ability ratings after downward or upward comparisons in relation to lateral comparisons.
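The subtraction described above can be illustrated with a short sketch. It computes a comparison effect as the difference between the coefficients of the downward and upward conditions and derives its standard error with the usual variance rule for a difference of two estimates. The function name and all numbers are hypothetical; the sketch shows the general logic of such a contrast and is not the syntax or output of the Mplus models.

```python
import math


def comparison_effect(b_down, b_up, var_down, var_up, cov_down_up):
    """Difference b_down - b_up and its standard error.

    Var(b_down - b_up) = Var(b_down) + Var(b_up) - 2 * Cov(b_down, b_up)
    """
    estimate = b_down - b_up
    standard_error = math.sqrt(var_down + var_up - 2 * cov_down_up)
    return estimate, standard_error


# Hypothetical coefficients relative to the reference condition.
estimate, standard_error = comparison_effect(
    b_down=0.6, b_up=-0.6,           # made-up dummy-variable effects
    var_down=0.04, var_up=0.04,      # made-up sampling variances
    cov_down_up=0.02,                # made-up sampling covariance
)
print(round(estimate, 2), round(standard_error, 2))  # 1.2 0.2
```

Because both dummy variables share the same reference condition, their estimates are typically positively correlated, which is why the covariance term is subtracted.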

I also used the Model Constraint option to compare the strength of the social, dimensional, and temporal comparison effects on the same ability ratings. This allowed me to test Hypotheses 4 and 5. Moreover, I used the Model Constraint option in Model 3 and Model 4 to compare the strength of the effects of the same comparison types on students’ inferred self-concept of a peer and their own assessment of a peer’s ability. When comparing the strength of different comparison effects, I referred to the absolute values of the comparison effects. I tested all hypothesized effects one-sided (including effects of the dummy variables that contributed to the comparison effects in the expected direction). All other effects were tested two-sided.
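To make the distinction between one-sided and two-sided tests and the comparison of absolute effect sizes concrete, here is a minimal sketch. It assumes a normal (z) reference distribution for the ratio of estimate to standard error and a positive hypothesized direction; the function names and example values are hypothetical, and how the reported p values were obtained in Mplus is not specified here.

```python
from statistics import NormalDist


def p_values(estimate, standard_error):
    """Return (one-sided, two-sided) p values for a z test.

    The one-sided value assumes the hypothesized direction is positive.
    """
    z = estimate / standard_error
    standard_normal = NormalDist()
    one_sided = 1 - standard_normal.cdf(z)
    two_sided = 2 * (1 - standard_normal.cdf(abs(z)))
    return one_sided, two_sided


def strength_difference(effect_a, effect_b):
    """Difference in strength, based on the absolute values of two effects."""
    return abs(effect_a) - abs(effect_b)


one_sided, two_sided = p_values(estimate=0.4, standard_error=0.2)  # z = 2
print(round(one_sided, 4), round(two_sided, 4))

# A made-up social effect of 1.2 versus a made-up temporal effect of -0.3:
print(round(strength_difference(1.2, -0.3), 2))  # 0.9
```

For an effect in the hypothesized direction, the one-sided p value is half the two-sided one, which is why directional hypotheses can reach significance at smaller effect sizes.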

I specified the ability ratings as manifest variables to enhance the comparability of the findings across the different models and with the descriptive statistics. In line with that, the three prior experimental studies examining the joint effects of social, dimensional, and temporal comparisons on students’ self-concepts (Wolff et al., 2018b; Zell & Strickhouser, 2020) had also specified students’ self-concepts as manifest variables. Nevertheless, I note that all my main findings (i.e., all findings concerning the significance of the comparison effects and the comparison of the comparison effects) were also replicated in additional analyses, in which the ability ratings were specified as latent variables. The results of these analyses can be found in Tables S4–S7 in the Supplemental Material. Moreover, the Supplemental Material sets out the syntaxes used to calculate the models in Mplus.

5 Results

5.1 Preliminary analyses

Table 4 shows the means and standard deviations of the ability assessments in the different experimental conditions. Since students were randomly assigned to the experimental conditions, the mean self-concepts at T1 should not differ between the groups. However, although one-way ANOVAs revealed no significant differences between the groups regarding students’ initial FAT self-concept, F(8, 250) = 1.96, p ≥ .05, and initial VAT self-concept, F(8, 266) = 0.50, p = .86, post-hoc tests indicated that students in the DimDownFAT condition had a significantly higher initial FAT self-concept than students in the Reference (∆M = 0.57, p = .04), DimUpFAT (∆M = 0.85, p < .01), DimDownVAT (∆M = 0.74, p = .01), DimUpVAT (∆M = 0.90, p < .01), and TemDown (∆M = 0.70, p = .01) conditions. Moreover, students in the SocDown condition had a significantly higher initial FAT self-concept than students in the DimUpFAT (∆M = 0.55, p = .04) and DimUpVAT (∆M = 0.60, p = .04) conditions (all other |∆M| ≤ 0.47, all p ≥ .10). For the investigation of comparison effects, the difference between the DimDownFAT and DimUpFAT conditions was somewhat problematic, as these two conditions were compared with each other to examine the dimensional comparison effect on students’ FAT self-concept. Accordingly, it was reasonable to conduct an additional analysis in which students’ initial FAT self-concept was controlled for (i.e., Model 1b) to test whether a potential dimensional comparison effect on students’ FAT self-concept at T2 resulted from differences in their FAT self-concept at T1 rather than from the experimental manipulation.

Table 4 Means and standard deviations of ability assessments and sample sizes in the different experimental conditions

5.2 Comparison effects on students’ own self-concepts

Table 5 presents the results of the regression analyses predicting students’ FAT self-concept (Model 1a and Model 1b). Table 6 shows the results of the regression analyses predicting students’ VAT self-concept (Model 2a and Model 2b).

Table 5 Results of the regression analyses predicting students’ own FAT self-concept (Model 1a and Model 1b)
Table 6 Results of the regression analyses predicting students’ own VAT self-concept (Model 2a and Model 2b)

In line with Hypothesis 1.1, there were significant social comparison effects both before and after controlling for prior self-concepts, indicating higher self-concepts following social downward comparisons and lower self-concepts following social upward comparisons (all 1.11 ≤ B ≤ 1.28, all p < .001, all 0.38 ≤ β ≤ 0.49). As predicted in Hypothesis 2.1, the dimensional comparison effects were significant in the analyses that did not control for prior self-concepts, indicating higher self-concepts following dimensional downward comparisons and lower self-concepts following dimensional upward comparisons (all 0.33 ≤ B ≤ 0.42, all p ≤ .08, all 0.12 ≤ β ≤ 0.13). However, after controlling for prior self-concepts, only the dimensional comparison effect on students’ VAT self-concept remained significant (B = 0.32, p = .03, β = 0.12), whereas the dimensional comparison effect on students’ FAT self-concept lost significance (B = –0.06, p = .78). In contrast to Hypothesis 3.1, none of the temporal comparison effects were significant, either before or after controlling for prior self-concepts (all |B| ≤ 0.12, all p ≥ .35).

The comparisons of the different kinds of comparison effects revealed that the social comparison effects were significantly stronger than the dimensional comparison effects in all four analyses (all 0.79 ≤ B ≤ 1.15, all p ≤ .01, all 0.27 ≤ β ≤ 0.38), which supported Hypothesis 4.1. Similarly, and in accord with Hypothesis 5.1, in all four analyses the social comparison effects were significantly stronger than the temporal comparison effects (all 0.99 ≤ B ≤ 1.26, all p < .001, all 0.34 ≤ β ≤ 0.47). The dimensional and temporal comparison effects did not differ from each other with respect to strength in any of the analyses (all |B| ≤ 0.39, all p ≥ .20).

An exploratory inspection of the effects of the dummy variables showed that in all four analyses students’ self-concepts were significantly higher in the SocDown condition than in the Reference condition (all 0.52 ≤ B ≤ 0.70, all p < .01, all 0.17 ≤ β ≤ 0.21), whereas they were significantly lower in the SocUp condition than in the Reference condition (all –0.72 ≤ B ≤ –0.59, all p < .01, all –0.29 ≤ β ≤ –0.18). Students’ FAT self-concept did not differ significantly between the Reference condition and the DimDownFAT condition or between the Reference condition and the DimUpFAT condition (all |B| ≤ 0.27, all p ≥ .16). Their VAT self-concept did not differ significantly between the Reference condition and the DimDownVAT condition (all |B| ≤ 0.09, all p ≥ .50), but was significantly lower in the DimUpVAT condition compared to the Reference condition (all –0.30 ≤ B ≤ –0.23, all p ≤ .05, all –0.11 ≤ β ≤ –0.08). The differences between students’ self-concepts in the Reference condition and the TemDown and TemUp conditions were nonsignificant (all |B| ≤ 0.20, all p ≥ .18), except for a significantly lower FAT self-concept in the TemUp condition after controlling for prior FAT self-concept (B = –0.23, p = .08, β = –0.08).

5.3 Comparison effects on students’ inferred self-concepts of a peer and students’ own assessments of a peer’s ability

Table 7 shows the results of the regression analyses predicting students’ inferred FAT self-concept of a peer and students’ own assessment of a peer’s ability to solve FAT tasks (Model 3). Table 8 presents the results of the regression analyses predicting students’ inferred VAT self-concept of a peer and students’ own assessment of a peer’s ability to solve VAT tasks (Model 4).

Table 7 Results of the regression analyses predicting students’ inferred FAT self-concept of a peer and students’ own assessment of a peer’s ability to solve FAT tasks (Model 3)
Table 8 Results of the regression analyses predicting students’ inferred VAT self-concept of a peer and students’ own assessment of a peer’s ability to solve VAT tasks (Model 4)

In line with Hypothesis 1.2, there were significant social comparison effects on students’ inferred self-concepts of a peer, indicating higher inferred self-concepts following social downward comparisons and lower inferred self-concepts following social upward comparisons (all 1.69 ≤ B ≤ 1.82, all p < .001, all 0.62 ≤ β ≤ 0.63). Moreover, and as predicted in Hypothesis 1.3, there were significant social comparison effects on students’ own assessments of a peer’s ability, indicating higher ability assessments following social downward comparisons and lower ability assessments following social upward comparisons (all 2.09 ≤ B ≤ 2.16, all p < .001, all 0.69 ≤ β ≤ 0.76). Hypothesis 2.2 also found support: There were significant dimensional comparison effects on students’ inferred self-concepts of a peer, indicating higher inferred self-concepts following dimensional downward comparisons and lower inferred self-concepts following dimensional upward comparisons (all 0.40 ≤ B ≤ 0.92, all p ≤ .03, all 0.13 ≤ β ≤ 0.29). Similarly, there were significant dimensional comparison effects on students’ own assessments of a peer’s ability, indicating higher ability assessments following dimensional downward comparisons and lower ability assessments following dimensional upward comparisons (all 0.34 ≤ B ≤ 0.51, all p ≤ .03, all 0.10 ≤ β ≤ 0.17). However, as opposed to Hypothesis 3.2, the temporal comparison effect on students’ inferred FAT self-concept was significant, but indicated a higher inferred FAT self-concept following a temporal upward comparison and a lower inferred FAT self-concept following a temporal downward comparison (B = –0.32, p = .04, β = –0.11), while the temporal comparison effect on students’ inferred VAT self-concept was nonsignificant (B = –0.05, p = .77). The temporal comparison effects on students’ own assessments of a peer’s ability to solve FAT or VAT tasks were also nonsignificant (all |B| ≤ 0.19, all p ≥ .20).

The comparisons of the different kinds of comparison effects on the same ability rating revealed that the social comparison effects were significantly stronger than the dimensional comparison effects, both on students’ inferred self-concepts of a peer (all 0.90 ≤ B ≤ 1.29, all p ≤ .001, all 0.33 ≤ β ≤ 0.51), which supported Hypothesis 4.2, and on their own assessments of a peer’s ability (all 1.58 ≤ B ≤ 1.82, all p ≤ .001, all 0.52 ≤ β ≤ 0.66), which supported Hypothesis 4.3. Furthermore, the social comparison effects were significantly stronger than the temporal comparison effects, both on students’ inferred self-concepts of a peer (all 1.50 ≤ B ≤ 1.64, all p ≤ .001, all 0.51 ≤ β ≤ 0.61), which supported Hypothesis 5.2, and on their own assessments of a peer’s ability (all 1.90 ≤ B ≤ 2.11, all p ≤ .001, all 0.63 ≤ β ≤ 0.75), which supported Hypothesis 5.3. The dimensional and temporal comparison effects mostly did not differ from each other in terms of strength (all |B| ≤ 0.36, all p ≥ .13). Only the dimensional comparison effect on students’ inferred FAT self-concept of a peer was significantly stronger than the temporal comparison effect on students’ inferred FAT self-concept of a peer (B = 0.60, p = .01, β = 0.18).

Three significant differences also emerged between the strength of the same comparison effects on either students’ inferred self-concept of a peer or their own assessment of a peer’s ability. First, the social comparison effect on students’ inferred FAT self-concept of a peer was significantly weaker than the social comparison effect on students’ own assessment of a peer’s ability to solve FAT tasks (B = –0.27, p = .02, β = –0.07). Second, the social comparison effect on students’ inferred VAT self-concept of a peer was significantly weaker than the social comparison effect on students’ own assessment of a peer’s ability to solve VAT tasks (B = –0.47, p < .01, β = –0.13). Third, the dimensional comparison effect on students’ inferred FAT self-concept of a peer was significantly stronger than the dimensional comparison effect on students’ own assessment of a peer’s ability to solve FAT tasks (B = 0.41, p = .02, β = 0.12). No significant differences emerged between the strength of the dimensional comparison effects on students’ inferred VAT self-concept of a peer and on their own assessment of a peer’s ability to solve VAT tasks, or between the strength of the temporal comparison effects on students’ inferred self-concept of a peer and on their own assessment of a peer’s ability in the same domain (all |B| ≤ 0.13, all p ≥ .30).

An exploratory analysis of the effects of the dummy variables showed that for all four ability ratings examined in Model 3 and Model 4, students’ ability ratings were significantly higher in the SocDown condition than in the Reference condition (all 0.52 ≤ B ≤ 0.77, all p < .01, all 0.19 ≤ β ≤ 0.30), whereas they were significantly lower in the SocUp condition than in the Reference condition (all –1.47 ≤ B ≤ –1.14, all p < .001, all –0.46 ≤ β ≤ –0.41). Students’ inferred FAT self-concept of a peer and students’ own assessment of a peer’s ability to solve FAT tasks did not differ significantly between the Reference condition and the DimDownFAT condition (all |B| ≤ 0.25, all p ≥ .12). In contrast, these ability ratings were significantly lower in the DimUpFAT condition compared to the Reference condition (all –0.67 ≤ B ≤ –0.61, all p < .01, all –0.22 ≤ β ≤ –0.20). Similarly, students’ inferred VAT self-concept of a peer and students’ own assessment of a peer’s ability to solve VAT tasks did not differ significantly between the Reference condition and the DimDownVAT condition (all |B| ≤ 0.06, all p ≥ .71), but were significantly lower in the DimUpVAT condition compared to the Reference condition (all –0.38 ≤ B ≤ –0.29, all p ≤ .10, all –0.12 ≤ β ≤ –0.08). Students’ inferred FAT self-concept of a peer and students’ own assessment of a peer’s ability to solve FAT tasks were significantly lower in the TemDown condition compared to the Reference condition (all –0.40 ≤ B ≤ –0.33, all p ≤ .04, all –0.13 ≤ β ≤ –0.10). Students’ inferred VAT self-concept of a peer and students’ own assessment of a peer’s ability to solve VAT tasks did not differ between the TemDown condition and the Reference condition (all |B| ≤ 0.27, all p ≥ .07). Similarly, students’ inferred FAT self-concept of a peer and their own assessment of a peer’s ability to solve FAT tasks, as well as students’ inferred VAT self-concept of a peer and their own assessment of a peer’s ability to solve VAT tasks, did not differ between the TemUp condition and the Reference condition (all |B| ≤ 0.25, all p ≥ .11).

6 Discussion

6.1 Main findings

The present study aimed to examine the joint effects of social, dimensional, and temporal comparisons on students’ own self-concepts, students’ inferred self-concepts of a peer, and students’ own assessments of a peer’s abilities. In accord with previous research, I found significant social and dimensional comparison effects on the different kinds of ability assessment. Furthermore, supporting TOESCI, all social comparison effects were significantly stronger than the other comparison effects.

The dimensional comparison effect on students’ own FAT self-concept lost its significance after controlling for students’ prior FAT self-concept. However, this finding can be explained by the fact that students in the condition triggering dimensional downward comparisons from the perspective of the FAT had already shown an increased FAT self-concept at the beginning of the study. Although this study does not provide clear evidence for the dimensional comparison effect on students’ FAT self-concept, it is plausible that the manipulation to trigger the dimensional comparison effect on students’ FAT self-concept showed no effect because of the pre-existing self-concept differences, which may have limited the scope for further divergence in students’ FAT self-concept. In any case, the absence of the dimensional comparison effect on students’ FAT self-concept after controlling for their prior FAT self-concept should not be interpreted as evidence against the existence of dimensional comparison effects. This is especially true in light of the fact that all other dimensional comparison effects examined in this research were significant.

Despite the longitudinal design of the present study, the temporal comparison effects were mostly nonsignificant. This finding contradicts the findings of Wolff et al.’s (2018b) experimental study, Müller-Kalthoff et al.’s (2017a) vignette studies, and most non-experimental studies investigating the joint effects of social, dimensional, and temporal comparisons on students’ self-concepts at school (e.g., Wolff & Möller, 2022). Still, it is consistent with the findings of the two experimental studies by Zell and Strickhouser (2020), and thus further calls into question the role of temporal comparisons in the formation of ability beliefs. Nevertheless, on the basis of this study’s findings and those of Zell and Strickhouser (2020), it would be premature to conclude that temporal comparisons do not have an impact on the formation of students’ ability beliefs. Rather, it is conceivable that the nonsignificant temporal comparison effects, as well as the significant temporal comparison effect on students’ inferred FAT self-concept of a peer in the direction contrary to that expected, resulted from the fact that the experimental studies examined temporal comparisons between achievements in ability tests of similar difficulty. For this reason, participants could have interpreted achievement improvements to mean that students needed some practice in order to perform above average, which could thus be indicative of relatively low domain-specific abilities. In contrast, achievement declines could have been considered as an indicator of relatively high domain-specific abilities, as students with achievement declines had demonstrated above-average achievement without much practice.

Unlike the experiments on the joint effects of social, dimensional, and temporal comparisons, temporal comparisons in the school context typically refer to achievements based on increasingly demanding assessment criteria. Thus, achievement improvements in school indicate that students have increased their competencies to an above-average extent, whereas achievement declines indicate that students have not increased their competencies to the desired extent. However, achievement declines in school do not imply that students’ competencies have necessarily declined. Therefore, it is plausible that students in school are more inclined to interpret achievement improvements as indicative of high domain-specific abilities, compared to students who receive achievement feedback on two ability tests of similar difficulty. However, empirical research would be necessary to test this conjecture.

In addition to examining temporal comparison effects in a longitudinal experiment, a particular goal of this study was to examine whether the strength of comparison effects depends on the perspective from which students assess a peer’s abilities. Interestingly, I found significantly stronger social comparison effects on students’ own assessment of a peer’s ability than on students’ inferred self-concept of a peer in the figural and verbal domains, whereas the dimensional comparison effect on students’ inferred FAT self-concept of a peer was significantly stronger than that on their own assessment of a peer’s ability to solve FAT tasks. Hence, it seems that social comparisons have a particularly strong influence when students evaluate the abilities of others. In contrast, dimensional comparisons could be given greater weight when students (pretend to) assess their own abilities. A stronger focus on dimensional rather than social comparisons to assess one’s own abilities would make sense, considering that the desire for self-differentiation has been shown to be a key reason for students to conduct dimensional comparisons (Wolff et al., 2018a). Nevertheless, this conclusion should be viewed with caution, since the dimensional comparison effects on students’ inferred VAT self-concept of a peer and on their own assessment of a peer’s ability to solve VAT tasks did not differ from each other.

6.2 Additional findings

As explained in Sect. 4.1, I operationalized the comparison effects by comparing students’ ability ratings between those experimental conditions that triggered upward versus downward comparisons for the respective comparison type (and lateral comparisons for the other two comparison types). I used this approach to be consistent with previous studies and to achieve high statistical power. Nevertheless, analyses of the ability ratings in the conditions triggering upward and downward comparisons for one comparison type in relation to the ability ratings in the Reference condition, triggering lateral comparisons for all three comparison types, yielded some interesting findings.

The first interesting finding is that five of six ability ratings (students’ own VAT self-concept, students’ inferred FAT and VAT self-concepts of a peer, and students’ own assessments of a peer’s ability to solve FAT and VAT tasks) were significantly lower in the corresponding conditions triggering dimensional upward comparisons compared to the Reference condition, whereas none of the ability ratings differed significantly between the corresponding conditions triggering dimensional downward comparisons and the Reference condition. This finding contradicts the assumption of a positive net effect of dimensional comparisons, originally formulated in dimensional comparison theory (Möller & Marsh, 2013). According to this effect, the positive effects of dimensional downward comparisons on students’ self-concepts would be stronger than the negative effects of dimensional upward comparisons. However, although initial studies suggested the existence of a positive net effect of dimensional comparisons (Pohlmann & Möller, 2009), a more recent meta-analysis did not find evidence of a positive net effect (Müller-Kalthoff et al., 2017b). The finding of the present study that for most ability ratings examined, only the effects of dimensional upward comparisons were significant, provides additional support for this conclusion. Moreover, one could even be inclined to interpret this finding as evidence of a negative net effect of dimensional comparisons. However, it is important to note that the effects of dimensional upward and downward comparisons did not differ significantly for any of the six ability ratings (as can be inferred from the 95% confidence intervals), so that this conclusion would go too far.

A second finding worth discussing is that three of the four external ability ratings (students’ inferred FAT self-concept of a peer and students’ own assessments of a peer’s ability to solve FAT and VAT tasks) were significantly more strongly affected by the feedback triggering social upward comparisons than by the feedback triggering social downward comparisons (as can also be inferred from the 95% confidence intervals). Thus, the students in this study were more inclined to devalue the abilities of their peers performing below-average than to enhance the abilities of their peers performing above-average. An explanation for this finding could be self-enhancement motivations, since the students could have been able to develop a more positive perception of their own abilities by devaluing their peers’ abilities (e.g., Wolff et al., 2018a). This explanation would also align with the finding that the effects of social upward and downward comparisons did not differ when students assessed their own self-concepts (although stronger effects of social downward comparisons would also have been plausible to serve self-enhancement motivations). The finding that stronger effects of social upward comparisons than of social downward comparisons were also shown on students’ inferred FAT self-concept of a peer seems to contradict the assumption that the stronger effects of social upward comparisons resulted from self-enhancement motivations, at least at first glance. However, it is conceivable that self-enhancement motivations had such a strong (unconscious) influence in this study that the students were only partially successful in putting themselves into the shoes of their peers, to infer their self-concepts.

Finally, it is interesting to note that students’ inferred FAT self-concept of a peer and students’ own assessment of a peer’s ability to solve FAT tasks were significantly lower in the condition triggering temporal downward comparisons compared to the Reference condition, although temporal downward comparisons were assumed to enhance students’ ability ratings. Moreover, in all other analyses, students’ ability ratings in the conditions triggering temporal downward comparisons were descriptively lower than in the Reference condition. Given these findings, one could speculate that students tend to rate abilities more highly in the case of constant achievement across time, than in the case of achievement changes (around the same achievement level). Although this assumption is plausible, especially in regard to developments in achievement in tasks of equal difficulty (as was the case in the present study; see also discussion about temporal comparison effects in Sect. 6.1), it would be interesting to pursue this hypothesis in more detail in future research. In particular, this is true for non-experimental studies, in which researchers have usually investigated the effects of students’ achievement changes on their self-concepts using linear rather than quadratic regressions.

6.3 Strengths and limitations

This study has a number of strengths. In particular, it has examined the joint effects of social, dimensional, and temporal comparisons, for the first time, in a longitudinal experiment, for different types of ability assessment, and in different domains. Nevertheless, the study also has some limitations.

First, in many cases the experimental manipulation was not successful. Perhaps this was because the achievement feedback, consisting of four percentile ranks, included too much information for some students. This explanation seems plausible, given that the students were particularly likely to respond in a way that was inconsistent with the experimental manipulation to the items that asked them to rate their own or their peer’s achievement in dimensional comparison (i.e., those items that required the students to relate all four percentile ranks to each other, rather than just two percentile ranks). Because of manipulation check failures, I had to exclude a substantial number of students from the analyses. Moreover, I conducted the analyses with different subgroups of students to avoid having to exclude more participants. Nonetheless, even after excluding students, this study still had acceptable power and more participants than most previous experimental studies examining comparison effects.

Second, it is possible that conducting this study in small groups and integrating the feedback about a peer’s achievement increased the salience of social comparisons. Consequently, students may have paid relatively little attention to the dimensional and temporal comparison information when making their ability assessments. However, the design of this study has the advantage of relatively high ecological validity. For example, students at school usually take achievement tests in groups (i.e., within their class) and often receive information not only about their own achievement, but also about the achievement of individual classmates (e.g., their seatmate), when their teachers give them achievement feedback.

Finally, it is worth mentioning that the students did not receive feedback on their achievement in the ability tests from T1 before T2. For ethical reasons, I refrained from leaving students alone with manipulated feedback for two weeks. Nevertheless, it is possible that temporal comparisons would have had a (stronger) impact on the formation of students’ ability beliefs if students had already received achievement feedback at T1.

7 Conclusion

This research provides additional support for the causal impact of social and dimensional comparisons on different types of ability beliefs. Yet, it remains unclear to what extent temporal comparisons are involved in the formation of students’ ability beliefs. In view of the results of this study, my assumption that previous experimental studies did not find significant temporal comparison effects because the temporal comparisons related to periods of only a few minutes no longer seems plausible. Still, further research is necessary to explain the inconsistent findings on the influence of temporal comparisons on students’ ability assessments from different (kinds of) studies. This is particularly important with respect to the development of a comprehensive comparison theory of academic self-concept formation that integrates social, dimensional, and temporal comparisons.