1 Evaluation of UX

1.1 Previous Approach

UX (User Experience) can be defined as a “person’s perceptions and responses resulting from the use and/or anticipated use of a product, system or service” (ISO 9241-210:2010) [1]. As an extension of this definition, Kurosu [2,3,4,5] proposed that quality in design, including usability, is the basis for quality in use, and that the latter is related to UX. Thus, UX should be evaluated or measured in relation to a wider range of quality characteristics, including usability, functionality, performance, reliability, safety, attractiveness, etc.

From this perspective, the evaluation methods proposed so far are screened down to a small set here, although many so-called “UX evaluation methods” have been proposed, as listed on the All About UX website [6].

UX evaluation methods can be classified into two categories: real-time UX evaluation and memory-based UX evaluation.

Real-time UX evaluation methods, including questionnaires, can obtain information in situ and just in time. However, they are invasive: informants are asked to answer questions during their everyday life, which is sometimes a burden for them, so repeating the survey for more than a few weeks is not recommended. In other words, real-time UX evaluation methods can be applied repeatedly only over a limited time range.

In contrast, memory-based UX evaluation methods have no such temporal limitations and can be applied to UX over a long period. They do, however, have limitations that originate in the nature of human memory. People forget many events even when they were important, and they may also unintentionally edit or change what they remember. Hence, validity and reliability are important issues for memory-based methods.

Real-time evaluation methods include questionnaires, methods for evaluating emotion, and other methods. The questionnaires include SUS [7], SUPR-Q [8], Product Reaction Card [9] and AttrakDiff [10]. The evaluation methods for emotion include 3E [11] and Emo2 [12], and the other real-time evaluation methods include ESM [13] and diary methods such as DRM [14, 15] and TFD [16].

Memory-based methods include CORPUS [17], iScale [18], UX curve [19], UX graph [20] and ERM [5, 21].

1.2 ERM

ERM was proposed by Kurosu et al. [5, 21] after reflecting on the advantages and disadvantages of the memory-based methods proposed until then. As in the previous methods, informants are asked about past events (episodes) and their ratings of them. In ERM, however, no curve or graph is drawn, on the view that memory is not precise enough to be represented as coordinates on a time scale. As can be seen in Fig. 1, informants are given only 7 rough time zones: expectation, purchase, early use, use, recent use, present time and near future. Together these cover all phases of experience with an artifact (product, system or service).

Fig. 1. An example of ERM regarding university education (translated from Japanese and cited from Kurosu et al. 2018)

The time zones are defined as follows:

  • Expectation: estimation of UX before the purchase

  • Purchase: evaluation of UX at the time of purchasing or obtaining the artifact

  • Early use: evaluation of UX just after the purchase (around a few weeks to a few months)

  • Use: evaluation of UX after the early use until the recent use. This may range from a few weeks to several years depending on the time of purchase and the time of survey

  • Recent use: evaluation of UX just before the time of survey (around a few weeks to a few months)

  • Present time: evaluation of UX at the time of survey

  • Near future: expectation and/or estimation of UX after the time of survey

Although ERM uses a letter-sized sheet of paper and the number of rows is limited, informants are allowed to write more than one episode in a row. Every episode should be accompanied by a rating of the feeling, from positive (+10) to negative (–10), i.e. a 21-point scale is used.
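The structure of an ERM sheet described above can be sketched as a small data model. This is only an illustrative sketch, not part of the original method description: the names `TIME_ZONES`, `Episode` and `NUM_SCALE_POINTS` are assumptions introduced here, and the sample episode text is invented.

```python
from dataclasses import dataclass

# The seven rough time zones of ERM, in chronological order.
TIME_ZONES = (
    "expectation", "purchase", "early use", "use",
    "recent use", "present time", "near future",
)

# Ratings run from -10 (negative) through 0 to +10 (positive).
RATING_MIN, RATING_MAX = -10, 10
NUM_SCALE_POINTS = RATING_MAX - RATING_MIN + 1  # 21-point scale


@dataclass(frozen=True)
class Episode:
    """One episode written in a time-zone row, with its integer rating."""
    time_zone: str
    text: str
    rating: int

    def __post_init__(self):
        if self.time_zone not in TIME_ZONES:
            raise ValueError(f"unknown time zone: {self.time_zone!r}")
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(self.rating, bool) or not isinstance(self.rating, int):
            raise ValueError("rating must be an integer")
        if not RATING_MIN <= self.rating <= RATING_MAX:
            raise ValueError("rating must lie between -10 and +10")


# Hypothetical example: several episodes may share one time zone.
sheet = [
    Episode("early use", "The screen was easy to read", 7),
    Episode("early use", "The device felt heavy", -3),
]
```

Because several episodes may be written in one row, the sheet is modelled as a flat list of episodes rather than as one entry per time zone.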

1.3 Reliability of ERM

As with psychological tests, UX evaluation methods should possess a certain level of validity and reliability. Validity, especially content validity, should pose no problem, because informants are asked about their own experience of using an artifact that they have been using. Reliability, however, has not yet been confirmed, and there seems to be no research that has investigated this issue. This is why this paper deals with the reliability of the memory-based UX evaluation method.

2 Verifying the Reliability of ERM

2.1 Method

The basic idea for verifying the reliability of ERM is to compare the results of two surveys of the same group of informants, i.e. the re-test method. Fortunately, I was teaching a class at the graduate school and decided to ask the students to take part in the survey twice, once at the first lecture and once at the last lecture of the semester. The first survey was conducted on Sep. 25, 2017 and the second on Jan. 22, 2018, so there were 119 days between the two surveys. It would have been almost impossible for the informants to remember at the second survey what they had answered at the first.
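The interval between the two surveys can be checked with a short calculation. This is a minimal sketch using only the two dates given above; the variable names are chosen here for illustration.

```python
from datetime import date

first_survey = date(2017, 9, 25)
second_survey = date(2018, 1, 22)

# Subtracting two dates yields a timedelta; .days is the whole-day gap.
interval_days = (second_survey - first_survey).days
print(interval_days)
```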

Informants

There were 26 students registered for my class, and the attendance at the lectures on Sep. 25, 2017 and Jan. 22, 2018 is shown in Table 1. Because of absences at each lecture, a total of 23 students attended at least one of the lectures and, among them, 17 students attended both, resulting in 17 data sets available for the analysis. Unfortunately, it became clear during the analysis that 3 of them had purchased a new model, and they were excluded from the analysis that followed. In the end, 14 data sets were used in the analysis. Regrettably, all the students were male, perhaps because my class was offered in the engineering department.

Table 1. Attendees of the class on Sep. 25, 2017 and Jan. 22, 2018

Targeted Device

Because all of the students were using a smartphone (not a cellphone), it was chosen as the targeted device. Because it is a multi-purpose device that almost all users use daily, and indeed many times a day, its users are likely to have had various experiences ranging from positive to negative.

Procedure

An ERM sheet was distributed to each informant, a brief instruction was given, and 30 min were allowed for writing down their experiences. During the instruction, informants were told the following:

  • This is to ask you to write down your personal experience with your smartphone

  • First, write down your university ID, sex, age, and description of the smartphone

  • There are seven time-zones including the expectation, purchase … near future (with the explanation of each time range)

  • You are requested to write down what you experienced in the episode slot and, in the rating slot, the degree of satisfaction vs. dissatisfaction or positive vs. negative feeling from +10 through 0 to –10 according to your subjective impression. Please use integers

  • You may proceed from the expectation to the near future, but which time zone you write about first is up to you. You can go back to a time zone that you have already filled in

  • Although the number of slots is limited, you can write two or more episodes and ratings in one slot if you need more than one

  • If you don’t remember what you experienced at any time zone, you can skip it

Obtained Data

The handwritten ERM sheets were collected, entered into Excel, and then translated into English. The Appendix shows all the raw data.

3 Results

3.1 Rating Results

Rating values correspond to the vertical position of the curve/graph in CORPUS, iScale and UX curve. Unlike those methods, however, ERM distinguishes only rough time zones, which correspond to the quasi-continuous horizontal position in them.

One point that calls for caution when using the time zones for reliability verification is that each time zone means something different in the first and second surveys. As shown in the imaginary data of Fig. 2, each time zone shifts slightly according to when the survey was conducted. For the imaginary informant in this figure, the purchase of the smartphone naturally occurred at the same time in both surveys, but the subsequent time zones represent slightly displaced physical times depending on the survey date. For example, the present time at the first survey was Sep. 2017, while the present time at the second survey was Jan. 2018. This displacement becomes larger as the span between the purchase and the survey becomes shorter.

Fig. 2. Chronological table showing the difference in the meaning of the time zones. In this imaginary data, the same informant purchased the smartphone in early 2017, but the time zones from early use to near future are displaced between the first survey and the second survey depending on when the survey was conducted (2017 and 2018).
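One possible reading of the displacement discussed above is that the fixed gap between the surveys weighs more heavily for informants who purchased their device recently. The sketch below illustrates this reading only; the function `relative_shift` and the purchase dates are hypothetical and do not come from the survey data.

```python
from datetime import date

FIRST_SURVEY = date(2017, 9, 25)
SECOND_SURVEY = date(2018, 1, 22)


def relative_shift(purchase: date) -> float:
    """Survey gap as a fraction of the ownership period at the first survey.

    This ratio is an assumed way of quantifying the displacement,
    not a measure defined in the paper.
    """
    owned_days = (FIRST_SURVEY - purchase).days
    if owned_days <= 0:
        raise ValueError("purchase must precede the first survey")
    gap_days = (SECOND_SURVEY - FIRST_SURVEY).days
    return gap_days / owned_days


# Hypothetical purchase dates: one recent, one several years earlier.
recent_buyer = relative_shift(date(2017, 3, 1))
long_term_user = relative_shift(date(2015, 3, 1))
print(recent_buyer > long_term_user)
```

Under this reading, the same gap between surveys covers a larger share of a recent buyer's time with the device, so the time zones after purchase shift relatively more for such an informant.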

Fortunately, the informants who purchased their smartphone in 2017 (informants C, F, H, I, L and M) showed similar ups and downs in their rating values in the first and second surveys, as can be seen in the following sections. This may mean that the time zones were rough rather than exact for the informants. By the same reasoning, the horizontal positions in the curves/graphs of CORPUS, iScale and UX curve are not exact positions on an equal-interval time scale and should therefore be called quasi-continuous.

3.2 Reliability Measure

Usually, reliability (ρ) is represented by a correlation coefficient (r); Spearman’s rank correlation coefficient was used here. In this study, Kendall’s coefficient of concordance (W) was also calculated. Both values were calculated for each informant from the average ratings for the 7 time zones. The distribution of r is shown in Fig. 3 and that of W is shown in Fig. 4. Both graphs show rather high reliability.

Fig. 3. Distribution of Spearman’s correlation coefficients for 14 data.

Fig. 4. Distribution of Kendall’s coefficient of concordance for 14 data.
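The two reliability measures of Sect. 3.2 can be sketched for a single informant as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the study: the ratings are invented, every time zone is assumed to be filled in both surveys, tied values receive average ranks, and W is computed without a tie correction. All function names are introduced here for illustration.

```python
from statistics import mean

TIME_ZONES = (
    "expectation", "purchase", "early use", "use",
    "recent use", "present time", "near future",
)


def zone_means(sheet):
    """Average the ratings written in each of the 7 time zones."""
    return [mean(sheet[zone]) for zone in TIME_ZONES]


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendalls_w(*series):
    """Kendall's W for m series of n items each (no tie correction)."""
    m, n = len(series), len(series[0])
    rank_rows = [average_ranks(s) for s in series]
    totals = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Invented ratings for one informant in the two surveys.
first_sheet = {"expectation": [8], "purchase": [6], "early use": [6, 8],
               "use": [5], "recent use": [2, 4], "present time": [4],
               "near future": [2]}
second_sheet = {"expectation": [7], "purchase": [8], "early use": [6],
                "use": [4], "recent use": [5], "present time": [3],
                "near future": [1]}

rho = spearman(zone_means(first_sheet), zone_means(second_sheet))
w = kendalls_w(zone_means(first_sheet), zone_means(second_sheet))
print(rho, w)
```

With only two surveys and no ties, the two measures carry the same information, since W = (r + 1) / 2 in that case; reporting both mainly changes the scale on which agreement is read.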

3.3 Episode Results

Episodes are verbal in nature and are therefore analyzed one by one in the next section.

4 Analysis of Each Data

An ID was randomly assigned to each informant. Each episode is identified as <informant ID><episode number>-<year>, e.g. A4-2017. Please refer to the Appendix.
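The episode ID format can be read mechanically, as in the following sketch. It assumes single-letter informant IDs and that labels joined with "&" (as used later in this section, e.g. G3&G4-2018) share one year; the names `EpisodeRef` and `parse_episode_refs` are introduced here for illustration.

```python
import re
from typing import NamedTuple


class EpisodeRef(NamedTuple):
    informant: str
    episode: int
    year: int


# One informant letter followed by an episode number, e.g. "A4".
_PART = re.compile(r"([A-Z])(\d+)")


def parse_episode_refs(label):
    """Split a label such as 'A4-2017' or 'G3&G4-2018' into references."""
    ids, sep, year = label.rpartition("-")
    if not sep or not year.isdigit():
        raise ValueError(f"missing year in label: {label!r}")
    refs = []
    for part in ids.split("&"):
        match = _PART.fullmatch(part)
        if match is None:
            raise ValueError(f"malformed episode ID: {part!r}")
        refs.append(EpisodeRef(match.group(1), int(match.group(2)), int(year)))
    return refs


print(parse_episode_refs("A4-2017"))
print(parse_episode_refs("G3&G4-2018"))
```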

Informant A generally gave positive ratings, except for the size of the device (A12-2017 and A13-2018) and its weight (A12-2017 and A8-2018), both of which were rated negatively (Fig. 5).

Fig. 5. Ratings by informant A

Informant B generally gave positive ratings, especially in 2017. However, he gave negative ratings for the future (B10-2017, B10-2018), on the grounds that the device may not be able to cope with new applications (Fig. 6).

Fig. 6. Ratings by informant B

Informant C gave inconsistent ratings to the same aspect of the device, namely that its specification is almost the same as that of his previous cellphone: one rating was negative and the other positive (C2-2017, C2-2018) (Fig. 7).

Fig. 7. Ratings by informant C

At the beginning of use, informant D had a negative impression of fingerprint traces on the touch panel (D2-2017, D4-2018), but his evaluation gradually became positive with use (Fig. 8).

Fig. 8. Ratings by informant D

Informant E showed a drastic change in ratings, which were quite high in the early days and turned negative during use with regard to battery life (E9-2017, E9-2018) and the Wi-Fi connection (E10-2017, E10-2018) (Fig. 9).

Fig. 9. Ratings by informant E

Informant F wrote different episodes in 2017 and 2018. However, some episodes are common to both, such as the speed (F4-2017, F5-2018), the quality of photographs (F7-2017, F6-2018), and the convenience of the second screen (F11-2017, F8-2018) (Fig. 10).

Fig. 10. Ratings by informant F

Informant G wrote positively about the joy of accessing the internet (G1-2017, G1-2018) and of using online content (G5-2017, G3&G4-2018) (Fig. 11).

Fig. 11. Ratings by informant G

Informant H wrote about his high expectation (H1-2017, H1-2018) and the screen quality (H2-2017, H2-2018). His ratings are generally higher in 2018 (Fig. 12).

Fig. 12. Ratings by informant H

Informant I gave no negative ratings. He gave positive evaluations to the processing speed in the expectation (I1-2017, I1-2018) and early use (I3-2017, I3-2018) time zones. Strangely, he rated the fast battery drain positively (I10-2017, I9-2018); he may have misunderstood the instruction (Fig. 13).

Fig. 13. Ratings by informant I

Informant J reported different expectations: a negative one for poor operability (J1-2017) and a positive one for good performance (J1-2018). This informant did not give consistent episodes except for the present-time evaluation (J12-2017, J12-2018) (Fig. 14).

Fig. 14. Ratings by informant J

Informant K generally gave positive ratings, except for recent episodes on fingerprint recognition (K9-2017, K9-2018) and an unexpected breakdown (K10-2017, K10-2018) (Fig. 15).

Fig. 15. Ratings by informant K

Informant L gave negative ratings only recently, for the lack of storage (L9-2017, L9-2018). Another negative evaluation differed between the surveys: one was given to heat (L10-2017) and the other to a system freeze (L10-2018) (Fig. 16).

Fig. 16. Ratings by informant L

Informant M complained in both surveys about the same gyro sensor problem that occurred during early use (M3-2017, M3-2018). He still pointed out that problem at the present time (M7-2017, M7-2018) (Fig. 17).

Fig. 17. Ratings by informant M

Informant N commented on the usability of the plastic case cover (N2&N3&N4&N9&N10-2017, N3&N9-2018), generally somewhat negatively (Fig. 18).

Fig. 18. Ratings by informant N

5 Conclusion

The reliability of ERM was tested for the smartphone using the re-test method. Two surveys of the same 14 informants, who continued to use the same model, were conducted, one in Sep. 2017 and the other in Jan. 2018. Using ERM, episodes and subjective ratings were obtained for 7 time zones: expectation, purchase, early use, use, recent use, present time and near future. Two reliability measures (the correlation coefficient and the coefficient of concordance) were calculated, and relatively high reliability was confirmed.

In the content analysis for each informant, the same episodes were found in around the same time zones and were rated in the same way. This also supports the high reliability of ERM.