We conducted two user studies with email data to understand (1) the search performance of the IVF tool compared with a traditional query search interface and (2) how the tool’s features are used in realistic search scenarios.
We evaluated the search performance of the IVF tool by comparing it with a baseline system in a within-subjects design. We asked users to use both systems to find target emails in tasks of varying difficulty. The baseline system mimics a typical query search interface: the linear and categorical facets are replaced by a longer list of email snippets, while the same free-text search facility as in the IVF tool is maintained (Fig. 7).
We chose the Enron email corpus (Cohen 2015) over participants’ personal email collections so that all participants received the same set of questions and to ensure complete control of factors such as the amount of information known about the target email and the last time the email was viewed. The Enron email corpus provides an extensive collection of real emails from various users. We selected the emails received by two of Enron’s senior managerial officers, whose mailboxes were of similar size (2500 and 2142 emails). Each email set was used with only one of the two systems (the IVF tool or the baseline).
We recruited 16 participants through advertising at the university. All participants (nine females and seven males, age range: 22-42, mean age: 25.8) were university students from diverse areas, including Computer Science, Psychology, Cognitive Science, Nursing, Chemistry, and Linguistics. Half of the participants stated that they often use Gmail, while the rest mentioned OS X Mail, Outlook 365, and the university’s Webmail. None of the participants had ever heard of the Enron email collection. Each participant received two movie tickets as compensation for participation.
We evaluated each system on three types of tasks that differed in difficulty based on the amount of information provided about the emails to be found (Table 2). T1 was easy because specific, unique keywords about the email were known, and a search query with these keywords would return no more than ten emails. T2 and T3 were harder than T1 because a search query with any of the words provided would return around 30-50 emails (if any), i.e. none of the keywords uniquely identified the target email. In all tasks, all the information required to identify the correct email was visible in the email snippets (without the need to open and read the entire email).
In T1 and T2, the month and year when the email was sent were known, but in T3, a 2-month range was provided. The sender was known in T2, but in T3, two possible senders were provided. We indicated whether the mentioned sender was the first or last name of the person or whether it was the name as it appeared in the contacts list. Based on the tasks we collected in our second study, described later in this section, we considered these tasks to cover realistic email-finding scenarios while ensuring that the participants were provided enough information to find the correct email.
With the within-subjects design, we counterbalanced both the order of the two email sets and the order of the two systems, yielding four possible orderings. The order of the tasks was fixed from easy to hard. Each task level for each dataset contained two questions, the order of which was randomised.
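The counterbalancing scheme can be sketched as follows; the labels for the systems, email sets, and questions are placeholders for illustration, not the study’s actual identifiers:

```python
import itertools
import random

SYSTEMS = ["IVF", "baseline"]
EMAIL_SETS = ["set_A", "set_B"]  # the two managers' mailboxes

# Crossing the two system orders with the two email-set orders yields the
# four counterbalanced orderings; each participant is assigned one of them.
ORDERINGS = [
    list(zip(sys_order, set_order))
    for sys_order in itertools.permutations(SYSTEMS)
    for set_order in itertools.permutations(EMAIL_SETS)
]

def question_order(rng: random.Random) -> list:
    """Fixed easy-to-hard level order; the two questions per level shuffled."""
    order = []
    for level in ("T1", "T2", "T3"):
        pair = [(level, "q1"), (level, "q2")]
        rng.shuffle(pair)
        order.extend(pair)
    return order
```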
For each system, the procedure consisted of a training session, a practice session, and an actual task session. Pilot studies were conducted to ensure the viability of the procedure. The training session comprised (1) a live demonstration of the system and its search facilities and (2) three questions, ordered easy to hard, which the participants had to complete over a prepared dataset of 530 emails from the ‘inbox’ directory of Enron’s vice president, Barry Tycholiz. After training, the participants read about the Enron employee they had to impersonate and then carried out three practice questions, again ordered easy to hard. During the training and practice sessions, participants could ask the experimenter questions to resolve any confusion. The actual task session contained six questions as described earlier (two per task level, easy to hard). We set a time limit of 5 min per question. In total, the experiment took an hour on average.
The experiment for both systems was run on a 3.40 GHz quad-core PC with 16 GB RAM using a 21” monitor at 1920 × 1200 resolution. For the IVF tool, 12 emails were visible at a time in the list of email snippets, and 400 emails were visible in the timeline visualisation. For the baseline, 34 emails were visible at a time in the list of email snippets, with the possibility of seeing 400 emails through scrolling (Fig. 7).
To evaluate task performance, we measured the time taken to find the email requested by each task question and whether the correct email was successfully found (success was scored 0 if the email was not found within 5 min). Considering that the interaction the IVF tool provided was less familiar to the participants than the typical query interface, we hypothesised that for the three task types, the task performance of the IVF tool would be comparable to that of the baseline system.
For each participant, we computed the mean success and time for each task type and interface. Figure 8 shows the box plots of success and time per interface and task. As the distributions of the time and success data were not normal, we used non-parametric tests for the analysis. Wilcoxon signed-rank tests showed that the two interfaces did not exhibit statistically significant differences in task performance on any of the three task types, whereas Friedman tests indicated that task type affected task performance significantly for both interfaces (Table 3).
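This style of analysis can be illustrated with SciPy; the numbers below are fabricated for the sketch and are not the study’s data:

```python
import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare

rng = np.random.default_rng(0)
n = 16  # participants

# Fabricated per-participant mean completion times (seconds) per task type.
time_ivf = {t: rng.normal(mu, 15, n) for t, mu in [("T1", 60), ("T2", 120), ("T3", 160)]}
time_base = {t: rng.normal(mu, 15, n) for t, mu in [("T1", 62), ("T2", 118), ("T3", 163)]}

# Paired comparison between interfaces within each task type.
for t in ("T1", "T2", "T3"):
    stat, p = wilcoxon(time_ivf[t], time_base[t])
    print(f"{t}: W={stat:.1f}, p={p:.3f}")

# Effect of task type within one interface (three repeated measures).
chi2, p_task = friedmanchisquare(time_ivf["T1"], time_ivf["T2"], time_ivf["T3"])
print(f"Friedman: chi2={chi2:.2f}, p={p_task:.4f}")
```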
Additionally, we investigated the learning curve of the IVF tool by analysing whether task performance changed from T1 to T3 differently on the IVF tool than on the baseline. To do this, we calculated the time and success differences between the two interfaces on T1 and T3 and performed Wilcoxon signed-rank tests on these differences. No significant result was found, suggesting that the IVF tool did not have a steep learning curve and was easy to use.
Thus, we confirm the hypothesis and conclude that the IVF tool appeared to be easy to learn and performed comparably to the query search interface.
To understand how people use the IVF tool’s features in practical exploratory search scenarios, we conducted a user study involving realistic email search tasks. Locating a specific email in a personal email box with limited memory cues is difficult. Research shows that memories may be organised by episodes, such as the location and relative time of an event (Elsweiler et al. 2008). We conjectured that user interaction with the tool’s facets could assist users’ memories and widen search opportunities. In this study, we investigated how users interacted with the coordinated facets to locate a specific email by analysing user interaction data and, through questionnaires, users’ perception of how helpful the facets were for email finding.
We recruited 11 participants from two large universities (6 graduate students, 4 postdoctoral researchers, and 1 administrative staff member). A pre-screening questionnaire was administered to ensure that the participants performed email searches at least three times a week and that their email accounts contained more than 300 emails. Four participated in the study with their personal email accounts (which they also used for work- and study-related correspondence) and seven with their institutional accounts. All participants were fluent in English, and most of their email correspondence was in English. Each participant received two movie tickets as a reward for completing the study.
To address privacy issues related to personal data, we adopted a diary-keeping method from Elsweiler et al. (2008) to collect email search tasks. To increase the difficulty of the tasks, we carried out the email search experiment 90 days after the diary study was completed.
Diary study The participants were asked to keep an online diary for 30 days to record all instances in which they had to find an email. We defined the scope of finding as all types of search actions, including typed queries. In total, we collected 127 diary entries (4–18 entries from each participant). We excluded 26 of these entries because (1) they were repeated entries; (2) they did not refer to a certain target email, but rather described cases in which the participant wanted to make sure that there were no emails with specific information; or (3) the target email was in the sent box. This left us with 101 entries, which we then used as email search tasks on the participants’ inboxes.
Lab experiment The study started with a training session in which each participant completed 4 training tasks that involved finding emails in an unfamiliar inbox, an employee account from the Enron corpus (Cohen 2015), using various combinations of available information, such as sender, topic, co-recipient, and date. Then, we imported emails from a recent 2-year span of each participant’s inbox into the IVF tool. Participants were asked to find the emails mentioned in their diaries using a desktop computer with a 24-inch monitor. The tasks were introduced in random order and in the form of task cards. Each task card included a diary entry the participant had previously written and a questionnaire. There were no time constraints for executing the tasks. The participants indicated task completion either by opening an email and confirming it as the correct email (success) or by clicking on the “Give up” button on the screen (failure). Participants could skip a task if the task definition written in the diary was too vague or if the specific email was no longer in the participant’s inbox. An experimenter was present during the study to make sure the procedure was followed and to answer any technical questions.
During the search tasks, we logged all user interactions and the state of the interface at those moments, including the filtering criteria, the suggestions, and the number and distribution of emails in the timeline interface. To preserve privacy, our logs did not include any textual content from the participants’ accounts.
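A privacy-preserving log record of this kind might look as follows; the field names are our illustration, not the tool’s actual log schema:

```python
from dataclasses import dataclass, asdict, field
import json

@dataclass
class LogEvent:
    t: float                # seconds since session start
    action: str             # e.g. "query", "hover_month", "drag_suggestion"
    n_filters: int          # active filtering criteria
    n_suggestions: int      # suggestions shown at this moment
    n_emails_timeline: int  # emails currently in the timeline
    emails_per_month: list = field(default_factory=list)  # distribution only, no content

event = LogEvent(12.4, "hover_month", 1, 8, 213, [30, 41, 25, 117])
serialized = json.dumps(asdict(event))
```

Storing only action types, counts, and distributions keeps the log interpretable while guaranteeing that no email text leaves the participant’s account.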
To help clarify the log data gathered during each session, we administered a questionnaire at the end of each task. In the questionnaire, the participants were asked whether they found relevant contacts or keywords from the suggestions and whether the suggestions supplemented any of the information that was missing at the beginning of the search session.
In total, we obtained 73 session logs. For seven of these tasks, participants remarked that they did not remember which emails their diary entries referred to. Of the remaining 66 tasks, 58 (88%) were completed successfully, and 8 (12%) were unsuccessful.
Of the 66 tasks, 23 were query-only, i.e. they did not include any facet use. Thirty tasks were mixed sessions in which participants used queries in combination with facets; most of these sessions (20/30) started with queries to reduce the number of emails in the display before participants proceeded to use facets. Thirteen tasks did not involve any typed queries; a majority of this group (7/13) started with a timeline navigation action, such as clicking on a month, and then relied on suggestions or the timeline to select emails. Some participants were more oriented toward typed queries, whereas others relied on timeline navigation and suggestions, but the difference in query versus facet use was not statistically significant among the participants (Fig. 9). Search strategies also varied among tasks belonging to the same participant, i.e. participants adapted their strategies to different tasks.
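The three session types can be recovered from the interaction logs with a simple classifier; the action labels below are placeholders for whatever the log format actually records:

```python
def classify_session(actions: list) -> str:
    """Classify a logged session as query-only, mixed, or facet-only."""
    FACET_ACTIONS = {"hover_month", "click_month", "select_dot",
                     "use_suggestion", "drag_suggestion"}
    used_query = "query" in actions
    used_facet = any(a in FACET_ACTIONS for a in actions)
    if used_query and used_facet:
        return "mixed"
    return "query-only" if used_query else "facet-only"
```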
The timeline was used in 39 (59%) sessions (Table 4). In 17 (44%) of these sessions, specific timeline navigation actions, such as pointing to months and directly selecting dots, led to the selection of the correct email. Selecting emails through timeline navigation was rarely used in the tasks’ initial stages; it typically followed query actions or month-navigation actions. Suggestions were used in a majority of the search sessions, 38 (58%) of 66. In 23 (61%) of these sessions, suggestions led to the selection of the correct email, either directly or when used as queries.
Suggestions were generally used following data-specification actions, such as querying or navigating to a month (35/38). A possible explanation is the lack of relevant suggestions at the start of a search session. The log data indeed shows that relevant suggestions (i.e. those referring to the emails marked as correct by the users) appeared less often in the initial stages of the search than in later stages, i.e. after participants had made data-specification actions.
The timeline can provide a context to support suggestion discovery: of the 72 instances of suggestion usage, 32 (44%) were accessed through the timeline, i.e. participants accessed suggestions while hovering over time periods.
Suggestions were used more frequently for selection than for filtering. Of the 72 instances of suggestion use, suggestions were dragged to the query area for filtering only 11 (15%) times (Table 5). A Wilcoxon signed-rank test showed that using suggestions for selection was statistically significantly more frequent than using them for filtering (p = 0.037, effect size r = 0.645). In other words, participants generally preferred to keep their current context and inspect items rather than decrease the size of the item space. Additionally, we observed 13 instances of filter-swipe usage, a small portion (18%) of the 72 instances of suggestion usage.
The most commonly used suggestions were contacts. Of the 72 instances in which suggestions were used, the overwhelming majority, 63 (88%), were contact suggestions (Table 5). A Wilcoxon signed-rank test showed that contacts were used statistically significantly more often than keywords (p = 0.006, effect size r = 0.847). All 11 instances in which suggestions were used as filters involved contact suggestions. The subjective evaluation of the suggestions through the questionnaire (Table 6) shows that contact suggestions were found relevant in 35 search tasks (53%) and supplemented missing information in 13 tasks (20%); for keyword suggestions, the respective figures were 18 (27%) and 11 (17%). A Wilcoxon signed-rank test showed that participants tended to find contacts more relevant for email finding than keywords (p = 0.054, effect size r = 0.585), which echoes earlier findings on email search queries that most queries refer to people, especially senders (Harvey and Elsweiler 2012). In terms of actual utility, however, participants perceived contacts and keywords as equally supplementing missing information.
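The reported effect sizes follow the common convention r = Z/√N for the Wilcoxon signed-rank test. As a sketch, Z can be recovered from a two-sided p value via the normal approximation; the study may have computed Z directly from the test statistic, so the resulting r can differ slightly from the reported values:

```python
from math import sqrt
from scipy.stats import norm

def wilcoxon_effect_size(p_two_sided: float, n: int) -> float:
    """Effect size r = Z / sqrt(N), with Z recovered from the p value."""
    z = norm.isf(p_two_sided / 2)  # upper-tail standard-normal quantile
    return z / sqrt(n)

# With N = 11 participants and the reported p values:
r_contacts = wilcoxon_effect_size(0.006, 11)
r_selection = wilcoxon_effect_size(0.037, 11)
```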
In summary, the 66 search cases demonstrate that the IVF tool facilitates email finding. A majority (65%) involved the use of facets to guide searches. Dynamic query suggestions through timeline navigation helped participants discover relevant suggestions in context (44% of suggestion usage; DR1), and contact suggestions were used more often than keyword suggestions (p = 0.006, effect size r = 0.847). The design of using facet values to select items without filtering the item space (DR2) was favoured over using facet values as queries to filter the item space (p = 0.037, effect size r = 0.645).