1 Introduction

The batch-style evaluation approach integrated with one or two unified evaluation metrics (e.g. precision, recall, normalized discounted cumulative gain) has been widely applied in a large body of ad hoc retrieval tasks and evaluation experiments (Chen et al., 2017; Harman, 2011). While employing one unified measure across different queries may facilitate the comparison of system performance in varying topics and search contexts, it may undermine the effectiveness of evaluation experiments in capturing users’ actual search experiences, especially in prolonged, interactive search sessions (Cole et al., 2009). When engaging in complex search tasks that involve ill-defined, ambiguous goals, users often go through varying cognitive states, seek to fulfill different search intentions, and thereby evaluate system performances differently under varying queries (Liu et al., 2020; Sarkar et al., 2020). Under these circumstances, intelligent search systems will need to adaptively adjust the evaluation and re-ranking strategies to better respond to users’ changing information needs and search obstacles under overarching motivating tasks.

To achieve this, researchers need to deploy and meta-evaluate state-aware evaluation metrics that can reliably connect the evaluation of search system performance with users’ actual experiences in search interactions and partially address the limitations of traditional offline evaluation procedures (Liu, 2022). Furthermore, it is critical to design and implement new evaluation metrics that can achieve better performance than existing metrics in capturing search satisfaction levels under specific states and search scenarios. The knowledge and techniques learned through state-aware evaluation research and practice will allow researchers to better capture the nuances hidden in the cognitive process behind IR and to develop more fine-grained user models to support adaptive ranking and search recommendations.

To address the research gap discussed above, our study sought to predict the varying task states that a user goes through in different complex search tasks based on observable search signals and to identify appropriate evaluation metrics that best reflect user satisfaction under each state. Taking a step forward, we also explored and meta-evaluated new evaluation metrics that could better reflect in-situ user satisfaction than all existing measures. To obtain robust and potentially generalizable evaluation results, our meta-evaluation experiments were conducted based on datasets collected from both controlled lab and naturalistic search settings.

Going beyond traditional multi-system evaluation setups built upon one or two cross-session unified metrics, our study connects IR evaluation to dynamic task states and makes the following contributions:

  • Our study demonstrates the relationships between in-situ user satisfaction and different evaluation metrics, and shows that the best-performing metrics vary across different task states within individual task-based search sessions.

  • Based on observable search behavior and textual features that can be collected from the backend, we developed machine learning-based classifiers that can predict users’ task states during search sessions. These predictive models can serve as the basis for adaptive and even proactive search recommendations and evaluations.

  • In addition to comparing and evaluating existing metrics under different states, we also developed new evaluation metrics that can outperform all current metrics under certain task states. Our new metrics could be replicated and reused in a wider range of search evaluation scenarios and contribute to the enhancement of human-centered IR evaluations.

2 Related work

Many evaluation metrics have been proposed and widely used over the years. These metrics were built under various assumptions and assessment goals, such as search result relevance, user-perceived usefulness, and user satisfaction. Based on the information needed to compute them, existing metrics can be categorized into (1) online metrics, which can be computed from system log files containing user interaction records, and (2) offline metrics, which rely on external knowledge such as human annotations. In this paper, we investigate a wide range of common evaluation metrics that can be computed on our experimental datasets. The full list of metrics is presented in Sect. 5. In this section, we give an overview of existing work investigating the relationship between common evaluation metrics, user satisfaction, and search states.

2.1 Task states in interactive search sessions

In contrast to the simulated scenarios of ad hoc retrieval tasks, users engaged in complex search tasks often experience transitions between task states and aim to fulfill different subgoals or intentions at different moments of a search session (Liu et al., 2020; Rha et al., 2016). Mitsui et al. (2016) examined users' search intentions associated with different queries in the same task and developed behavior-based prediction models for identifying users' intentions in real time. Chen et al. (2021) found that user reformulation is closely related to user intent and incorporated this knowledge into click-based metrics, improving the correlation with user satisfaction. Vuong et al. (2019) introduced a categorization of queries by intention, task goal, and task substance. Similarly, Järvelin et al. (2015) examined different types of tasks and suggested that task type should be taken into account in IR evaluation. Borlund (2016) investigated which type of information is needed in which type of task. Liu et al. (2019b) developed a multilevel model of task-based information seeking and found that users' search tactics and document judgments vary significantly across different intentions and task types. There is also research investigating the use of user models to improve the correlation with user satisfaction (Moffat et al., 2022; Wicaksono & Moffat, 2020, 2021; Zhang et al., 2020b). Researchers have also extracted four task states, i.e. exploration, exploitation, known-item, and evaluation, from participants' in-situ intention annotations, studied the transition patterns between different task states under complex tasks of different types (Liu et al., 2020), and developed state-aware search path recommendation algorithms that improve the efficiency of search interactions in finding useful information (Liu & Shah, 2022). Urgo and Arguello (2022) applied a state-based approach to investigate the Search as Learning (SAL) process (apply, evaluate and create) and characterized the transitions between different knowledge types (factual, conceptual and procedural) during search. In addition to user-centered task modeling and evaluation, researchers have also adopted a similar state-based approach in offline simulation-based studies and demonstrated the value of leveraging task state information in improving relevance-based ranking performance (Luo et al., 2014).

Since users typically go through multiple task states and intentions during the same search session (Liu et al., 2020; Ruotsalo et al., 2014), the evaluation of search systems should also be adaptive and customized based on the nature of local task states rather than relying on one or two unified measures across all search queries (e.g., nDCG, Reciprocal Rank) (Liu & Han, 2022). It is also unclear how and to what extent users’ criteria and thresholds of usefulness and satisfaction vary across states. To address these gaps, our study seeks to investigate heterogeneity across task states and construct state-aware evaluation metrics that best reflect users’ in-situ levels of search satisfaction under each state, rather than simply optimizing predefined document relevance metrics. The implementation and meta-evaluation of adaptive evaluation measures will also facilitate the development and evaluation of personalized IR systems and search recommendations.

2.2 Understanding and measuring user satisfaction

User satisfaction has been described in many papers as the gold standard for evaluating the quality of search results (Chen et al., 2017; Jiang et al., 2015; Zhang et al., 2018, 2020c). Many studies have investigated the factors that affect user satisfaction. For example, Jiang et al. (2015) concluded from their study that satisfaction can best be explained as the value of the search outcome relative to the degree of search effort. Liu et al. (2015) investigated whether there is a difference between assessors' and users' judgments of satisfaction and found that the two are moderately correlated. Liu et al. (2018) investigated the differences between user satisfaction and search success for complex search queries; their experiments indicate a high discrepancy between the two. These previous studies demonstrate that user satisfaction can be influenced and reflected by many user and system features.

In attempts to measure user satisfaction, existing work has studied the relationships between various metrics and user satisfaction. Chen et al. (2017) conducted a meta-evaluation of a set of existing online and offline metrics on datasets collected from task-based lab studies to study how they correlate with user satisfaction. They found that offline metrics are better aligned with user satisfaction in homogeneous search, while online metrics perform better when vertical results are federated. Zhang et al. (2020a) found that task difficulty influences the correlation between metrics and satisfaction, showing that how well existing metrics reflect satisfaction varies by task type. Chuklin and de Rijke (2016) developed the CAS model, which combines user clicks and attention behavior on a SERP to capture user satisfaction. Mao et al. (2016) attempted to use expert-annotated usefulness to measure user satisfaction and found that usefulness is strongly correlated with user satisfaction. This observation was also confirmed by Liu et al. (2019b). However, usefulness annotations are query dependent and subjective; annotations of the same resource cannot be generalized to other queries or users, so usefulness is not an efficient metric for measuring satisfaction.

Despite efforts to measure user satisfaction through other metrics, existing work has not succeeded in finding an effective and efficient approach. Based on our investigation of existing work and datasets, we identified the following potential reasons: previous work that attempted to measure user satisfaction did not consider the impact of task states, and evaluation metrics created by combining existing metrics were fitted to a set of homogeneous data and investigated only a small set of features. Therefore, in this work, we investigate a larger set of features and take task state into account when analyzing existing metrics and fitting new evaluation metrics. In addition, we experiment on both lab study data and field study data to observe the impact of different data collection setups.

3 Task definition and research questions

In this paper, we consider the search task state for each individual query activity, which is defined by the sequence of a user’s actions starting with querying the Web, followed by browsing the search results, browsing the clicked Web resources, and clicking and scrolling activities.

3.1 Evaluation metrics

To ensure a fair and thorough analysis, we researched the most commonly used evaluation metrics for general IR systems. Based on the availability of the data needed to calculate each metric throughout the search process, we grouped them into 3 categories as follows:

  • Query-based metrics: Metrics that can be calculated immediately after a search query is executed.

  • Online metrics: Metrics that can be calculated based on system log files that record user interactions (e.g. mouse movements, clicks, timestamps of interactions, etc.). Query-based metrics are a subset of online metrics.

  • Offline metrics: Metrics that rely on external knowledge, such as annotations based on human judgement, e.g., the relevance score of web documents in a search results list.

The full list of metrics we analyze in this paper and their descriptions are given in Sect. 5.

3.2 Task states

Among the taxonomies discussed in Sect. 2.1, we found the taxonomy proposed by Liu et al. (2020) to be the most appropriate for the analysis in this paper, as it has been conceptually developed and empirically validated with both external labeling and clustering results under complex search tasks involving prolonged search sessions and covering task states and user intentions of varying complexity at different moments of search. Also, compared to existing taxonomies, Liu et al. (2020)’s task state taxonomy achieves a better balance between capturing the nuances of user intentions and being practically useful in participants’ annotations. The taxonomy does not involve overly abstract or broad categories (cf. informational queries in Broder’s taxonomy (Broder, 2002)) and distinguishes different search focuses or task states (e.g., exploring a new topic or domain versus evaluating collected information items) without requiring a detailed, cognitively challenging annotation process (cf. Rha et al. (2016)’s taxonomy).

Based on the labels and clustering results from two controlled lab studies, Liu et al. (2020) found that a querying activity can be assigned to one of the following 4 states:

  • Exploration state: The user wants to explore an unknown topic and issues short, general queries (e.g., “sports activities”).

  • Exploitation state: The user has a clear topic in mind, follows the current search path, and looks for different pages that might provide relevant information (e.g., “Football pitch nearby”).

  • Known-item state: The user knows exactly what the goal is and is looking for a specific page or piece of information (e.g., “Location of the Football pitch in Friesdorf”).

  • Learn and evaluate state: In the fourth state, the user does not only passively absorb information, but also wants to evaluate search results or expand their knowledge. As in the known-item state, the user is looking for specific information (e.g., “Difference between the soccer fields in Friesdorf and Bornheim”).

3.3 User satisfaction

In this work, the level of user satisfaction refers to the extent to which a system informationally satisfies a user’s search goal(s) under the associated task. The satisfaction scores we use in this work are annotated by users directly.

We aim to investigate state-aware evaluation metrics for search systems in terms of user satisfaction. We approach the problem by answering the following research questions:

RQ1: To what extent do existing evaluation metrics reflect user satisfaction under varying task states?

RQ2: Can we detect the task state of a query activity using in-session signals that can be collected automatically during a search session, without explicit feedback and labels?

RQ3: Can we construct new evaluation metrics that better reflect user satisfaction under a specific task state?

4 Experimental data

4.1 Datasets

In order to study the characteristics of more diverse search sessions and to improve the generalizability of the results of this work, we consider two datasets collected under different setups: one from a field study (TianGong) and one from a lab study (KDD19).

4.1.1 TianGong

This dataset was published by Zhang et al. (2020c). The authors conducted a field study that lasted one month with 30 participants (13 females and 17 males) aged 18 to 41. The participants installed a browser extension to track their search activities and rated their satisfaction with the search result for each search query on a 5-point Likert scale, with 0 for dissatisfied and 4 for very satisfied. After the study, nine external annotators rated the relevance of the documents with respect to the corresponding query on a scale of 0–3, where 0 means a document is irrelevant and 3 means a document is very relevant. We use TianGong to refer to this dataset in the rest of this paper for simplicity.

The TianGong dataset contains 3875 queries. Each query is associated with logs containing the query string, the corresponding SERP, mouse movements (clicks and scrolls), switching between SERP and other browsed pages, and the corresponding timestamps or dwell times of the above activities. On average, 55 actions are recorded in the search logs for each query.

4.1.2 KDD19

This dataset was collected in a laboratory study and was published by Liu et al. (2019b). Fifty undergraduate students (24 female, 26 male) aged 18 to 27 were recruited from the campus. All participants were familiar with the basic use of web search engines and used them on a daily basis. Nine search tasks were given to each participant. Similar to the TianGong dataset, participants rated their satisfaction with the search result corresponding to each search query on a 5-point Likert scale, with 1 being dissatisfied and 5 being very satisfied. To obtain the relevance ratings for each document, the authors used a crowdsourcing platform. Each crowd worker was given a “query-document” pair and asked to assign a relevance score (0–3) to the document: 0 if the document is not relevant or a spam webpage, 1 if there is only a small amount of information in the document related to the query, 2 if there is important information related to the query in the document, and 3 if the document should be a top result in the SERP because its content is dedicated to the query. A total of 1548 queries with search logs were recorded. On average, there are 188 actions per query. We use KDD to refer to this dataset in the rest of this paper for simplicity.

The distribution of the annotated search satisfaction and document relevance of the two datasets is shown in Fig. 1. The difference in the study setup can potentially explain the difference in the relevance and satisfaction distribution of the two datasets, i.e. the TianGong dataset has a higher percentage of irrelevant documents while having higher satisfaction. In the field study, where the tasks are not clearly defined, users are likely to be satisfied if some relevant information can be found with the self-formulated queries, while in the lab study, with a clear task in mind and queries that can be extracted from the task description, more relevant web resources are recalled, but users are not satisfied as long as the current task goal is not completed.

Fig. 1 Distribution of user satisfaction (query level) and document relevance (document level) based on existing annotations in the TianGong and KDD19 datasets

4.2 Task state annotation

The task state annotations for the KDD dataset were published by Liu and Yu (2021). We applied the same coding frame used in their paper to the TianGong dataset. More specifically, the task states used in the annotation task are: (1) Exploration state—users explore unknown topics and seek to open new search paths; (2) Exploitation state—users may have a clear topic in mind and try to follow the current search path and continue to exploit the information patch at hand; (3) Known-item state—users know exactly what item(s) they are looking for; queries tend to be very specific, and the target item(s) are usually obvious in the queries and the first documents visited; (4) Learning and evaluation state—users try to evaluate, extract, and synthesize useful knowledge from retrieved documents and pages; in this state, they tend to issue long, specific queries involving multiple subtopics and items, and move between and compare multiple documents. Two annotators annotated a subset of the data together in three rounds (100 unique queries in each round), discussing and resolving disagreements after each round. In both the second and third rounds of annotation, the agreement between the two annotators was above 70% and Cohen's Kappa was above 0.559. One of the annotators then finished the annotation for the rest of the dataset. The distributions of the task state labels in the two datasets are shown in Fig. 2.
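As an illustration, the round-wise agreement check can be reproduced with a few lines of Python; the label lists below are hypothetical placeholders, not the actual annotation data.

    # Minimal sketch of the per-round agreement check; the two label lists
    # are hypothetical placeholders, not the actual annotation data.
    from sklearn.metrics import cohen_kappa_score

    annotator_a = ["exploration", "exploitation", "known-item", "exploitation"]
    annotator_b = ["exploration", "exploitation", "exploitation", "exploitation"]

    raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.3f}")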

Fig. 2 Task states distribution in the TianGong and KDD datasets

We found that the distribution of task states is unbalanced in both datasets. In the TianGong dataset, the exploitation state has the highest number of queries. In the KDD dataset, there are more known-item searches than queries in other states, which could be due to the setup of the lab study, where search tasks are given and the goals are therefore more straightforward than in natural search sessions. On the TianGong dataset we can make a similar observation to the one Liu and Yu (2021) made on the KDD dataset: the last two states are hardly distinguishable based on the queries. Therefore, we use the same approach to merge the known-item and evaluation states in the experiments, and refer to the merged state as known-item in later sections. Finally, we obtain the ground truth labels of task states for the two datasets, with 301, 514, and 733 queries in the KDD dataset and 356, 2266, and 1252 queries in the TianGong dataset for the exploration, exploitation, and known-item states, respectively.

5 Analysis of existing evaluation metrics

In this section, we investigate RQ1 using the annotated data described in Sect. 4 to see how well existing evaluation metrics reflect user satisfaction and to explore whether there are differences across task states.

5.1 Evaluation metrics

As described in Sect. 3, we group the metrics into three categories according to the availability of the information needed for their computation. The query-based category includes features that can be computed immediately after a user's query; the considered metrics and their descriptions are shown in Table 1. The list of online metrics that are extracted based on information in the search system log is shown in Table 2. Offline metrics are listed in Table 3. Note that we focus on query-level evaluation, so all features are computed for each individual query. For query-based features, the terms that appeared in previous queries and the position of the query in the session are considered as contextual information when computing the features of the current query.

Table 1 Query-based metrics
Table 2 Online metrics
Table 3 Offline metrics

5.2 Correlation analysis

To understand the relationship between the evaluation metrics introduced in Sect. 5.1, we compute the Pearson correlation between the user satisfaction score and each individual metric on both datasets for each task state. The results are shown in Table 4.
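As an illustration of this procedure, the following minimal sketch computes per-state Pearson correlations with pandas and SciPy; the file name and column names are hypothetical placeholders for our query-level data.

    # Minimal sketch of the per-state correlation analysis; the DataFrame layout
    # (one row per query) and the column names are assumptions.
    import pandas as pd
    from scipy.stats import pearsonr

    df = pd.read_csv("query_level_metrics.csv")  # hypothetical file: one row per query
    metric_cols = [c for c in df.columns if c not in ("query_id", "task_state", "satisfaction")]

    rows = []
    for state, group in df.groupby("task_state"):
        for metric in metric_cols:
            r, p = pearsonr(group[metric], group["satisfaction"])
            rows.append({"state": state, "metric": metric, "pearson_r": r, "p_value": p})

    corr_table = pd.DataFrame(rows).sort_values(["state", "pearson_r"], ascending=[True, False])
    print(corr_table.groupby("state").head(3))  # top-3 metrics per task state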

Table 4 Correlation analysis results

5.2.1 Results on TianGong dataset

The results indicate that the metrics perform differently under different task states; no single metric achieves the highest correlation across all task states. The gap between different metrics is quite large, with the highest correlations being 0.422, 0.265, and 0.267, and the median correlations being 0.252, 0.192, and 0.194 for the exploration, exploitation, and known-item states, respectively. In terms of the metrics that have the highest correlation with user satisfaction in each task state, MaxR achieved the highest scores in the exploitation and known-item states, as in these two states users have a clearer goal in mind and are likely to be satisfied by the most relevant result. In the exploration state, by contrast, users consider more search results to get a better overview of the topic; therefore, RBP- and DCG-based metrics achieve the highest correlation, as they consider the relevance and rankings of multiple results.

The offline metrics generally have a higher correlation with user satisfaction in the exploration state than in the other task states. For example, CG@3 has a correlation of 0.405 in the exploration state and only 0.232 and 0.233 for the other two task states. This is probably because users in the exploration state have less prior knowledge about the topic and are satisfied if documents of general relevance are returned. On the other hand, a user in the known-item state knows exactly what they are looking for. For example, if a user is only looking for a specific formula, most search results are likely to have high relevance scores, so relevance has less impact on user satisfaction, and other factors such as efficiency and the quality of the web resource may be more important.

There is no clear pattern for online and query-based metrics. For example, PLC has the highest correlation with user satisfaction in the known-item state, while SessionEnd has the highest correlation in the exploration state. PLC is higher when the user reaches the desired information with fewer clicks on the top results. A user in the exploitation and known-item states has a clearer idea of the goal and is likely to be more satisfied if a result is found quickly. In the exploration state, the user does not have a clear goal and therefore has to try different queries until a document satisfies the information need and the end of the session is reached. There are negative correlations among the online metrics but only positive correlations among the offline metrics: the offline metrics are computed from relevance annotations, so the more relevant the documents in the search result, the higher the offline metrics and the more satisfied the user is, whereas the intuitions behind the online metrics are more varied across metrics.

5.2.2 Results on KDD dataset

Similar to the TianGong dataset, we can conclude that how well a metric reflects user satisfaction depends on the task state. The gap between the correlations of different metrics is also large, with the highest correlations being 0.529, 0.559, and 0.385, and the median correlations being 0.325, 0.343, and 0.219, for the exploration, exploitation, and known-item states, respectively. However, with respect to the best metric for measuring user satisfaction in different task states, the results differ from the TianGong dataset. The highest correlation in the exploration state is achieved by MaxR, while in the exploitation and known-item states it is achieved by RBP (x=0.8). Observing both datasets, we believe this may be caused by the different study setups. The KDD dataset was collected from a lab study in which the search goals are given in the task description, so the initial challenge is to formulate an appropriate query rather than to explore the different aspects of a topic. In this case, a highly relevant result in the exploration state that helps formulate the next query would satisfy the user's search intent, which explains why MaxR has a higher correlation than metrics that consider more search results. Since the given tasks usually have more than one sub-goal, users consider several results to cover all the information needs in the exploitation and known-item states, resulting in the RBP metric having higher correlations with user satisfaction than MaxR.

The correlation between online metrics and user satisfaction is overall higher on the KDD dataset than on the TianGong dataset. As mentioned in Sect. 4, users in the lab study exert more effort per query than in the field study, resulting in more user interactions. This may make the online metrics more informative on the KDD dataset. MinRR even outperforms some offline metrics and achieves the highest correlation among all online metrics in the exploration and exploitation states.

5.2.3 Implications

With the results of the correlation analysis, we can answer RQ1. First, different existing evaluation metrics reflect user satisfaction to different extents under the same task state. Taking the exploration state as an example, Table 4 shows that on the KDD dataset, the offline metric MaxR has a correlation of 0.529 with user satisfaction, while the online metric ActionCount has a correlation of only 0.121. Second, the same metric reflects user satisfaction differently under different task states. For example, while the SessionEnd metric achieves a correlation of 0.252 in the exploration state on the TianGong dataset, it has a correlation of only 0.084 and 0.087 in the exploitation and known-item states, respectively. Another example can be found in the KDD dataset, where the correlation of the MaxRR metric drops from 0.432 in the exploration state to 0.356 in the exploitation state and only 0.263 in the known-item state. Meanwhile, we found stronger correlations with user satisfaction for offline metrics than for online metrics. We also see substantial differences in the correlation of metrics across datasets. This suggests that the way the search task is set up has a strong impact on how users search and evaluate search results, and therefore conclusions drawn from laboratory studies alone may be biased when applied to real-world scenarios. Overall, despite the fact that some metrics achieve moderate correlations with user satisfaction, there is still a large gap in using existing metrics to measure user satisfaction, which demonstrates the necessity of investigating new metrics in this respect.

6 Search state detection

Our analysis in Sect. 5 demonstrates that the evaluation metrics reflect user satisfaction to different degrees under different task states. In order to use this result for a more precise evaluation of user satisfaction, we first need to answer RQ2: can we detect the task state of a query activity using in-session signals that can be collected automatically? In this section, we present our approach for detecting task states, which we formulate as a classification task, i.e., classifying a query into one of the defined task states. We have experimented with both feature-based machine learning models (Sect. 6.1) and deep learning-based models (Sect. 6.2).

6.1 Feature-based machine learning models

We consider four of the most commonly used feature-based classification models in our experiments, namely logistic regression (LR), k-nearest neighbors (KNN), support vector machines (SVMs), and random forest (RF). The features used by these models are:

6.1.1 Query-based features

We computed several sets of features based on query related information as follows.

  • Descriptive features: We use all metrics in Table 1 as descriptive features.

  • Term frequency: We also consider the original terms in the query string, represented as a term frequency vector.

  • Readability scores: Query complexity has been found to evolve during the search process (Eickhoff et al., 2014) and thus may provide clues to the search state of the current query. In this work, we compute a set of query readability and complexity scores and use them as features (a brief computation sketch follows this list). Many readability scores and complexity metrics have been proposed over the years; we consider the most commonly used ones, according to the findings of Eltorai et al. (2015) and Zhou et al. (2017):

    • The Flesch Reading Ease (FRES) (Flesch, 1979) is computed from the average sentence length (words per sentence) and the average word length in syllables to measure whether a text is in plain English. The higher the number, the easier the text is to read. It is computed as shown in Eq. 1.

      $$\begin{aligned} FRES = 206.835 - \left( 1.015 * \frac{\#Words}{\#Sentences}\right) -\left( 84.6 * \frac{\#Syllables}{\#Words}\right) \end{aligned}$$
      (1)
    • The Flesch-Kincaid Grade Level (FKGL) (Kincaid et al., 1975) measures the education level (equivalent to a U.S. grade level) required to understand a text. It takes into account the relative number of words per sentence and the number of syllables per word. The higher the result, the more difficult the text is to read. The calculation is shown in Eq. 2.

      $$\begin{aligned} FKGL = \left( 0.39 * \frac{\#Words}{\#Sentences}\right) + \left( 11.8 * \frac{\#Syllables}{\#Words}\right) - 15.59 \end{aligned}$$
      (2)
    • The Gunning Fog Index (GFI) introduces the concept of complex words, defined as words with three or more syllables. In addition to the relative proportion of complex words, the length of sentences is also considered. If the result is over 20, the text is considered difficult to read, while a text with a score of 5 is easily readable (Eltorai et al., 2015).

      $$\begin{aligned} GFI = 0.4 * \left[ \left( \frac{\#Words}{\#Sentences}\right) + 100 * \left( \frac{\#ComplexWords}{\#Words}\right) \right] \end{aligned}$$
      (3)
    • The SMOG Index (SMOG) adopts the concept of complex words (Mc Laughlin, 1969) and measures how many years of education the average person needs to understand a text.

      $$\begin{aligned} SMOG = 1.043 * \sqrt{\#ComplexWords * \left( \frac{30}{\#Sentences}\right) } + 3.1291 \end{aligned}$$
      (4)
    • The Automated Readability Index (ARI) also provides a grade level. It was developed for the U.S. Air Force to determine how readable a text is as it is typed. Senter and Smith chose the number of characters per word and the number of words per sentence to assess readability; the weights of these components were determined by a regression analysis based on 24 books labeled with readability (Zhou et al., 2017).

      $$\begin{aligned} \begin{aligned} ARI&= 4.71 * \left( \frac{\#Characters}{\#Words}\right) + 0.5 * \left( \frac{\#Words}{\#Sentences}\right) - 21.43 \end{aligned} \end{aligned}$$
      (5)
    • The Coleman-Liau Index (CLI) is based on a regression analysis of 36 150-word passages with cloze percentages. The authors included the average number of letters per 100 words and the average number of sentences per 100 words. The result represents the U.S. grade level of the reader's reading skills (Zhou et al., 2017).

      $$\begin{aligned} CLI = \left[ 0.0588 * \left( \frac{\#Characters}{\#Words}*100\right) \right] - \left[ 0.296 * \left( \frac{\#Sentences}{\#Words} * 100\right) \right] - 15.8 \end{aligned}$$
      (6)
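As a minimal sketch, the two Flesch measures above can be computed directly from Eqs. 1 and 2; the syllable counter below is a rough heuristic we assume for illustration, not the exact tooling used in our experiments.

    # Rough sketch of two readability features (Eqs. 1 and 2); the syllable
    # heuristic is an approximation, not an exact dictionary-based count.
    import re

    def count_syllables(word: str) -> int:
        # Count groups of consecutive vowels as a crude syllable estimate.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def readability_features(text: str) -> dict:
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()] or [text]
        words = re.findall(r"[A-Za-z']+", text) or [text]
        syllables = sum(count_syllables(w) for w in words)

        words_per_sentence = len(words) / len(sentences)
        syllables_per_word = syllables / len(words)

        fres = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word  # Eq. 1
        fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59     # Eq. 2
        return {"FRES": fres, "FKGL": fkgl}

    print(readability_features("Difference between the soccer fields in Friesdorf and Bornheim"))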

6.1.2 Online metrics as features

User interaction signals are easily obtained from the search engine log and can potentially be indicators of task state. Therefore, in addition to query-based features, we also use online metrics (see Table 2) computed from in-session signals as features.
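To make the feature-based setup concrete, the sketch below assembles a hypothetical feature matrix and instantiates the four classifiers; the file names and the exact way the query-based and online features are concatenated are assumptions rather than our actual pipeline.

    # Minimal sketch of the four feature-based task state classifiers; the
    # feature matrix X (query-based + online features) and label array y are
    # hypothetical placeholders loaded from files.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X = np.load("query_features.npy")       # one row per query: Table 1 metrics, term
    y = np.load("task_state_labels.npy")    # frequencies, readability scores, online metrics

    models = {
        "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    }

    for name, model in models.items():
        model.fit(X, y)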

6.2 BERT-based language models

Detecting the search state early in a search session, i.e., as soon as the query terms are entered, can enable search systems to adjust ranking optimizations accordingly. Therefore, in addition to the feature-based classification approach that uses various signals in a search session as introduced in Sect. 6.1, we also apply advanced language models to the query terms to detect the search state from their semantics. Models based on Bidirectional Encoder Representations from Transformers (BERT) have been applied to many natural language processing tasks and have achieved superior performance (Devlin et al., 2018). Following this approach, we adopt a two-step pipeline: a pre-training step with unlabeled data and a fine-tuning step with task-specific labeled data.

There are two different tasks in pre-training. The first task is the Masked Language Model (MLM). Earlier systems took a unidirectional approach by examining the language from left to right; BERT instead examines the context bidirectionally. To do this, 15% of the input tokens are masked, and the model predicts the words behind the masked tokens. The second pre-training task is Next Sentence Prediction (NSP), in which the probability that sentence B follows sentence A is determined. In the fine-tuning phase, task-specific data and labels are provided to the model. The texts are encoded with pre-trained embeddings, which are then fed into an output layer for classification, and the task state labels are used to train the classification model. Our BERT model is based on the implementation of the multi-layer bidirectional transformer encoder by Vaswani et al. (2017) from the tensor2tensor (Vaswani et al., 2018) library. The model used in this work corresponds to the \(BERT_{BASE}\) model with 12 layers, a hidden size of 768, and 12 self-attention heads (Devlin et al., 2018).
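For reference, a minimal fine-tuning sketch is shown below. It uses the HuggingFace transformers implementation of \(BERT_{BASE}\) as a stand-in for the tensor2tensor-based setup described above; the query list and the label encoding (0=exploration, 1=exploitation, 2=known-item) are hypothetical placeholders.

    # Minimal sketch of fine-tuning BERT_BASE for 3-way task state classification
    # of query strings; HuggingFace transformers is used here as a stand-in for
    # the tensor2tensor-based implementation described in the text.
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import BertTokenizerFast, BertForSequenceClassification

    queries = ["sports activities", "Football pitch nearby"]  # hypothetical training queries
    labels = torch.tensor([0, 1])                             # 0=exploration, 1=exploitation, 2=known-item

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

    enc = tokenizer(queries, padding=True, truncation=True, max_length=32, return_tensors="pt")
    loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels), batch_size=16)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    for epoch in range(3):
        for input_ids, attention_mask, y in loader:
            out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()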

6.3 Experimental evaluation of task state prediction

To evaluate the prediction performance of the models, we compute the standard precision (P), recall (R) and F1 score for each class. To evaluate the overall result, we compute the accuracy (acc) and the macro average of precision, recall and F1 score. We perform 10-fold cross-validation on each of the two experimental datasets. The evaluation results of the search state prediction models obtained on both datasets are shown in Table 5.
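A minimal sketch of this evaluation protocol, using 10-fold cross-validated predictions with per-class and macro-averaged scores, is shown below; the feature and label files are the same hypothetical placeholders as in the sketch in Sect. 6.1.

    # Minimal sketch of the 10-fold evaluation protocol with per-class and
    # macro-averaged precision/recall/F1; file names are hypothetical.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, classification_report
    from sklearn.model_selection import cross_val_predict

    X = np.load("query_features.npy")       # hypothetical feature matrix (one row per query)
    y = np.load("task_state_labels.npy")    # hypothetical task state labels

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    pred = cross_val_predict(clf, X, y, cv=10)

    print("accuracy:", accuracy_score(y, pred))
    # Per-class P/R/F1 plus macro averages, as reported in Table 5.
    print(classification_report(y, pred, digits=3))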

Table 5 Evaluation result of search state prediction

To answer RQ2, the results show that the applied models achieve over 59.8% accuracy on both datasets, demonstrating that the signals we chose are effective and that task states can be predicted from user interactions. Comparing different models, the BERT-based model using query terms achieved the best performance in terms of both accuracy and average F1 score on both datasets. Among the results of the BERT model, the highest F1 score is achieved in the known-item state, while exploration and exploitation are harder to distinguish.

With respect to the different datasets, we notice a generally lower performance on the TianGong dataset. After investigating the original dataset, we believe the reasons for the lower performance of the state detection models on the TianGong dataset, especially for the exploration state, are the high topic diversity and the smaller number of samples. In terms of topic diversity, the KDD dataset was collected under a setup in which all search activities are related to 9 predefined topics, whereas the TianGong dataset was collected in a field study: the topics are very diverse and the overlap of search topics among participants is very small. Meanwhile, the exploration state has the smallest number of samples, i.e. 356, 2266, and 1252 samples in the exploration, exploitation, and known-item states, respectively, in the TianGong dataset. As a result, the model could not be fully trained. This can also explain why, after adding online features, the performance of the model decreased even further, as there are not enough samples to train a robust model. For the feature-based classifier on the KDD dataset, adding online features on top of query-based features improves the performance of the models, suggesting that user interaction with the search engine can provide signals that indicate the task state.

7 Construction of state-aware evaluation metrics

We found that evaluation metrics correlate differently with user satisfaction under different task states and that it is possible to predict task states based on in-session signals. We now try to develop new evaluation metrics that can better assess user satisfaction and answer RQ3.

7.1 Task formulation

As a preliminary attempt, we aim at creating explainable state-aware evaluation metrics. Hence we choose to use a linear regression model to combine existing metrics and features to better measure satisfaction. The basic linear regression model for multiple features is as follows:

$$\begin{aligned} y = \beta _0 x_{0} + \beta _1 x_{1} + \cdots + \beta _p x_{p} + \epsilon \end{aligned}$$

\(X = \{x_0, x_1, \ldots , x_p\}\) is the set of features we are using. The \(\beta \)s are the coefficients for each feature, \(\epsilon \) is the bias, and y is our target, i.e., user satisfaction. We perform least squares on this formula and try to optimize the \(\beta \)s. The method we use to perform the linear regression is implemented in the scikit-learn library. To construct the new evaluation metrics by linear regression, we consider the query, online and offline evaluation metrics as introduced in Sect. 5.1 and the readability scores computed from query terms as described in Sect. 6.1 as input features. We start by fitting the linear model with all considered features on the two datasets and for each search state, respectively.
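A minimal sketch of this per-state fitting procedure is shown below; the DataFrame layout, file name, and column names are hypothetical placeholders, and the 70/30 split anticipates the evaluation setup described in Sect. 7.2.

    # Minimal sketch of fitting a state-specific combined metric with linear
    # regression; file name and column names are assumptions.
    import pandas as pd
    from scipy.stats import pearsonr
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("query_level_metrics.csv")   # hypothetical: one row per query
    feature_cols = [c for c in df.columns if c not in ("query_id", "task_state", "satisfaction")]

    for state, group in df.groupby("task_state"):
        train, test = train_test_split(group, train_size=0.7, random_state=0)
        reg = LinearRegression().fit(train[feature_cols], train["satisfaction"])
        new_metric = reg.predict(test[feature_cols])      # value of the fitted state-specific metric
        r, _ = pearsonr(new_metric, test["satisfaction"])
        print(f"{state}: correlation of new metric with satisfaction = {r:.3f}")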

7.2 Experimental results

We fitted new metrics under two settings: (1) general metrics (\(new-all\)) that do not distinguish between task states, and (2) state-specific metrics (\(new-st\)) that are trained on state-specific data. To fit the linear model, we split the data into 70% for model fitting and 30% for evaluation. To improve the interpretability and efficiency of the fitted linear model, we applied the sequential forward selection strategy (John et al., 1994; Last et al., 2001). The highest correlations obtained under different setups are shown in Table 6.
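As a sketch of the feature selection step, the snippet below uses scikit-learn's SequentialFeatureSelector as a stand-in for the forward selection procedure; the file name, column names, and feature-count range are hypothetical.

    # Minimal sketch of sequential forward selection over the candidate metrics;
    # uses scikit-learn's SequentialFeatureSelector with hypothetical inputs.
    import pandas as pd
    from scipy.stats import pearsonr
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("query_level_metrics.csv")   # hypothetical: one row per query
    feature_cols = [c for c in df.columns if c not in ("query_id", "task_state", "satisfaction")]
    train, test = train_test_split(df, train_size=0.7, random_state=0)

    for k in range(1, 11):                        # vary the number of selected features (cf. Fig. 3)
        selector = SequentialFeatureSelector(
            LinearRegression(), n_features_to_select=k, direction="forward", cv=5
        )
        selector.fit(train[feature_cols], train["satisfaction"])
        selected = [c for c, keep in zip(feature_cols, selector.get_support()) if keep]
        reg = LinearRegression().fit(train[selected], train["satisfaction"])
        r, _ = pearsonr(reg.predict(test[selected]), test["satisfaction"])
        print(f"k={k}: {selected} -> correlation {r:.3f}")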

Based on the results, we can answer RQ3: compared to existing metrics, both in the general setting and in the different task states, the highest correlations (bolded in Table 6) are all achieved by the new metrics, demonstrating that user satisfaction can be better measured by combining online signals and existing metrics. The general metrics (\(new-all\)) achieved a higher correlation with user satisfaction than the existing metrics in most cases, with one exception (underlined in Table 6) on the TianGong dataset. One possible reason is that the TianGong dataset contains fewer online signals and more diverse topics and user behavior, while the task state distribution is unbalanced. Therefore, without enough meaningful signals to distinguish the characteristics of different search states, the trained metrics are better at capturing the easier patterns in the larger known-item class, resulting in less predictive power for user satisfaction in other search states. Comparing the general and state-specific metrics, the state-specific metrics outperformed the general metrics in more cases, even with less training data. These results suggest that having state-specific metrics for certain task states can be useful to better measure user satisfaction.

Comparing the different feature groups, we observe that on both datasets and in all states, the highest correlations are obtained by using both online and offline features. Overall, the correlation values between the new metrics and user satisfaction are lower on the TianGong dataset than on the KDD dataset. The metrics are less stable on the TianGong dataset when only online features are used. A possible reason is that, due to the diversity of tasks in the TianGong dataset, more training data is needed for the models to learn robust metrics.

Table 6 Highest correlation of evaluation metrics in each setup

The result of selecting a different number of features is shown in Fig. 3, where the x-axis shows the number of features used by the linear function, and the corresponding value on the y-axis is the correlation between the new metric and user satisfaction on the validation set.

On the KDD dataset (Fig. 3a and b), we observe that the correlations between the new metrics and user satisfaction increase up to a point as the number of features increases. This suggests that we can better assess user satisfaction with the search result by considering multiple metrics simultaneously. For the exploration state, however, the correlation starts to drop rapidly after a certain number of features is reached. This drop is likely caused by model overfitting, as the exploration state has the smallest number of samples.

Similar trends are shown on the TianGong dataset (Fig. 3c and d), where the combination of multiple metrics can better reflect user satisfaction than a single metric. A possible overfitting effect is shown by increasing the number of features for the exploration state in both feature settings, i.e. the class with fewer samples compared to the exploitation and known-item states.

Fig. 3 Pearson correlation between new evaluation metrics and user satisfaction. Blue: exploration state; Green: exploitation state; Red: known-item state; All: without distinguishing task states (Color figure online)

8 Discussion and conclusions

People increasingly rely on search and recommendation technologies to perform information-intensive tasks and make complex decisions. In contrast to simple, fact-finding retrieval tasks, complex search tasks often involve prolonged search sessions and motivate users to achieve different intentions and subgoals across different task states in information seeking and search episodes (Mitsui et al., 2016). While different states and intentions may affect the way users evaluate the performance of search systems (Liu et al., 2019b), how to adaptively evaluate systems and employ the appropriate metric(s) that best reflect task states and user satisfaction criteria currently remains an open challenge. This gap is unlikely to be addressed by applying one or two mainstream “unified metrics” such as precision and nDCG.

To address this challenge, our study goes beyond the mainstream Cranfield-style offline evaluation approach (Voorhees, 2001) and seeks to develop an adaptive evaluation approach by (1) meta-evaluating the performance of different metrics under different task states and (2) developing new metrics that better capture users' in-situ satisfaction. In this paper, we first investigate the correlation between existing IR evaluation metrics and user satisfaction under different task states and find that the degree to which a metric can reflect user satisfaction varies across task states. The analysis extends previous work (Liu & Yu, 2021) not only by considering a more complete set of evaluation metrics but also by conducting the analysis on datasets collected from different search scenarios, i.e., both a field study and a task-oriented lab study, to mitigate the impact of the experimental setup. In the next step, we experiment with the automatic detection of the task state of a search query based on in-session signals. In addition to the models used in the baseline work (Liu & Yu, 2021), we also applied a BERT-based classification model to query terms and achieved superior performance compared to the other feature-based models. To better assess user satisfaction, we constructed a set of new evaluation metrics using linear regression. The results demonstrate that the new metrics can better reflect satisfaction than the existing metrics in the same category, i.e., online or offline metrics. In this study, we experimentally found the best combined metric in each configuration. When applying the combined metric to support real-world applications, such as real-time satisfaction prediction or search result re-ranking, the metric could be further reduced depending on the objective and feature availability. For example, in real-world applications, offline metrics are not available, so it may be necessary to construct an evaluation metric based only on online signals.

The characteristics of users, such as their level of knowledge about the topic and their familiarity with search engines, can have a strong impact on their level of satisfaction with different search results. Incorporating user characteristics into the evaluation metric could potentially increase its effectiveness in measuring satisfaction. Limited by the availability of user information and the diversity of users in the available datasets, we did not experiment with user features in this work. Since it is costly to obtain user satisfaction, document relevance, and task state annotations, the main limitation of our work is the size of our experimental datasets. Meanwhile, the task state classes are unbalanced, and some of the classes have few training examples for building the classifiers in Sect. 6. For these reasons, we believe that the power of the features and classifiers is limited. Nevertheless, our experiments demonstrated the potential of a state-aware approach in understanding users' intentions and supporting interactive IR activities, and presented new metrics that can more accurately characterize users' in-situ experiences. Together, the behavioral features, classifiers, and state-aware evaluation metrics provide a methodological and empirical foundation for more fine-grained user modeling, adaptive search recommendations, and dynamic IR evaluation. In future work, we plan to develop heuristics that enable user search and annotation data collection on a larger scale. We will also seek to verify our results, fine-tune the task state classifiers, and meta-evaluate the new metrics on more diverse datasets, task types, and user populations.