Retrievability in an Integrated Retrieval System: An Extended Study

Retrievability measures the influence a retrieval system has on the access to information in a given collection of items. This measure can support the evaluation of a search system and yield insights into its behaviour. In this paper, we investigate retrievability in an integrated search system consisting of items from various categories, focussing particularly on datasets, publications and variables in a real-life Digital Library (DL). The traditional metrics, that is, the Lorenz curve and Gini coefficient, are employed to visualize the diversity in retrievability scores of the three retrievable document types (specifically datasets, publications, and variables). Our results show a significant popularity bias, with certain items being retrieved more often than others. In particular, certain datasets are more likely to be retrieved than other datasets in the same category. In contrast, the retrievability scores of items from the variable or publication category are more evenly distributed. We observe that the distribution of document retrievability is more diverse for datasets than for publications and variables.


Introduction
In the present era of information, we are generating a colossal amount of data that needs to be handled and processed efficiently for quick look-ups. The rapid advancement in technologies has made data generation even more complex, with diversified forms of information coming from divergent sources. This necessitates a federated or integrated system (Adali and Emery, 1995; Arguello, 2012) that searches and assimilates results from assorted sources. Textual data remains the predominant type among them, and significant research has been conducted in the domain of textual document retrieval. Among the rest, recent research on dataset retrieval (Kunze and Auer, 2013) has become increasingly important in the (interactive) information retrieval and digital library communities. One of the reasons is undoubtedly the enormous number of research datasets available. However, the underlying characteristics of dataset retrieval also contribute to the attention in this area. One often-mentioned characteristic is the increased complexity of dataset retrieval over traditional document retrieval. While the latter is well-known and adequately studied, datasets often include more extensive material and structures that are relevant for retrieval. This may involve the raw data, descriptions of how the data was collected, taxonomic information, questionnaires, codebooks, etc. Recently, numerous studies have been conducted to further identify the characteristics of dataset retrieval. These studies include the observation of data retrieval practices (Krämer et al, 2021), interviews and online questionnaires (Kern and Mathiak, 2015; Friedrich, 2020) and transaction log analysis (Kacprzak et al, 2017; Carevic et al, 2020).
In this paper, we follow a system-oriented approach for studying dataset retrieval. By employing the measure of retrievability (Azzopardi and Vinay, 2008), we aim to gain insights into the particularities of dataset retrieval in comparison to traditional document retrieval. The measure of retrievability was initially developed to quantify the influence that a retrieval system has on access to information. In a simplified way, retrievability represents the ease with which a document can be retrieved given a particular IR system (Azzopardi and Vinay, 2008). The measure of retrievability can be utilised for several use cases.
As an extension of our prior work (Roy et al, 2022), we investigate the retrievability of various types of documents in an integrated digital library, GESIS Search (see Section 3), focusing on various types of data, particularly datasets, publications and variables. The assumption followed here is that in an ideal ranking system¹, the retrievability of each indexed item (dataset or other publication) is equally distributed. Accordingly, a deviation from this assumption may reveal an inequality between the items in a collection caused by the system. By employing a measure of retrievability, we expect to gain further insight into the characteristics of dataset retrieval compared to traditional document retrieval.

Research questions
We verify the research questions put forward and discussed by Roy et al (2022) in the updated system with a variety of item types tested with more queries (see Section 4). Similar to the previous work, we examine the following research questions on the integrated search system GESIS Search, focusing on an additional type of item, Variables, together with Publication and Dataset:
• RQ1: In the integrated search system with various types of items, can we observe any prior bias of accessibility of documents from a particular type?
• RQ2: Can we formalize this type-accessibility bias utilizing the concept of document retrievability?
• RQ3: How diverse are the retrievability score distributions in the different categories of documents in our integrated search system?
Our previous study (Roy et al, 2022) was designed to take all queries in the query log into account. This had the benefit of being as close to the real search behaviour as possible. At the same time, this design choice introduced a popularity bias caused by recurring queries that positively influence the retrievability score of documents in the corresponding result set. Additionally, the popularity bias of queries has been ignored in this work. Thus, contrasting with the previously reported results, we address the following research question:
• RQ4: In a real-life search system, does popularity bias of queries influence the inequality in any way?
In sum, our contributions are as follows: 1) we utilize the retrievability measure to better understand the diversity of accessing datasets in comparison to publications with real-life queries from a search log; 2) building on retrievability, we propose to employ the measurement of usefulness, which represents implicit relevance signals observed for datasets and publications. Our understanding of bias follows the argumentation provided in Wilkie and Azzopardi (2017), where bias denotes the inequality between documents in terms of their retrievability within the collection. Bias can be observed when a document is overly or unduly favoured due to some document features (e.g. length, term distribution, etc.) (Wilkie and Azzopardi, 2014a).
The rest of the paper is organized as follows. We first present background and related work in Section 2, together with a formal introduction of the concept of retrievability. The integrated search system GESIS Search, along with the motivation of our retrievability study, is presented in Section 3. Section 4 discusses the empirical results and analysis of the outcome of the experimentation, before Section 5 introduces the novel concept of usefulness along with its experimental study. We conclude the paper in Section 6, highlighting the contributions and findings of the paper with directions to extend the work.

Background and related work
Considering a collection of items, the retrievability of items can be defined as how accessible or findable the items are by some searching technique. In the context of document retrieval, the concept was developed and proposed in Azzopardi and Vinay (2008). Informally, the retrievability of a document in a collection indicates the expectation of selection of the document by some retrieval model within a rank cutoff. Mathematically, the retrievability of any document d in a collection C is defined as:

$$r(d) = \sum_{q \in Q} w_q \cdot f(\mathrm{rank}(d, q, M), c) \tag{1}$$

where
• Q: the set of all queries which are answerable by the collection;
• w_q: weight of the query q;
• rank(d, q, M): rank of the document d when retrieval is performed with query q using retrieval model M;
• c: the rank cutoff.
The function f(rank(d, q, M), c) is an indicator function that returns either 1 or 0 depending on whether the rank rank(d, q, M) of document d is within the rank cutoff c or not. The indicator function can be mathematically defined as the following:

$$f(\mathrm{rank}(d, q, M), c) = \begin{cases} 1 & \text{if } \mathrm{rank}(d, q, M) \le c \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

In Equation 1, the retrievability of a document is computed based on retrieval performed with the set of all queries Q addressable by the document collection. Considering a sizeable collection of documents, there can be infinitely many distinct queries that can be answered by various documents in the collection. One of the practical approaches to get this set of all queries Q is to use a query log; however, acquiring such a log is not always feasible. In its absence, a query-based sampling method (Callan and Connell, 2001) can be applied to randomly populate Q. In Azzopardi and Vinay (2008), the authors considered generating queries from unigrams and bigrams whose collection frequency lies above a threshold. This approach may result in an enormous number of queries if a large collection of documents is considered. To keep the experimental setup tractable, one approach here is to truncate the list again based on a certain threshold value (e.g. 2 million as selected by Azzopardi and Vinay). Hence, the construction of Q based on either a query log or random sampling of terms from the collection is a practical approximation that we can adopt in order to realize the concept of retrievability of documents in a collection. The query weight w_q in Equation 1 may be used for incorporating a bias (such as popularity, importance, etc.) in the retrievability computation. Ignoring these biases, this weight is considered uniform for all queries in earlier works (Azzopardi and Vinay, 2008; Bashir and Rauber, 2009a,c). The approximated retrievability score r(d) of document d will then be a discrete value x indicating the number of queries for which d is retrieved within the top c ranks. Certainly, this is a simplifying assumption, and the queries submitted to a search system in practice vary vastly both in terms of popularity and difficulty (Carmel et al, 2006).
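The definition above can be illustrated with a minimal Python sketch. It assumes that the ranked lists have already been produced by some retrieval model M and that each list contains a document at most once; the function name `retrievability`, the toy queries and the document identifiers are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def retrievability(ranked_lists, cutoff, weights=None):
    """Cumulative retrievability: r(d) = sum over q of w_q * f(rank(d, q, M), c).

    ranked_lists: dict mapping a query to its ranked list of document ids
                  (best first), assumed to come from some retrieval model M.
    cutoff:       rank cutoff c; a document counts only if its rank <= c.
    weights:      optional dict of query weights w_q; uniform (1.0) if omitted.
    """
    scores = defaultdict(float)
    for query, ranking in ranked_lists.items():
        w_q = 1.0 if weights is None else weights.get(query, 1.0)
        # Slicing to the first `cutoff` entries plays the role of the
        # indicator function f: documents ranked below c contribute 0.
        for doc_id in ranking[:cutoff]:
            scores[doc_id] += w_q
    return dict(scores)


# Hypothetical toy data: three queries, each with a ranked list of three ids.
runs = {
    "unemployment survey": ["d1", "d2", "d3"],
    "election study": ["d2", "d4", "d1"],
    "health panel": ["d2", "d5", "d6"],
}
r = retrievability(runs, cutoff=2)
```

With uniform weights, each score is simply the number of queries that retrieve the document within the cutoff; documents that never appear within the cutoff are absent from the result and implicitly have r(d) = 0.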
The second factor of the per-query component in Equation 1 is a boolean function that depends solely on the rank at which document d is retrieved. Increasing the value of the rank cutoff (c) broadens the domain of documents retrieved, which will positively influence the retrievability scores of more documents. Note that being selected by a retrieval model for some queries does not ensure the relevance of the document, which can only be assessed by human judgements.
Retrievability as a measure was proposed in Azzopardi and Vinay (2008), where the authors experiment on two TREC collections with queries generated using a query-based sampling technique (Callan and Connell, 2001). Since then, retrievability has been primarily used to detect bias in ranking systems. For instance, Samar et al (2018) employ retrievability to research the effect of bias across time for different document versions (treated as independent documents) in a web archive. Their results show a ranking bias for different versions of the same document. Furthermore, the study confirms a relationship between retrievability and findability measured by Mean Reciprocal Rank (MRR). They follow the assumption that the lower a document's retrievability score, the more difficult it is to find the document. Another application of the retrievability measure can be found in patent or legal document retrieval, which provides a unique use case due to its recall-oriented application. In two studies (Bashir and Rauber, 2009a,c), the authors look at document retrievability measurements and argue that a single retrievability measure has several limitations in terms of interpretability. In Bashir and Rauber (2009a), they try to improve accessibility measurement by considering sets of relevant and irrelevant queries for each document. In this way, they try to simulate recall-oriented users. In addition, they plot different retrievability curves to better spot the gaps between an optimal retrievable system and the tested system. The other work (Bashir and Rauber, 2009c) analyzes the bias impact of different retrieval models and query expansion strategies. Their experiments show that clustering-based document selection for pseudo-relevance feedback is an effective approach for increasing the findability of individual documents and decreasing the bias of a retrieval system. Further research on patent retrieval reported in Bashir and Rauber (2009b) and Bache and Azzopardi (2010) identifies content-based features that can be used to classify a set of documents based on their retrievability. Experiments on various patent collections show that these features can achieve more than 80% classification accuracy.
A study on the query list generation phase for determining the measure of retrievability is presented in Bashir and Rauber (2011). The study addresses two central problems when determining retrievability: 1) query selection and 2) query characteristics identification. It is argued that the query selection phase is usually performed individually without well-accepted criteria for query generation. Hence their goal is to evaluate how far the selection of query subsets provides an accurate approximation of retrieval bias. The second shortcoming is addressed by determining retrievability bias considering different query characteristics. In their experiments, they recognise that query characteristics influence the increase or decrease of retrievability scores. A topic-centric query generation technique, tested on the Associated Press (AP) document collection, is proposed in Wilkie and Azzopardi (2016). A significant correlation is reported between the traditional estimate of Gini and the estimate produced by this topic-centric query method. As recognised in Bashir and Rauber (2011), the majority of retrievability experiments employ simulated queries to determine retrievability. To study the ability of the retrievability measure to detect a potential retrievability bias using real queries issued by users, Traub et al (2016) conducted an experiment on a newspaper corpus. Their study confirms the ability to expose retrievability bias within a more realistic setting using real-world queries. A comparison of simulated and real queries with regard to retrievability scores further shows considerable differences, which indicate a need for improved construction of simulated queries. To see if there is any correlation between the retrievability bias and performance measurement, in another study, Wilkie and Azzopardi (2014b) examine the relationship between retrieval bias and ten retrieval performance measures. Experimentation on TREC ad-hoc data demonstrates that the retrievability bias hypothesis tends to hold for most of the performance measurements.
Retrievability of documents indicates the chance of selection by a retrieval model for various queries submitted. However, the selection of a document does not mean that the document is indeed useful in addressing the information need generating the query. This can only be realised by using document consumption signals (e.g. in the form of relevance judgements). This concept was first introduced in Cole et al (2009) as a criterion to determine how well a system is able to solve a user's information need. In their work, Cole et al denoted this notion as usefulness. In Hienert and Mutschke (2016), it has been operationalised within a log-based evaluation approach to determine the usefulness of a search term suggestion service. The usefulness has been further operationalised in Carevic et al (2018) to determine the effects of contextualised stratagem browsing on the success of a search session.
Recently, a considerable amount of research has been carried out concerning the characteristics of dataset retrieval. A comprehensive literature review on dataset retrieval is provided in Gregory et al (2019), focusing on dataset retrieval practices in different disciplines. Research in this area covers, for instance, the analysis of information-seeking behaviour during dataset retrieval through observations (Krämer et al, 2021), questionnaires and interviews (Kern and Mathiak, 2015; Friedrich, 2020), and transaction log studies (Kacprzak et al, 2017; Carevic et al, 2020). In Kern and Mathiak (2015), the authors investigated the requirements that users have for a dataset retrieval system. Their findings on dataset retrieval practices suggest that users invest greater effort during relevance assessment of a dataset. They conclude that the selection of a dataset is a much more important decision compared to the selection of a piece of literature. This results in high demands on metadata quality during dataset retrieval. The complexity of assessing the relevance of a dataset is also highlighted in Krämer et al (2021). Besides topical relevance, access to metadata as well as documentation about the dataset plays a crucial role. A query log analysis from four open data portals is presented in Kacprzak et al (2017). Their study indicates differences between queries issued towards a dataset retrieval system and queries in web search. In a subsequent study (Kacprzak et al, 2018), the extracted queries are further compared to queries generated from a crowdsourcing task. The intuition and focus of this work is to determine whether queries issued towards a data portal differ from those collected in a less constrained environment (crowdsourcing).

Retrievability in an integrated retrieval system
We define an integrated search system as a system that searches multiple sources of different types and integrates the output in a unified framework².
The retrieval in such a system requires sophisticated decision-making considering the various modalities in documents in the collection of data.
Following Equation 1, the retrievability score of a document depends on the other documents in the collection³: considering a rank cutoff c, the rank of a document under consideration can be greater than c because the documents taking the top c positions are more relevant or are duplicates (Nikkhoo, 2011). Another factor that can influence the retrievability score of a document is its popularity; a popular document will be retrieved multiple times by users over time. In the case of an integrated search engine, where the documents belong to various categories, some particular types could have higher chances of being retrieved than others. In general, there can be some disparity in the number of documents of various categories being retrieved, which can be a result of popularity bias in the collection. This type of popularity bias can impede the satisfaction of the information need of a user and, in turn, can affect the performance of the system. The satisfaction of a user can only be established via direct feedback from them. In the absence of such explicit information, it is difficult, if at all possible, to understand whether the information need is fulfilled or not. In this article, we present an extended study of the diversity in retrievability scores for different categories of documents in the integrated search system GESIS Search⁴ (Hienert et al, 2019).

Experimental Study
As presented in Section 3, we use the integrated search system with various categories of documents in this work. In this section, we start by describing the data that we have used in the work along with different statistics of the data; this will be followed by the experimental evaluation of the study.

Datasets
We conduct our experimentation on the integrated search system GESIS Search, containing a total of 860K indexed records (as of November 2022) in different categories such as Research Data, Publications, Variables, Instruments etc. Social science publications that are indexed in GESIS Search use and reference survey datasets, containing hundreds or thousands of questions. These questions use so-called survey variables (variables in the following). From an information retrieval perspective, variables in GESIS Search are information objects like datasets, with specific metadata elements such as question text, answer categories and frequency tables.
A screenshot showing the interface of GESIS Search is presented in Figure 1. See an example of a variable description in Figure 2 and the corresponding link to the variable record QD3 15 in GESIS Search⁶. The indexed records in GESIS Search are divided into six categories based on their types, covering more than 122K publications, 64K research data (also referred to as datasets), and more than 520K variables. Given a query, the system returns six search result pages (SERPs), one for each of the categories (see Figure 1). The segregation of the SERPs enables us to study the retrievability of the different types. In this study, we specifically focus on the three categories having the largest number of entries, that is, datasets, publications, and variables.
In the integrated search system, the interaction of the users with the system is logged and stored in a database. A total of more than 40 different interaction types are stored, covering, for instance, searches (queries), record views and export interactions (Hienert et al, 2019). The export of a record is an umbrella category that includes various interactions such as bookmarking, downloading or citing. These interactions are specifically useful for the application of implicit relevance feedback, as they indicate a relevance of a record that goes beyond a simple record view. The interaction log of the search system provides the basis for our analysis in Section 4.4 (and later in Section 5.2). These real-user queries form the basis of determining the retrievability of documents. This ensures realistic queries in Q of Equation 1, as opposed to the simulated queries used in Azzopardi and Vinay (2008) or Traub et al (2016). The data used in this study is an extended version of that used in our previous work (Roy et al, 2022); in this log, all the interactions of real users with the search system were recorded for a period of five years, specifically between July 2017 and July 2022. The log records more than 2.3 million queries submitted to the integrated search system. Detailed statistics regarding the extracted interactions utilized in our study can be found in Table 1. Together with the previous observations for the record types Publication and Dataset, we report the results on another category, the Variables.
Repeated queries can influence the retrievability score of a document. Formally, the set of all queries Q in Equation 1 may contain the same queries more than once. For synthetically generated queries (used by Azzopardi and Vinay (2008) and Bashir and Rauber (2009c)), this can be avoided by keeping track of the already generated queries. However, the query log of a real-life search system records all such instances where the same queries are submitted multiple times by the users. This factor additionally introduces popularity bias into the retrievability of documents in the form of query popularity. The results and observations reported in our earlier study (Roy et al, 2022) were based on this type of interaction log. In order to understand the retrievability in isolation from the query popularity factor, we have only considered unique queries in this work.
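Restricting a query log to unique queries can be sketched as follows. This is only an illustration: the normalisation applied before comparing queries (lower-casing and collapsing whitespace) is an assumption made for this example, as is the function name `unique_queries`; the exact notion of query identity used on the real log is not specified here.

```python
def unique_queries(query_log):
    """Return the distinct queries of a log in first-seen order."""
    seen = set()
    unique = []
    for raw in query_log:
        # Assumed normalisation: lower-case and collapse runs of whitespace.
        key = " ".join(raw.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical log entries; the repeated query is kept only once.
log = ["ALLBUS 2018", "allbus  2018", "eurobarometer", "ALLBUS 2018"]
q_u = unique_queries(log)
```

Computing retrievability once over the full log and once over the deduplicated list allows the two resulting score distributions to be contrasted.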

Measuring retrievability in a collection
One way of quantifying the information coverage of a collection is by the count of queries that can be addressed (or answered) by the items in the collection. From the traditional point of view of a web search, the most sought-after way of composing queries is using free text, where vocabulary terms are used to represent an information need. In a moderate-sized document collection, an intractable number of free-text queries are possible. Also, due to the significant number of documents that can match a free-text query, a boolean matching algorithm is not sufficient; this led to the development of ranked retrieval, which returns an ordered list of items sorted by their relevance. Considering a traditional document collection C, not all documents are equally important to a query, hence the need for ranked retrieval. Now, given a set of all possible queries Q, some documents in C will be relevant to more queries than others (depending on the topical coverage of the document), which can be measured by the concept of retrievability (see Section 2). Formally, with the notion of retrievability, some documents will have higher r(d) in a collection, resulting in an unequal distribution of retrievability scores. Similar types of inequalities are observed in economics and social sciences, and they are traditionally measured using the Gini coefficient or Lorenz curve (Gastwirth, 1972), which measure the statistical dispersion in a distribution⁷.
Mathematically, the Gini coefficient (G) of a certain value v in a population P can be defined as:

$$G = \frac{\sum_{i=1}^{N} (2i - N - 1)\, v(i)}{N \sum_{j=1}^{N} v(j)} \tag{3}$$

where N is the size of the population and v(i) specifies the value of the i-th item in P, with the items ordered by ascending value. The Gini coefficient of the population lies between 0 and 1 and grows with the inequality inherent in the population: a higher value of G indicates greater disparity and vice versa. In other words, a value of G equal to 0 in Equation 3 indicates that all the items in the population are equally likely to be selected, whereas higher values of G specify a bias implying that only certain items will be selected.
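As a small illustration of the Gini coefficient described above, the following Python sketch uses the standard sorted-rank formulation. It is not the implementation used in the experiments; the function name `gini` and the convention of returning 0 for an empty or all-zero input are assumptions made for this example.

```python
def gini(values):
    """Gini coefficient of non-negative values via the sorted-rank formula.

    G = sum_i (2i - N - 1) * v(i) / (N * sum_j v(j)), with the values
    sorted in ascending order and i running from 1 to N.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        # Assumed convention: no mass to distribute means no inequality.
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


equal = gini([5, 5, 5, 5])       # every item equally retrievable
skewed = gini([0, 0, 0, 10])     # a single item receives all retrievals
```

For a finite population of N items, the largest value attainable with this formula is (N - 1) / N, which approaches 1 as the collection grows.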

Experimentation
As explained in Section 2, the retrievability of a document is a measurement of how likely the document is to be retrieved by any query submitted to the system⁸. Hence, the study of retrievability in a collection of documents requires rigorous retrieval with a set of diversified queries to cover all topics discussed in the collection. In other words, the retrievability of the documents should be calculated considering all sorts of queries submitted to the system. However, an infinite number of free-text queries can be answered by a collection. To cover all the topics, a traditional approximation is to simulate a set of queries randomly, accepting the risk of erratic queries not aligned with the real scenario (Azzopardi and Vinay, 2008; Traub et al, 2016). With the availability of a query log, the process of query generation can be made more formalized and streamlined to consider the actual queries submitted by real users. For the study reported in this article, we utilize the query log presented in Section 4.1.
As reported in the earlier study, the retrievability distribution in a collection depends on the employed retrieval model (Azzopardi and Vinay, 2008). Following the findings by Azzopardi and Vinay, we use BM25 as the retrieval model (Sparck Jones et al, 2000). Particularly, we use the implementation available in Elasticsearch⁹, which uses Lucene¹⁰ as the background retrieval model. Following Equation 1, the retrievability of a document depends on the selection of the rank cutoff value (c), a rank threshold that indicates how deep in the ranked list we explore before finding that document. Considering the model employed for retrieval and the set of all queries Q as fixed, c is the only parameter in calculating the retrievability. For a query q, setting a lower value of c will reduce the number of documents being considered retrievable, because f(rank(d, q, M), c) will be 1 only if rank(d, q, M) ≤ c (see Equation 2). Having a higher value of c will allow more documents to be considered retrievable, reducing the overall inequality. In this study, we have varied the value of c in the range 10 to 100 in steps of 10 and have analyzed the observations, which are reported in the next section¹¹.
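The effect of the rank cutoff can be made concrete with a short sketch that recomputes uniform-weight retrievability counts for several cutoffs from the same ranked lists. The ranked lists are assumed to have been retrieved once, to the deepest cutoff of interest, by whatever retrieval model is in use; the function name `retrievability_sweep` and the toy data are hypothetical.

```python
from collections import Counter


def retrievability_sweep(ranked_lists, cutoffs):
    """Map each cutoff c to a Counter of r(d) values (uniform query weights)."""
    sweep = {}
    for c in cutoffs:
        counts = Counter()
        for ranking in ranked_lists.values():
            # Only the top-c documents of each query count at this cutoff.
            counts.update(ranking[:c])
        sweep[c] = counts
    return sweep


# Hypothetical ranked lists for two queries.
runs = {
    "q1": ["a", "b", "c", "d"],
    "q2": ["b", "a", "e", "f"],
}
sweep = retrievability_sweep(runs, cutoffs=[1, 2, 4])
```

In this toy example only two documents are retrievable at c = 1, whereas all six are retrievable at c = 4. The range of cutoffs used in this study corresponds to `range(10, 101, 10)`.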

Observation and analysis
We start this section by describing different statistical properties of the retrievability distribution of items (from all three document types that we experimented with) when the value of c is varied. The mean (µ), geometric mean (gµ), variance (σ²), and standard deviation (σ) of the retrievability score distributions for the different types (publication, dataset and variable) are given in Table 2. In general, it can be noticed that all the statistical measures for datasets are far more diverse than for the other categories. On varying the value of c from 10 to 100, we observe a change of more than 140% and 220% in mean retrievability scores in the case of publications and datasets respectively, while only a 45% change is noticed in the case of variables. In comparison to our earlier work (Roy et al, 2022), we can see that these changes in the retrievability scores are moderate and not as substantial as seen before. Note that we have excluded repeated queries from the interaction log in this work, which were considered in Roy et al (2022). This indicates that a significant number of repeated queries submitted to the system had contributed to the substantial change reported earlier, resulting in a vast diversity in retrievability scores (see Roy et al (2022), Table 2). Similar trends are recorded for variance and standard deviation as well, when computed using the distribution of r(d) on all three categories with different c values. From Table 2, we can conclude that most of the statistical measurements (specifically mean, variance, and standard deviation) are higher for datasets than for publications. In comparison, the geometric mean (g-µ in Table 2) is seen to be higher for publications than for datasets at the lower rank cutoffs. However, the geometric mean of the retrievability of datasets surpasses that of publications at the rank cutoff 100. Combining the observations that can be drawn from the geometric mean values with the other statistics, we can perceive that for some dataset items, the retrievability values are high (popular datasets retrievable by a large number of queries); at the same time, there are datasets with poor r(d) values that are rarely retrieved through the submitted queries. The first category of datasets contributes to the high mean of r(d), which is consistent across different c values, while the datasets of the second category cause the geometric mean to fall. For the variables, all these measures are noticeably smaller than for publications and datasets. The reason behind this is the relatively small number of queries of the variable category compared to the other types; as a result, the variables in general are selected for fewer queries in comparison to the other categories. These variations are presented graphically in Figure 3.
As proposed in Azzopardi and Vinay (2008) and used in our earlier work (Roy et al, 2022), we utilize the Gini coefficient (G) to quantify the variation in retrievability scores, and the Lorenz curve to graphically represent the disparity in retrievability among the items in different categories. Figure 4 plots the Lorenz curve with the r(d) scores computed separately for publications, datasets and variables. To consider the highest coverage, we set the rank cutoff c to 100 while plotting the r(d) values¹². From Figure 4, it is seen that the retrievability of datasets (presented in Figure 4b) is more imbalanced than that of the other two types, with a Gini coefficient of 0.7000. Also, variables are seen to be the closest to equality (in Figure 4b), attaining a Gini coefficient of 0.4806.
Fig. 4: The Lorenz curve with the retrievability (rank cutoff set to 100). The straight line going through the origin (in black) indicates the equality, that is, when all the documents are equally retrievable.
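The points of a Lorenz curve such as the one discussed above can be derived from a list of retrievability scores as in the following sketch. It is an illustration only: the function name `lorenz_points` is hypothetical, and returning the line of equality for an empty or all-zero input is an assumed convention.

```python
def lorenz_points(values):
    """Return (population share, cumulative score share) points of a Lorenz curve."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        # Assumed convention: fall back to the line of equality.
        return [(0.0, 0.0), (1.0, 1.0)]
    points = [(0.0, 0.0)]
    running = 0
    for i, x in enumerate(xs, start=1):
        running += x
        # After the i lowest-scoring items: share of items vs. share of score.
        points.append((i / n, running / total))
    return points


balanced = lorenz_points([1, 1, 1, 1])    # lies on the diagonal
skewed = lorenz_points([0, 0, 0, 10])     # stays at 0 until the last item
```

The further the curve sags below the diagonal, the larger the area between the two, and twice that area corresponds to the Gini coefficient.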
As discussed in Section 2, the retrievability score of documents escalates with higher values of c; consequently, the overall retrievability-balance of the collection also changes positively bringing in the curve close to the equality.To empirically see this variation, Gini coefficients attained at different rank cut-offs are presented in Table 3 which is also graphically displayed in Figure 5. From the table, it can be noticed that the fall in G for variables (green curve in Figure 5) is more than 45%.From a severe unequal distribution with G having 0.8281 till rank 10 (highest among all the categories), the Gini value falls sharply at 0.4806 when the rank cut-off is set to 100.This indicates that more variables are discernible if the ranked list is explored beyond the top position.
Additionally, we report the percentage of total items retrieved while changing c in Table 3. Note that more than 92% of publications are retrieved within the top 10 positions, while only 58% and 10% of the items from the dataset and variable categories respectively are retrieved within the same rank cut-off. Increasing the value of c, it is noticed that more than 98% of the documents are retrievable within the top 100 ranked documents by all the queries for both publications and datasets. The significant change in the percentage of retrieved documents of type dataset indicates that searching for datasets is more complex than searching for publications; a deeper traversal of the ranked list might be essential to find a relevant dataset. Note that only half of the items from the variable category (specifically 50.43%) are retrieved within the top 100 positions, although the Gini value indicates more balance in retrievability (G = 0.4806). This leads to an interesting observation: as reported in Table 2, the average retrievability scores for variables are significantly smaller (r(d) = 3.67 at cut-off 100), so the difference between not being retrieved (having r(d) = 0) and being retrieved with the average retrievability score is merely a small value. Due to this seemingly inconsequential difference in r(d) scores, the Gini coefficient is not affected significantly. However, these variables, which are not retrieved at all, lower the percentage of retrieved items.
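The interplay between the rank cut-off c, the retrievability scores and the percentage of retrieved items can be illustrated with a small sketch. It assumes a cumulative measure with unit query weights, that is, a document gains one point for every query that ranks it within the top c positions; the function names and the toy rankings are hypothetical and not taken from the GESIS Search logs.

```python
from collections import Counter


def retrievability(ranked_lists, c):
    """r(d) under the assumed cumulative measure: the number of queries
    that rank d within the top c positions (each query has weight 1)."""
    scores = Counter()
    for ranking in ranked_lists.values():
        for doc in ranking[:c]:
            scores[doc] += 1
    return scores


def coverage(scores, collection):
    """Fraction of the collection retrieved at least once (r(d) > 0)."""
    retrieved = sum(1 for doc in collection if scores[doc] > 0)
    return retrieved / len(collection)


# Hypothetical collection and ranked result lists for three queries.
collection = ["d1", "d2", "d3", "d4", "d5"]
ranked_lists = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d1", "d3", "d4"],
    "q3": ["d2", "d1", "d5"],
}

for c in (1, 3):
    r = retrievability(ranked_lists, c)
    print(c, dict(r), coverage(r, collection))
```

In this toy example only two of the five items are reachable at c = 1, while all of them are reachable at c = 3, which mirrors the growth in the percentage of retrieved items reported in Table 3 as c increases.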

Comparing influence of query popularity bias
Considering a real-life query log, there is an obvious possibility of having more than one entry for popular queries. While computing the retrievability, the items retrieved by those repeated queries get a boost in their retrievability scores due to the popularity bias of the queries. To understand the influence of this query popularity bias, in this section we report the relationship between the retrievability scores of the items computed with i) Q_r, the interaction log containing repeated queries, and ii) Q_u, the query log with only the unique queries (see footnote 13). In particular, we report how disjoint the sets of documents with the highest retrievability scores are when the retrievabilities are computed with the two types of queries separately. If the documents are ordered by their retrievability scores, we get two individual ranked lists of documents, one each for Q_r and Q_u. In order to compare and contrast the lists produced by the two query sets, we adopt three ways to quantify the difference:

• Set-based: We compute the Jaccard's coefficient between the two lists ranked by their retrievability scores up to different rank cut-offs. In particular, the first 1K, 5K, 10K, 20K and 50K top-ranked items are considered and their set-based overlap is computed. The results are reported in Table 4. From the results, we can see that the overlap among the items having the highest 1K retrievability scores is 10% and 12% respectively for the categories publication and dataset. However, around 31% overlap is observed for the variable category among the top 1K items. The Jaccard's coefficient changes swiftly for all the categories when a higher number of items is considered. This indicates that the diversity between the two types of ranked lists is significant for all three categories of items.

• Correlation-based: Further, we compare the two ranked lists in terms of their correlations. Based on the discordant and concordant pairs, we compute the Kendall's τ correlation coefficient. Additionally, the Spearman's rank correlation is also assessed and reported in Table 5 for all three categories. Considering these measures, we note that the rank correlations indicate an imperceptible relation between the two lists for all of the types, while the most diverse results are observed in the case of the publication category. For variables, the correlations are noted to be higher as compared to the other types, whereas they are negligible for the other types.

• Rank overlap-based: The correlation-based measures suffer from certain limitations: the lists need to be conjoint, and the measurement does not consider the position where the disagreements occur; that is, the measure does not discriminate between a mismatch at the top positions and one at later positions. As an alternative, Webber et al (2010) proposed the rank-biased overlap measure (RBO), which weights the differences by the position at which they occur. Mathematically, the RBO between two ranked lists S and T is computed as:

\[ RBO(S, T, p) = (1 - p) \sum_{d=1}^{\infty} p^{d-1} A_d \]

In this equation, d is the depth of the list, p is a weighting factor (between 0 and 1) and A_d is the number of common items at depth d divided by the depth d itself. Following Webber et al, we have set the weight parameter p to 0.9. The RBO-based similarity between the two types of results is reported in Table 5. Again, it is clear from the results that the dissimilarities between the ranks of the items based on their retrievability scores are noteworthy, particularly for the publication and dataset categories.
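The set-based and rank overlap-based comparisons can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the code behind Tables 4 and 5: the function names and toy lists are hypothetical, and the RBO sum is truncated at the length of the shorter list, so the value is a lower bound on the full (infinite) RBO.

```python
def jaccard_at_k(list_a, list_b, k):
    """Jaccard's coefficient between the top-k items of two ranked lists."""
    top_a, top_b = set(list_a[:k]), set(list_b[:k])
    union = top_a | top_b
    if not union:
        return 1.0  # two empty prefixes are treated as identical
    return len(top_a & top_b) / len(union)


def rbo_truncated(list_s, list_t, p=0.9):
    """Truncated rank-biased overlap.

    Computes (1 - p) * sum_{d=1..D} p^(d-1) * A_d, where D is the length
    of the shorter list and A_d is the overlap of the two depth-d
    prefixes divided by d. Recomputing the intersection at every depth
    is quadratic in D, which is acceptable for a sketch.
    """
    depth = min(len(list_s), len(list_t))
    seen_s, seen_t = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_s.add(list_s[d - 1])
        seen_t.add(list_t[d - 1])
        agreement = len(seen_s & seen_t) / d
        score += p ** (d - 1) * agreement
    return (1 - p) * score


# Hypothetical rankings of items by r(d) under Q_r and Q_u.
ranking_qr = ["d1", "d2", "d3", "d4"]
ranking_qu = ["d2", "d1", "d5", "d3"]
print(jaccard_at_k(ranking_qr, ranking_qu, 2))
print(rbo_truncated(ranking_qr, ranking_qu, p=0.9))
```

Because the weights p^(d-1) decay with depth, a disagreement at the first rank lowers the score more than the same disagreement further down the list, which is the property that motivates RBO over the correlation-based measures.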
From the dissimilarities between the two ranked lists of items of all three categories, it can be concluded that the popularity bias of queries affects the retrievability irrespective of the type. Out of the three categories, the least influence of this bias is observed for items belonging to the variable category. The retrievability of items from the publication and dataset categories is noted to be the most impacted, with less than 13% common items being observed among the top 1K.
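For completeness, the correlation-based comparison can be sketched in the same spirit. The code below is an assumption-laden illustration, not the implementation behind Table 5: it computes Kendall's τ (the tau-a variant) from concordant and discordant pairs and Spearman's rank correlation from rank differences, and it requires both lists to rank exactly the same items without ties, which is the conjointness limitation mentioned above.

```python
from itertools import combinations


def _positions(list_a, list_b):
    """Map items to their ranks and check that both lists are conjoint."""
    pos_a = {item: i for i, item in enumerate(list_a)}
    pos_b = {item: i for i, item in enumerate(list_b)}
    if set(pos_a) != set(pos_b) or len(pos_a) < 2:
        raise ValueError("lists must rank the same items (at least two)")
    return pos_a, pos_b


def kendall_tau(list_a, list_b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    pos_a, pos_b = _positions(list_a, list_b)
    concordant = discordant = 0
    for x, y in combinations(pos_a, 2):
        # Positive product: both lists order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def spearman_rho(list_a, list_b):
    """Spearman's rank correlation for rankings without ties."""
    pos_a, pos_b = _positions(list_a, list_b)
    n = len(pos_a)
    sum_sq = sum((pos_a[item] - pos_b[item]) ** 2 for item in pos_a)
    return 1 - 6 * sum_sq / (n * (n * n - 1))


# Hypothetical rankings of the same three items under Q_r and Q_u.
print(kendall_tau(["a", "b", "c"], ["a", "c", "b"]))
print(spearman_rho(["a", "b", "c"], ["a", "c", "b"]))
```

Neither measure distinguishes a swap at the top of the list from a swap at the bottom, which is why the rank overlap-based measure is reported alongside them.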

From Retrievability to Usefulness
Usefulness was introduced in Cole et al (2009) and designed initially as a criterion for the evaluation of interactive search systems. The usefulness of a document can be defined as how often the document is retrieved and exported (see Section 4.1) by the end user. Of course, usefulness can only be recognized reliably through relevance judgements submitted by the user for a given query, and the relevance of a document may also depend on the perspective of the user, which may vary across users and points in time. Without explicit relevance judgements, the usefulness of documents cannot be approximated reliably. Considering the availability of the export and utilisation information from the query log, we can define the usefulness of a document (u(d)) by the following equation:

\[ u(d) = \sum_{q \in Q} w_q \cdot g(d, q) \qquad (5) \]

In Equation 5, the weight of the query (w_q) can be defined in a similar way as in retrievability (Equation 1). The usefulness of a document may also depend on the difficulty of the query (Carmel et al, 2006; Carmel and Yom-Tov, 2010); a query can be considered difficult if the top-ranked documents are mostly non-relevant, in which case the user has to go deep down the ranked list to find a document addressing the query (Carmel and Yom-Tov, 2010). A document d should be considered more useful if it is retrieved and consumed following a query Q than any other document, say d′, with an associated query Q′ which is relatively easier than Q (i.e. difficulty(Q) > difficulty(Q′)). Hence, we extend the definition of the weight of the query, taking into account a difficulty factor, in Equation 6.
where the function h(q) represents the difficulty of the query q. The function g(·) in Equation 5 indicates usefulness in terms of the relevance of the document d for the query q. Mathematically, g(·) is defined in Equation 7 through the function rel(d, q), which indicates the relevance of d for the query q. It works in the same way as f(k(d, q), c) is defined in Equation 2, considering a binary relevance (that is, d can be either relevant, rel(d, q) = 1, or non-relevant, rel(d, q) = 0, to the query q).
Informally speaking, the usefulness of a document can be stated as the number of queries for which it is exported (i.e. consumed) by the user. Considering a SERP without any duplicate documents, the usefulness can be further simplified as the count of exports of the document.
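The informal reading above can be sketched directly. The snippet assumes a hypothetical log of (query identifier, exported document) pairs, treats an export as binary relevance (the document counts once per query it was exported for), and uses a unit query weight by default; the `weight` parameter is an illustrative hook for a difficulty-aware weight and not a form taken from the paper.

```python
from collections import defaultdict


def usefulness(export_events, weight=lambda query_id: 1.0):
    """u(d): weighted number of queries after which d was exported.

    export_events is an iterable of (query_id, doc_id) pairs. Repeated
    exports of the same document for the same query are counted once,
    matching the assumption of a SERP without duplicate documents.
    """
    scores = defaultdict(float)
    for query_id, doc_id in set(export_events):
        scores[doc_id] += weight(query_id)
    return dict(scores)


# Hypothetical export interactions extracted from an interaction log.
events = [
    ("q1", "d1"),
    ("q1", "d2"),
    ("q2", "d1"),
    ("q2", "d1"),  # repeated export for the same query, counted once
]
print(usefulness(events))
```

With the default weight every query contributes equally, so the score reduces to a plain count of distinct queries for which the document was exported.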

Experimentation
As presented and argued earlier in Section 5, the signal of document consumption by the user is essential in order to compute the usefulness of documents. We utilize the information stored in the interaction log of the integrated search system GESIS Search as the indication of document consumption by the user. In particular, the usefulness is determined on the basis of implicit relevance feedback from the export interactions (see Section 4.1). The difficulty of the query is kept constant (h(q) in Equation 6 is set to 1) in this study; further study in this regard is left as part of future work.

Observation and analysis
The experimental results on usefulness are presented graphically in Figure 6, where the Lorenz curves are displayed for the usefulness of the documents of type publication, dataset and variable. The straight line going through the origin (in black) indicates equality, that is, when all the documents are equally useful. The blue (Figure 6a) and orange (Figure 6b) curves respectively specify the publication and dataset, while variable is indicated by the green curve (Figure 6c).
From Figure 6c, we can observe that the usefulness of variables is closer to being equally distributed than that of the other types. In comparison, the corresponding distribution for datasets (presented in Figure 6b) is observed to be more skewed, with an evident inclination towards certain items being more useful. The corresponding Gini coefficients of the distributions are presented in Table 6, where the value of G for the usefulness of datasets is seen to be almost three times greater than that for the variables. The difference between publications and datasets is also evident. This observation clearly highlights that a few datasets are more useful than the rest, whereas the usefulness distribution of the variables is considerably close to being uniform. In the case of publications, the distribution is also observed to be similar to that of variables, which is close to uniformity.
                   Publication   Dataset   Variables
Gini coefficient   0.3160        0.8031    0.2876

Table 6: The Gini coefficient computed with the distribution of usefulness of the publication, dataset and variables. A higher Gini coefficient (upper bound 1.0) indicates an uneven distribution of usefulness.

Conclusion and future work
In Roy et al (2022), we reported a significant difference in the retrievability of items belonging to various categories in the integrated search system GESIS Search. We particularly focused on the types publication and dataset and concluded that the retrievability scores differ significantly depending on whether an item belongs to the publication or the dataset category. As an extension to that work, we have included another category in the study of retrievability, namely variables. Along with that, we have used a newer and larger version of the interaction logs for our experimentation. A noticeable difference in the experimental setup from our earlier work is that we have used a deduplicated version of the log. That is, only the unique queries from the interaction log are considered, excluding any repeated entries. This avoids any query popularity bias, which may influence the retrievability of the items.
In this extended study, we observe similar phenomena on the newer data as well as on the variable type. In response to RQ1, we have seen a significant popularity bias, with certain items being retrieved more often than others. In particular, it has been shown that certain items from the dataset category are more likely to be retrieved than the other items in the same category. In contrast, the retrievability scores of items from the variable or publication types are more evenly distributed. For RQ2, the intra-document selection bias is formalized using the common measures of the Lorenz curve and the Gini coefficient. In response to RQ3, we have observed that the distribution of document retrievability is more diverse for datasets as compared to publications. This can again be attributed to the popularity bias of certain items in the dataset category. The earlier study used an interaction log without any deduplication of queries; as a result, the items retrieved for the popular queries (occurring frequently in the log) gained a boost in the computed retrievability scores. In this paper, we have further included an explicit discussion and comparison of the retrievability scores of items in different categories when the query popularity bias is factored out by the deduplication of the queries. In this connection, as a response to RQ4, we showed that there can be a positive influence of the query popularity bias on the distribution of the retrievability scores.
Further study on the measurement of usefulness (proposed in our earlier work (Roy et al, 2022)) reveals a prominent diversity in the nature of consumption of items among the different types. We notice that variables are close to equality in usefulness, whereas the usefulness is significantly disparate in the publication and dataset categories. Additionally, we have proposed a measurement of the usefulness of documents based on the signal of document consumption by the users after submitting a query to the system. Experimenting with the variables, we observe that the usefulness of items in this category is closer to equality than that of items in the other categories.
The proposed usefulness metric indicates the popularity of a document in terms of being consumed by the users. Hence, one possible extension of this work will be to test the applicability of usefulness to improve retrieval performance. Incorporating the usefulness of documents as a feature in a learning-to-rank framework could boost the retrieval effectiveness. In terms of presenting the results (SERP) to end users, usefulness can be used as a sorting measure to organise the retrieved items based on popularity. Specifically, together with the provision of presenting the results sorted by recency or relevance, the system can also be extended to provide an ordering based on how popularly the document is viewed by the users.
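As a rough illustration of the sorting option mentioned above, a result list could be re-ordered by precomputed usefulness scores. The function name, the score dictionary and the toy SERP are assumptions made for this sketch; it is not a feature of GESIS Search. Python's `sorted` is stable, so items with equal usefulness keep their original (for example relevance-based) order.

```python
def sort_by_usefulness(serp, usefulness_scores):
    """Order a result list by descending usefulness.

    Documents without a recorded score are treated as having usefulness
    0, and ties keep the order of the incoming list.
    """
    return sorted(serp, key=lambda doc: -usefulness_scores.get(doc, 0.0))


# Hypothetical relevance-ordered SERP and usefulness scores.
serp = ["d3", "d1", "d2", "d4"]
scores = {"d1": 5.0, "d2": 5.0, "d3": 1.0}
print(sort_by_usefulness(serp, scores))
```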

Fig. 1: Screenshot of GESIS Search showing result sets for research data, publications, and variables.

Fig. 2: Screenshot of the variable description of variable QD3_1 in GESIS Search.

Fig. 3: Graphical representation of the change in various statistical measures of the observed distribution of retrievability scores. The mean, geometric mean, variance and standard deviation of the distribution of retrievability scores of publication (in blue), dataset (in orange), and variables (in green) are presented.

Fig. 5: The change in Gini coefficient when the rank cut-off is varied in the range from 10 to 100. The blue line indicates the publication while dataset is specified by the orange curve.

Fig. 6: Plotting Lorenz curves for usefulness values. The straight line going through the origin (in black) indicates equality, that is, when all the documents are equally useful. The blue (Figure 6a) and orange (Figure 6b) curves respectively specify the publication and dataset, while variable is indicated by the green curve (Figure 6c).

Table 1: Statistics of the extracted information belonging to the three selected record types.

Table 2: The mean (both arithmetic and geometric), variance and standard deviation of the retrievability values when the rank cut-off is varied.

Table 3: Change in Gini coefficient when the rank cut-off is increased. The number and percentage of retrieved documents of type publication, dataset and variable are also presented.

Table 4: The Jaccard's coefficient (set-based similarity) between the ranked lists of items obtained with the different query sets Q_r and Q_u. The first column indicates the number of top retrievable items considered to compute the similarity.

Table 5: The rank correlation-based (Kendall's τ and Spearman's r) and rank-biased overlap (RBO) similarities between the ranked lists of items obtained with the different query sets Q_r and Q_u.