Query Term Selection
Microsoft Corporation provided us with access to the database of the Bing log files for individual user searches during November and December 2010 and January 2011. All search engines store data from user sessions in detailed logs. The Bing logs contain recorded observations for each of the millions of Bing user queries, including, for each query: the date and time; all websites that were displayed on the SERP generated from the search; each website’s position on the SERP; and which websites were clicked. Thus, for each website that appeared in a set of search results, we know at what rank it appeared in each view and whether it was clicked on during that view.
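The information that one logged query view carries, as described above, can be sketched as follows; the class and field names here are our own illustration, not Bing's actual log schema:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for a single query view in a search log.
@dataclass
class QueryView:
    timestamp: str        # date and time of the query
    query: str            # the query term the user entered
    results: List[str]    # websites shown on the SERP, in rank order
    clicked: List[str]    # the subset of results that were clicked

view = QueryView(
    timestamp="2010-11-15T09:30:00",
    query="phone numbers",
    results=["phonenumber.com", "whitepages.com", "anywho.com"],
    clicked=["phonenumber.com"],
)

# From such a record we recover, for each website, the rank at which it
# appeared in this view and whether it was clicked during this view.
rank_of = {site: i + 1 for i, site in enumerate(view.results)}
was_clicked = {site: site in view.clicked for site in view.results}
```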
In order to isolate the impact of website relevance from that of page rank, we need query terms for which the relevance of websites to the user query remains reasonably constant during the time period of study, while the ranking of websites varies (even if only slightly). We also need to eliminate, as far as possible, other confounding influences. To find suitable data, we first categorized a list of available query terms and then eliminated the unsuitable categories until we arrived at a final list of queries.
A first type of unsuitable query is one that generates what are known as “highly monetized” results. For example, the query term “airline tickets” signals the intent to shop for airline tickets on-line and, because it is defined in generic terms, occurs with relatively high frequency. The intent to make a purchase and the high frequency make this query attractive to advertisers, and the results page is highly monetized: There is a large volume of ads. The ads distract from the algorithmic results and introduce more “noise” into the algorithmic click behavior data. In order to predict click behavior on the algorithmic results we would need to know all of the paid results as well (whose presence might well be endogenous). As a consequence, these queries are not suitable for our analysis.
A second type of unsuitable query is what is known as “superfresh”. Consider the query term “Obama approval rating”. The intent is to look for current news, and every day (sometimes every hour) a different set of websites will be most relevant and appear in the top ranks. This variability in website relevance, which we cannot directly observe and for which we cannot control, makes such query terms unsuitable for our analysis.
A third type of unsuitable query is “navigational”: the user has a prior intent to navigate to a specific website. An example of this is one of the most frequent queries—“facebook”—for which the search results display the different subpages of this website. Although a large proportion of query terms have some corresponding domain name and thus could in theory be navigational, queries become unsuitable for our purposes only when such domains regularly appear among the top results on the page.
Finally, query terms that arise from non-uniform intent across users are also unsuitable. One example is the query “eclipse”. Based on the websites that are displayed on the results page, this search has at least three possible intents: to learn about a solar or lunar eclipse, to find information about a software product that is known as Eclipse, and to search for one of the Twilight Saga books with this title (which is a teenage vampire romance novel).
Thus, we manually sorted through an extensive list of queries, and found four query terms that were suitable for our purposes. In alphabetical order these are: “Free Movies”, “Fun Games”, “Phone Numbers” and “Sports”.Footnote 3 Although some of these query terms are now monetized, none were so at the time of our study. None related to newsworthy events that might have had an impact on relevance. None were primarily navigational, and none showed significant evidence of non-uniform intent.Footnote 4
Algorithms are sometimes patented (the Google PageRank algorithm is covered by U.S. Patent No. 6,285,999) and exact formulas are held as trade secrets. However, the general characteristics of search algorithms are known. The paper that introduced Google (Brin and Page 1998) states that “Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.” The fundamental ranking techniques of a search engine algorithm depend on natural language processing of the content of websites, topological analysis of the connections between websites, and analysis of the interactions of consumers with search results, among other things.
A search engine algorithm proceeds in two steps: choosing the websites that match the query term and then putting them in ranking order. The first step uses keyword-focused measures, which examine the placement and count of the query-term words in a website name and anchor text.Footnote 5 Once the set of websites to be displayed on the SERP is determined, they are ranked using natural language techniques, static rankFootnote 6 and user behavior data, such as prior website traffic and prior CTR.
This obviously raises a concern about reverse causality: It may be previous CTRs that determine ranking rather than ranking that determines future CTRs. Based on discussions with the engineers who provided us with the Bing data, we believe that at the time of our study (11/1/2010–1/31/2011), and for our selected query terms, the Bing algorithm relied on website CTRs that were calculated over long prior periods of time, and was refreshed only occasionally. As we illustrate further below, fluctuations in the CTR over short periods of time do not seem to be a determinant in Bing ranking for the query terms that we selected.
During the study period, some instability remained in the relatively new Bing algorithm, which can cause variation in ranks and is most probably the cause of the variation in page rank in our data.Footnote 7 In addition, during this study period, the results of the Bing algorithm were not personalized to user characteristics, which further alleviates many potential data concerns.
Our sample consists of those websites that appear on Bing on the first SERP (in positions 1–10) for each of the four query terms considered. “Free Movies” resulted in views for 262 such distinct websites, “Fun Games” for 158, “Phone Numbers” for 322, and “Sports” for 996.
However, not all websites had views in all ten positions. As an illustration, Table 1 displays the top five websites (as determined by the total number of views for the time period of our analysis) for the query term “Phone Numbers”; they are displayed in the order of frequency of appearance in Rank 1.
For each of the five websites, Table 1 shows how many views each website had in each rank during our sample period, and what the website’s CTR was in each rank. For example, website phonenumber.com had 17,075 views in rank 1, and 29.5 % of those views resulted in a click-through (a CTR of 0.295). The statistics for each query term show that being in the top rank is associated with a higher CTR for each domain.
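The per-rank CTR figures of the kind reported in Table 1 are simply clicks divided by views, computed rank by rank for each website. A minimal sketch, using made-up observations rather than the Bing data:

```python
from collections import defaultdict

def ctr_by_rank(observations):
    """Per-rank CTR for one website.

    observations: iterable of (rank, clicked) pairs, one per view in which
    the website appeared. Returns {rank: clicks / views}.
    """
    views = defaultdict(int)
    clicks = defaultdict(int)
    for rank, clicked in observations:
        views[rank] += 1
        clicks[rank] += int(clicked)
    return {r: clicks[r] / views[r] for r in views}

# Illustrative observations: 3 views in rank 1 (2 clicks), 2 in rank 2 (1 click).
obs = [(1, True), (1, False), (1, True), (2, False), (2, True)]
table = ctr_by_rank(obs)  # rank-1 CTR = 2/3, rank-2 CTR = 1/2
```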
In addition, the frequency with which the top three websites appear in the top rank is also often, though not always, reflected in the ordering of their CTRs when they appear in the second rank, which suggests that some of the ranking frequency may reflect perceived website relevance. In particular, two websites—phonenumber.com and whitepages.com—are competing for the top spot on the page. Phonenumber.com has 17,075 views in rank 1 (with a top-rank CTR of 0.295) and whitepages.com has 14,652 (a CTR of 0.274): When one website is in rank 1, the other is usually displayed in rank 2. Phonenumber.com is slightly more relevant to the user query, since it is clicked on more often in nearly every rank than is whitepages.com.Footnote 8 This is consistent with the observation that phonenumber.com appears in rank 1 more often.
Tables 2, 3 and 4 present the same statistics for the other three query terms, and display broadly similar characteristics.
These data naturally raise the question of what triggers changes in ranking. In particular, we are interested in whether the data are consistent with our claim that changes in ranking are more likely to reflect random events than to have been triggered by prior changes in CTRs. To examine this further, Fig. 1 shows the time series of the daily CTR (dotted line) and the daily percentage of views in Rank 1 (solid line) for the two leading websites for the “Phone Numbers” query.
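The two daily series plotted in Fig. 1 can each be built from the view-level log records by aggregating within days. A sketch, using hypothetical (date, rank, clicked) tuples for one website:

```python
from collections import defaultdict

def daily_series(observations):
    """Daily CTR and daily share of views in Rank 1 for one website.

    observations: iterable of (date, rank, clicked) tuples, one per view.
    Returns (dates, daily_ctr, daily_share_rank1), aligned by date.
    """
    views = defaultdict(int)
    clicks = defaultdict(int)
    in_rank1 = defaultdict(int)
    for date, rank, clicked in observations:
        views[date] += 1
        clicks[date] += int(clicked)
        in_rank1[date] += int(rank == 1)
    dates = sorted(views)
    daily_ctr = [clicks[d] / views[d] for d in dates]
    daily_share_rank1 = [in_rank1[d] / views[d] for d in dates]
    return dates, daily_ctr, daily_share_rank1

obs = [("2010-11-01", 1, True), ("2010-11-01", 2, False),
       ("2010-11-02", 1, True), ("2010-11-02", 1, False)]
dates, ctr, share = daily_series(obs)
```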
Our main concern is whether the changes in CTR trigger the switch between the ranks for these websites. This does not appear to be the case. It is easy to observe the level change in CTR once a website is displayed in Rank 1 more often, and the changes in CTR appear to occur after—rather than to precede—the switch between the ranks.
However, visual inspection of Fig. 1 alone cannot settle the question. Our conjecture is confirmed by Granger causality tests that were run for both websites. The summary statistics of the sample used for the Granger causality tests can be found in Table 5, and the results of the tests are reported in Table 6. Note that the proportion of views in which the domain appears in rank 1 may sum to more than unity across websites: For instance, it would be possible for two domains each to appear in rank 1 whenever they are viewed (thus 100 % of the time), provided the two domains are never viewed on the same page.
To determine the direction of causality between daily percentage of views in which the website appears in Rank 1 and its daily CTR, we perform a Wald test for the null hypothesis that lagged values of the former can be excluded from a regression of the latter, and vice versa. For the “Phone Numbers” query, we can clearly reject the null hypothesis that prior page rank has no effect on current CTRs: The F-statistic for the exclusion of the percentage of time spent in Rank 1 from the equation for CTR is significant at 1 % for one domain and 0.1 % for the other. On the other hand, we fail to reject the null hypothesis that prior CTR has no effect on current page rank.
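The exclusion test just described can be sketched with a minimal, numpy-only implementation of the Granger-causality F-test, run on synthetic data rather than the Bing series. Here y is constructed to depend on lagged x, so the test should detect causality from x to y but not the reverse; the data-generating process and lag length are our own illustrative choices:

```python
import numpy as np

def _ssr(Y, X):
    """Sum of squared residuals from an OLS regression of Y on [1, X]."""
    X1 = np.column_stack([np.ones(len(Y)), X])
    beta, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    resid = Y - X1 @ beta
    return resid @ resid

def granger_f(y, x, p=1):
    """F-statistic for excluding p lags of x from a regression of y on
    a constant and p of its own lags (the Wald exclusion test in the text)."""
    n = len(y)
    Y = y[p:]
    lags_y = np.column_stack([y[p - k: n - k] for k in range(1, p + 1)])
    lags_x = np.column_stack([x[p - k: n - k] for k in range(1, p + 1)])
    ssr_restricted = _ssr(Y, lags_y)                                # without x lags
    ssr_unrestricted = _ssr(Y, np.column_stack([lags_y, lags_x]))   # with x lags
    df_denom = len(Y) - 2 * p - 1
    return ((ssr_restricted - ssr_unrestricted) / p) / (ssr_unrestricted / df_denom)

# Synthetic series: y_t = 0.8 * x_{t-1} + noise, with x i.i.d.
rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * x[t - 1] + 0.1 * rng.normal()

F_fwd = granger_f(y, x)  # x -> y: large F, reject exclusion of x lags
F_rev = granger_f(x, y)  # y -> x: small F, fail to reject
```

A large F_fwd together with a small F_rev is the pattern that, in our data, corresponds to rank Granger-causing CTR but not the reverse.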
For the other queries the evidence is more mixed. For “Sports” the results are similar to “Phone Numbers” but at slightly lower levels of significance (5 %). For “Fun Games” there is no evidence of Granger-causality in either direction, while for “Free Movies” there is evidence of two-way causality for one domain and none for the others.
Overall, for two query terms the evidence clearly supports the hypothesis, suggested to us by Bing engineers, that prior CTR is not used to determine the rank of the website. For the other query terms there is evidence of a possible influence of CTR on page rank for only one of the domains. On balance, the hypothesis of no reverse causality seems broadly plausible given the evidence available to us.