In this section, we describe evaluations of our itm system. First, we describe fully automated experiments that guide how to build a system that can learn and adapt from users’ input while remaining responsive enough to be usable. This requires selecting ablation strategies and determining how long to run inference after ablation (Sect. 6.1).
Next, we perform an open-ended evaluation to explore what untrained users do when presented with an itm system. We expose our system to users on a crowd-sourcing platform, explore their interactions, and investigate the correlations they create to cultivate topics interactively (Sect. 6.2).
Our final experiment simulates the running example of a political scientist attempting to find and understand “immigration and refugee issues” in a large legislative corpus. We compare how users—armed with either itm or vanilla topic models—use these to explore a legislative dataset to answer questions about immigration and other political policies.
Simulated users
In this section, we use the 20 Newsgroup corpus (20News) introduced in Sect. 4.5. We use the default split into training and test sets, and the vocabulary is restricted to the top 5000 words.
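For concreteness, the sketch below shows one way to load this setup in Python with scikit-learn. This is not the authors’ pipeline; the stopword handling and tokenization are assumptions.

```python
# Hedged sketch (not the authors' pipeline): load 20News with its default
# train/test split and keep the 5,000 most frequent terms as the vocabulary.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vectorizer = CountVectorizer(max_features=5000, stop_words="english")
X_train = vectorizer.fit_transform(train.data)   # document-term counts
X_test = vectorizer.transform(test.data)
vocabulary = vectorizer.get_feature_names_out()
```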
Refining topics with itm is a process in which users try to reconcile their mental models with the themes discovered by the topic model. In this experiment, we posit that the users’ mental model is defined by the twenty newsgroups that comprise the dataset, e.g. “politics”, “atheism”, or “baseball”. These topics have natural associations with words. For example, the words “government” and “president” for “politics”, and “skeptic” and “reason” for “atheism”. As a user encounters more data, their mental models will become more defined; they may only have a handful of words in mind initially but will gather more words as they are exposed to data.
We can simulate these mental lexicons by extracting words from the 20News dataset. For each newsgroup, we rank words by information gain (ig)Footnote 11. We then simulate the process of building more precise mental models by gradually adding more words with high ig.
Sorting words by information gain discovers words that should be correlated with a newsgroup label. If we believe that vanilla lda lacks these correlations (because of a deficiency of the model), topics that have these correlations should better represent the collection. Intuitively, these words represent a user thinking of a concept they believe is in the collection (e.g., “christian”) and then attempting to think of additional words they believe should be connected to that concept.
For the 20News dataset, we rank the top 200 words for each class by ig, and delete words associated with multiple labels to prevent correlations from merging. The smallest class had 21 words remaining after removing duplicates (due to high overlap: 125 shared words between “religion.misc” and “christian”, and 110 shared words between “religion.misc” and “alt.atheism”), so the top 21 words for each class were the ingredients for our simulated correlations. For example, for the class “christian”, the 21 correlated words include “catholic, scripture, resurrection, pope, sabbath, spiritual, pray, divine, doctrine, orthodox”. We simulate a user’s itm session by adding a word to each of the twenty positive correlations until each of the correlations has twenty-one words.
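As a hedged illustration, the selection procedure could look like the sketch below. The one-vs-rest formulation of information gain (computed here via scikit-learn’s mutual information, which coincides with ig for binary presence features) and the tie-breaking are assumptions, not the authors’ exact procedure.

```python
# Hedged sketch of the simulated-lexicon construction: rank each newsgroup's
# words by information gain (one-vs-rest), keep the top 200 per class, drop
# words that appear in more than one class's list, and keep the top 21
# survivors per class.
import numpy as np
from collections import Counter
from sklearn.feature_selection import mutual_info_classif

def simulated_lexicons(X, y, vocabulary, n_classes=20, top_k=200, keep=21):
    binary = (X > 0).astype(int)              # word-presence indicators
    per_class = {}
    for c in range(n_classes):
        ig = mutual_info_classif(binary, (y == c).astype(int),
                                 discrete_features=True)
        per_class[c] = [vocabulary[i] for i in np.argsort(ig)[::-1][:top_k]]
    # Remove words that appear in more than one class's top-200 list.
    counts = Counter(w for words in per_class.values() for w in words)
    return {c: [w for w in words if counts[w] == 1][:keep]
            for c, words in per_class.items()}
```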
We evaluate the quality of the topic models through an extrinsic classification task. We represent a document’s features as the topic vector (the multinomial distribution θ in Sect. 3) and learn a mapping to one of the twenty newsgroups using a supervised classifier (Hall et al. 2009). As the topics form a better low-dimensional representation of the corpus, the classification accuracy improves.
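A sketch of this evaluation step is shown below. Logistic regression stands in for the supervised classifier of Hall et al. (2009); that substitution, and the variable names, are assumptions.

```python
# Hedged sketch: represent each document by its topic distribution theta
# (a K-dimensional mixture) and evaluate with a simple supervised classifier.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def topic_vector_accuracy(theta_train, y_train, theta_test, y_test):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(theta_train, y_train)
    return accuracy_score(y_test, clf.predict(theta_test))
```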
Our goal is to understand the phenomena of itm, not classification, so the classification results are well below state of the art. However, adding interactively selected topics to state of the art features (tf-idf unigrams) gives a relative error reduction of 5.1 %, while adding topics from vanilla lda gives a relative error reduction of 1.1 %. Both measurements were obtained without tuning or weighting features, so presumably better results are possible.
We set the number of topics equal to the number of categories, with the goal that the final twenty topics capture the “user’s” desired topics, covering the categories as well as additional related information. While this is not a classification task and is not directly comparable with state-of-the-art classifiers such as SVMs, we expect it to perform better than the Null baseline, as Figs. 10 and 11 show.
This experiment is structured as a series of rounds. Each round adds an additional correlation for each newsgroup (thus twenty words are added to the correlations per round, one per newsgroup). After a correlation is added to the model, we ablate topic assignments according to one of the strategies described in Sect. 5.1, run inference for some number of iterations, extract the new estimate of the per-document topic distribution, learn a classifier on the training data, and apply that classifier to the test data. We do 21 rounds in total, and the following sections investigate the choice of number of iterations and ablation strategy. The number of lda topics is set to 20 to match the number of newsgroups. The hyperparameters for all experiments are α = 0.1, β = 0.01 for uncorrelated words, β = 100 for positive correlations and β = 10⁻⁶ for negative correlations.
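The round structure can be summarized by the skeleton below. The model methods and the ablate/infer/classify callables are hypothetical placeholders for the tree-prior Gibbs sampler of Sect. 5, not a runnable implementation of it; the per-class word lists are assumed to come from a procedure like the earlier selection sketch.

```python
# Skeleton of one simulated itm session (hypothetical model API).
# Hyperparameters follow the text: alpha = 0.1, beta = 0.01 for uncorrelated
# words, beta = 100 for positive correlations, beta = 1e-6 for negative ones.
ALPHA, BETA, BETA_POS, BETA_NEG = 0.1, 0.01, 100.0, 1e-6

def simulated_session(model, lexicons, ablate, infer, classify,
                      n_rounds=21, iters_per_round=10):
    accuracies = []
    for rnd in range(n_rounds):
        for newsgroup, words in lexicons.items():
            # Grow each newsgroup's positive correlation by one word per round.
            model.set_positive_correlation(newsgroup, words[:rnd + 1])
        ablate(model)                  # Doc, Term, None, ... (Sect. 5.1)
        infer(model, iters_per_round)  # additional Gibbs sampling iterations
        theta_train, theta_test = model.document_topic_distributions()
        accuracies.append(classify(theta_train, theta_test))
    return accuracies
```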
We start the process after only 100 iterations of inference using a vanilla lda model. At 100 iterations, the chain has not converged, but such small numbers of iterations are common practice for impatient users initially investigating a dataset (Evans 2013; Carbone 2012).Footnote 12 After observing initial topics, the user then gradually updates the topics, allowing inference to continue.
Moreover, while the patterns shown in Fig. 11 were broadly consistent with larger numbers of iterations, such configurations sometimes had too much inertia to escape from local extrema. More iterations make it harder for the correlations to influence the topic assignment, another reason to start with smaller numbers of initial iterations.
Investigating ablation strategies
First, we investigate which ablation strategy best incorporates correlations. Figure 10 shows the classification accuracy of six different ablation strategies for each of 21 rounds. Each result is averaged over five different chains using 10 additional iterations of Gibbs sampling per round (other numbers of iterations are discussed in Sect. 6.1). As the number of words per correlation increases, the accuracy increases as models gain more information about the classes.
To test whether the correlations help at all, we first compare our model against a baseline without any correlations, called Null, which runs inference for a comparable number of iterations to allow a fair comparison. Because Null sees no correlations, it serves as a lower baseline for accuracy while still showing the effect of additional inference. Figure 10 shows that the Null strategy has lower accuracy than the interactive versions, especially with more correlations.
We also compare our model with non-interactive baselines: All Initial and All Full, which see all correlations a priori. All Initial runs the model for only the initial number of iterations (100 iterations in this experiment), while All Full runs the model for the total number of iterations used by the interactive version. (That is, if there were 21 rounds and each round of interactive modeling added 10 iterations, All Full would have 210 more iterations than All Initial.) All Full is an upper baseline for accuracy, since it both sees the correlations at the beginning and runs for the maximum total number of iterations. All Initial sees the correlations before the other ablation techniques do, but it has fewer total iterations.
In Fig. 10, both All Initial and All Full show a larger variance (as denoted by the bands around the average trends) than the interactive schemes. This can be viewed as akin to simulated annealing, as the interactive settings have more freedom to explore in early rounds. For topic models with Doc or Term ablation, this freedom is limited to correlated words or words related to correlated words. Since the model is less free to explore the entire space, these ablation strategies result in much lower variance.
All Full has the highest accuracy; this is equivalent to a setting in which users know all correlations a priori. This strategy corresponds to an omniscient and infinitely patient user, and neither of these properties is realistic. First, it is hard for users to identify and fix all problems at once; often smaller problems are not visible until larger problems have been corrected, which requires multiple iterations of inspection and correction. Second, this process requires a much longer waiting time, as all inference must be rerun from scratch after every iteration.
The accuracy of each interactive ablation strategy is (as expected) between the lower and upper baselines. Generally, the correlations will influence not only the topics of the correlated words, but also the topics of the correlated words’ context in the same document. Doc ablation gives more freedom for the correlations to overcome the inertia of the old topic distribution and move towards a new one influenced by the correlations.
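To illustrate why Doc ablation provides this freedom, the sketch below unassigns every token in any document containing a correlated word; the data structures are assumptions, not the authors’ implementation.

```python
# Hedged sketch of Doc ablation: documents containing any correlated word
# have all of their topic assignments forgotten, so the next sweep of Gibbs
# sampling can reassign them under the updated tree prior.
def doc_ablation(doc_tokens, topic_assignments, correlated_words):
    """doc_tokens: list of token lists; topic_assignments: parallel lists of
    topic ids (None marks an unassigned token)."""
    correlated = set(correlated_words)
    for d, tokens in enumerate(doc_tokens):
        if correlated.intersection(tokens):
            # In a full sampler, the corresponding count matrices would be
            # decremented here as well.
            topic_assignments[d] = [None] * len(tokens)
    return topic_assignments
```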
How many iterations do users have to wait?
For a fixed corpus and computational environment, the number of iterations is the primary factor that determines how long a user has to wait. While more iterations bring the sampler closer to convergence, they also mean a longer wait, so we must balance convergence against waiting time.
Figure 11 shows the effect of using different numbers of Gibbs sampling iterations between rounds. For each ablation strategy, we run 10, 20, 30, 50, or 100 additional Gibbs sampling iterations per round. As expected, more iterations increase accuracy, although improvements diminish beyond 100 iterations. With more correlations, additional iterations help less, as the model has more a priori knowledge to draw upon.
For all numbers of additional iterations, Null remains the lower baseline for accuracy, and Doc ablation clearly outperforms the other ablation schemes, consistently yielding higher accuracy. Thus, there is a benefit when the model has a chance to relearn the document context when correlations are added: Doc provides the flexibility for topic models to overcome the inertia of the old topic distribution without throwing away the old distribution entirely. The difference is greater with more iterations, suggesting Doc needs more iterations to “recover” from unassignment.
The number of additional iterations per round is directly related to users’ waiting time. According to Fig. 11, more iterations per round achieve higher accuracy at the cost of increased wait time. This is a tradeoff between latency and model quality, and the right balance may vary across users, applications, and data.
However, the luxury of having hundreds or thousands of additional iterations for each correlation would be impractical. For even moderately sized datasets, even one iteration per second can tax the patience of individuals who want to use the system interactively. Studies have shown that a long waiting time may affect cognitive load, making it harder for a user to recall what they were doing or the context of the initial task (Ceaparu et al. 2004). Based on these results and an ad hoc qualitative examination of the resulting topics, we found that 30 additional iterations of inference was acceptable; this is used in later experiments, though this number can vary in different settings.
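To make the tradeoff concrete, a back-of-the-envelope estimate follows; the per-iteration cost is an assumption, not a measured figure.

```python
# Rough latency estimate: per-round wait = iterations x seconds per iteration.
# At an assumed ~1 s per Gibbs iteration, 30 iterations keep each round's
# wait near half a minute, while 100 iterations take well over a minute.
def wait_per_round(iterations, seconds_per_iteration=1.0):
    return iterations * seconds_per_iteration

for iters in (10, 30, 100):
    print(f"{iters} iterations -> ~{wait_per_round(iters):.0f} s per round")
```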
Users in the loop
To move beyond using simulated users adding the same words regardless of what topics were discovered by the model, we needed to expose the model to human users. We solicited approximately 200 judgments from Mechanical Turk, a popular crowd-sourcing platform that has been used to gather linguistic annotations (Snow et al. 2008), measure topic quality (Chang et al. 2009; Stevens et al. 2012), and supplement traditional inference techniques for topic models (Chang 2010). After presenting our interface for collecting judgments, we examine the results from these itm sessions both quantitatively and qualitatively.
Figure 12 shows the interface used in the Mechanical Turk tests. The left side of the screen shows the current topics in a scrollable list, with the top 30 words displayed for each topic.
Users create correlations by clicking on words from the topic word lists. The word lists use a color-coding scheme to help the users keep track of which words are already in correlations. The right side of the screen displays the existing correlations. Users can click on icons to edit or delete each one. The correlation being built is also shown in its own panel. Clicking on a word will remove that word from the current correlation.
Users were not given a specific goal; instead, they were instructed to add correlations between words so that the topics (we called them “word groups” in the instructions) made more sense. This was intentionally underspecified, as we wanted to see what would happen when itm was placed in the hands of untrained users.
As in Sect. 6.1, we can compute the classification accuracy for users as they add words to correlations. The best users, who seemed to understand the task well, were able to increase the classification accuracy (Fig. 13). The median user, however, had an accuracy improvement indistinguishable from zero. Despite this, we can examine the users’ behavior to better understand their goals and how they interact with the system.
The correlation sizes ranged from one word to over forty. The more words in the correlation, the more likely it was to noticeably affect the topic distribution. This observation makes sense given our updating method. A correlation with more words will probably cause the topic assignments to be reset for more documents.
Most of the large correlations (more than ten words) corresponded to the themes of the individual newsgroups. Some common themes for large correlations were:
- Themes that matched a single newsgroup: religion, space exploration, health, foreign countries, cars, motorcycles, graphics, encryption
- Themes that spanned multiple related newsgroups: sports, government, computers, cars/motorcycles
- Themes that probably matched a sub-topic of a single newsgroup: homosexuality, Israel, computer programming.
Some users created correlations with both “baseball” and “hockey” words, while others separated them. (“baseball” and “hockey” are in separate newsgroups.) The separate correlations often contained overlapping words. Even so, the choice of combined vs. separate correlations almost always determined whether baseball and hockey would be in the same topic in the model. A similar situation occurred with “cars” and “motorcycles”, which are discussed in separate newsgroups.
Some users created inscrutable correlations, like {“better”, “people”, “right”, “take”, “things”} and {“fbi”, “let”, “says”}. They may have just clicked random words to finish the task quickly. While subsequent users could delete poor correlations, most chose not to. Because we wanted to understand broader behavior, we made no effort to squelch such responses.
The two-word correlations illustrate an interesting contrast. Some pairs are linked together in the corpus, like {“jesus”, “christ”} and {“solar”, “sun”}. With others, like {“even”, “number”} and {“book”, “list”}, the users seem to be encouraging collocations to be in the same topic. However, the collocations may not be present in any document in this corpus.
Not all sensible correlations led to successful topic changes. Many users grouped “mac” and “windows” together, but they were almost never placed in the same topic. The corpus includes separate newsgroups for Macintosh and Windows hardware, and divergent contexts of “mac” and “windows” overpowered the prior distribution.
Other correlations led to topic changes that were not necessarily meaningful. For example, one user created a correlation consisting of male first names. A topic did emerge with these words, but the rest of the words in that topic seemed random. This suggests that male first names are not associated with each other in the corpus. Preliminary experiments on newspaper articles had similar correlations that created a more meaningful topic associated with obituaries and social announcements.
Finally, many correlations depend on a user’s background and perspective, showing the flexibility of this approach. Some users grouped “israeli”, “jewish”, “arab”, and “muslim” with international politics words, and others with religion words. On the other hand, “christian” was always grouped with religion words. The word “msg” appears to have two different interpretations. Some users grouped it with computer words (reading it as a message), while others grouped it with health words (reading it as a food additive).
As mentioned in Sect. 3, topic models with a tree-based prior can represent situations where words have multiple meanings. In previous work, the paths in the tree—provided by WordNet—correspond to the distinct meanings of a word (Boyd-Graber et al. 2007). Users found the formalism intuitive enough to build their own small WordNets to distinguish the different meanings of “msg”.
User study
New systems for information access are typically investigated through task-based user studies to determine whether the new approach allows users to complete specific tasks as well as they can with current systems. Wacholder and Liu (2008), for example, compared traditional paper-based book indices with full-text search for answering questions in large text collections. Following their lead, we compare information-seeking effectiveness with interactive versus non-interactive topic modeling.
We asked users to fill the role of the running example: a political scientist attempting to find legislation relevant to “immigration and refugee issues” (among other topics). Using full-text search aided by either vanilla topic models or interactive topic models (itm), users were asked to answer questions based on content in a collection of legislative debates.
We found that users were able to answer the questions equally well in both groups: with itm (experimental group) and without itm (control group). However, users in the group using itm had radically different strategies for how they found information in the corpus. Rather than relying on full-text search, users used topic models to find relevant information.
Legislative corpus
In the process of becoming a law, potential US legislation is sponsored by a congressperson and introduced for debate by a committee in either the US House of Representatives (lower chamber) or the US Senate (upper chamber). Once introduced, the bill is debated within the chamber in which it was introduced. Our corpus contains transcripts of these debates for the 109th Congress, which served during the 2005 and 2006 calendar years.
The corpus is available online from GovTrack.Footnote 13 Each page is associated with a bill and a vote. Uninteresting procedural bills, with less than 20 % “Yea” votes or less than 20 % “Nay” votes, are removed. We selected a subset of this congressional debate dataset that includes ten bills and their associated debates. Each debate has multiple turns (a single uninterrupted speech by a unique congressperson), and we use each turn as a document for topic modeling. This yields 2,550 documents in total; we ignore all temporal, speaker-related, and legislative structure. While this is somewhat unrealistic for a real-world study of legislative information, we will use some of this discarded information to aid evaluation. The subset includes bills on immigration, the estate (death) tax, stem cell research, and others. Detailed information can be found in Appendix A.
itm interface for exploring text corpora
The itm interface is a web-based application.Footnote 14 In contrast to the interface discussed in Sect. 6.2, it provides a comprehensive interface for navigating source documents, searching, viewing topics, and modifying topics. It provides a workflow for users to select model parameters (corpus and number of topics), create an initial topic model, name the topics, and refine the topics using itm. The interface also provides multiple ways for a user to explore the corpus: a full-text search over all documents, a full-text search within a single topic, a listing of documents associated with each topic, and links to access specific documents. We walk through this workflow in detail below.
From the initial screen (Fig. 14), users specify the session information, such as user name, corpus, number of topics, etc. Once users click “start”, the interface loads the initial set of topics, including the top topic words and related documents, as shown in Fig. 15. The top topic words are displayed such that the size of a word is proportional to the probability of this word appearing in the topic.
Clicking on a topic lets users view additional information and, most importantly, edit the topic (editing is disabled for the control group). Within this view, three “bins” are visible: all, ignore, important. Initially, all of the topic words are in the “all” bin. As shown in Fig. 16, users can drag words to different bins based on their importance to the topic: words that are important to the topic go to the “important” bin, words that should be ignored in this topic go to the “ignored” bin, and words that should be stopwords in the whole corpus go to “trash”. Users can also add new words to this topic by typing the word and clicking “add”.Footnote 15
Once the user has finished editing a topic, changes are committed by pressing the “Save” button. The backend then receives the user’s feedback: the model adds a positive correlation between all words in the “important” bin, a negative correlation between words in the “ignored” bin and words in the “important” bin, and removes the words in the “trash” bin from the model. With these changes, itm relearns and updates the topics. While in principle users may update the topics as many times as they wish, our study limited a user’s exploration and modification of topics to fifteen minutes. The users then entered the next phase of the study, answering questions about the corpus.
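A hedged sketch of how this bin feedback could map onto model updates is shown below; the method names are illustrative placeholders, not the actual backend API.

```python
# Hedged sketch of translating bin feedback into model updates (illustrative
# method names): "important" words form one positive correlation, each
# "ignored" word is negatively correlated with every "important" word, and
# "trash" words are removed from the model.
def apply_topic_feedback(model, important, ignored, trash):
    if len(important) > 1:
        model.add_positive_correlation(list(important))
    for bad in ignored:
        for good in important:
            model.add_negative_correlation([bad, good])
    for word in trash:
        model.remove_word(word)      # treated as a corpus-wide stopword
    model.resume_inference()         # relearn topics under the new prior
```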
In the question answering phase (Fig. 17), users have three options to explore the data to answer the questions: by reading through related documents associated with a topic, searching through all of the documents through full-text search, or via a text search restricted to a single topic. The full-text search is important because it is a commonly used means of finding data within large corpora (Shneiderman et al. 1997) and because it has been used in previous information-seeking studies (Wacholder and Liu 2008). Initial studies, where access to the data was restricted to only topic model information, were too difficult. We expect users to use topics when they are useful and use full-text search when topics are less useful in answering a question. After each question, users click “Next question” to proceed; users cannot return to previous questions.
User population
To evaluate the effectiveness of itm for information-seeking tasks, we compare the performance of users in two groups: an experimental group (itm) and a control group (vanilla lda).
For the experimental group, users start with an initial set of topics and can refine the topics using itm for up to fifteen minutes. They then start the test phase for thirty minutes. They are provided with the refined topics for use during the test.
The control group also has access to the initial topics, but they cannot refine the topics. They are given up to fifteen minutes to check the topics, rename the topics, and review documents associated with the topics. This is to avoid experimental differences caused by the experimental group benefiting from exploring the corpus rather than from interactive topic modeling. After spending up to fifteen minutes exploring the corpus, the control group also has thirty minutes to answer the test questions.
The study participants are randomly assigned to a group. Each participant views a video explaining how to use the interface and how to complete the test. During the study, the system logs each user’s interactions. After the study, participants complete a survey on their educational/technical background and familiarity with legislation or topic models.
The study had twenty participants (ten in each group). All of the users are fluent in English. Participants are students pursuing a degree in computer science, information science, or linguistics, or are working in a related field. A post-test user survey revealed that most users have little or no knowledge about congressional debates and that users have varied experience with topic models.
We designed ten free-response questions by exploring this legislative corpus, including questions about legislation dealing with taxes, the US-Mexico border, and other issues. The full text of the questions appears in Appendix B.
User study analysis
We examined two aspects of the experiment: how well the experimental group’s final topics replicated ground-truth annotations (below, we refer to this metric as refine) and how well both the groups answered the questions (test).
Our experiment views the corpus as an unstructured text collection (a typical use case of topic models); however, each turn in the dataset is associated with a single bill. We can view this association as the true clustering of the dataset. We compare this clustering against the clustering produced by assigning each document to a cluster corresponding to its highest-probability topic.
We compare these reference clusters to the clusters produced by itm using variation of information (Meilă 2007), an information-theoretic “distance” between two partitions that ranges from zero to infinity (lower is better). While the initial set of topics is already good (its variation of information score is low), users in the experimental group, who claimed to have little knowledge about the legislative process, were still able to reduce this score by refining the topics. To avoid biasing their behavior, users were not told that their topics would be evaluated with variation of information.
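For reference, variation of information can be computed from the two labelings as in the sketch below (standard definition, not the authors’ code); documents are first assigned to their highest-probability topic.

```python
# Hedged sketch: variation of information between two partitions,
# VI(X, Y) = H(X|Y) + H(Y|X); lower is better, zero means identical clusters.
import numpy as np
from collections import Counter

def variation_of_information(labels_a, labels_b):
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # Add -p(a,b) * [log p(a|b) + log p(b|a)] for this cell.
        vi -= p_ab * (np.log(p_ab * n / count_b[b]) +
                      np.log(p_ab * n / count_a[a]))
    return vi

# Usage: clusters = theta.argmax(axis=1); variation_of_information(bills, clusters)
```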
As shown in Fig. 18, the ten users in the experimental group started with the same initial topics and refined them over multiple rounds. In the given fifteen minutes, some users played with itm for up to eight rounds, while one user only tried two rounds. Although users occasionally increased the variation of information, by the end of the refinement phase a majority of users had successfully reduced the variation of information of the topics.
User “x2” provides an example of a successful itm round. The user saw a topic mixing “energy”-related words with other words. To make a coherent topic about “energy”, they put “oil”, “natural gas”, “gas”, “production” and “resources” in the important bin, and put “patriot_act”, “federal_government”, “tax_cuts”, “stem_cell” in the ignored bin. After updating, this topic became a coherent topic about “energy”. After refining topics for eight rounds, they had made other topics more coherent as well; they named these topics “homeland security”, “immigration”, “abortion”, “energy”, “flag burning”, etc., which match well with the corpus’s true clusters. Thus this user successfully reduced the variation of information, as shown in Fig. 18.
In addition to evaluating the variation of information for the experimental group, we also evaluated the users’ answers to content-specific questions. While the difference between the groups’ performance was not statistically significant, itm changed the usage pattern to favor topic models over full text search.
To evaluate the test, we graded the answers and compared the scores of users in the two groups. Of the 20 participants, two did not use their session names correctly, meaning the interface did not store their answers properly, and one user encountered an issue and was not able to finish the questions. Thus we have complete answers for 17 participants. Each question was graded by two graders, with a Scott’s π agreement of 0.839 (Artstein and Poesio 2005). While there is no significant difference between the two groups’ test scores, the scores for the experimental group had much smaller variance than those of the control group.
To better understand how users answer the questions, the itm system logs the number of full-text searches that include words from any of the topics (queried-topic-words) and the number of times that users used topics to filter query results (query-in-topic).
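A sketch of how these two statistics might be computed from the logs follows; the log format and function names are assumptions.

```python
# Hedged sketch of the two logged statistics: queries whose text contains any
# topic word, and queries issued from within a single topic's filtered view.
def query_statistics(query_log, topic_words):
    """query_log: iterable of (query_string, topic_id_or_None) pairs;
    topic_words: dict mapping topic id to its top words."""
    vocab = {w.lower() for words in topic_words.values() for w in words}
    queried_topic_words = sum(
        any(tok.lower() in vocab for tok in query.split())
        for query, _ in query_log)
    query_in_topic = sum(topic is not None for _, topic in query_log)
    return queried_topic_words, query_in_topic
```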
The process of modifying topics inspired users in the experimental group to use queries that included words from the topics (Fig. 19); this may be because users learned more key terms while exploring and refining the topics. These topic words are helpful for answering questions: users in the experimental group queried topic words an average of 27.8 times, while the control group queried topic words 18.2 times on average. Users in the experimental group also used “query-in-topic” (restricting a full-text search to a single topic) more than users in the control group, probably because they were working with refined topics that were better aligned with the underlying bills (several questions were about specific bills).
We also found that users in both groups clicked on topics much more when a question concerned general understanding of the dataset, for example, “Name 5 of the debated legislation in this data set.” For more detailed questions such as “The Gulf of Energy Security act will provide revenue streams for which fund?”, users in both groups preferred querying the full text directly.
However, Fig. 19 shows a large variance, so we should not overstate these results. In the conclusion, we discuss additional studies that could untangle the usefulness of topic models for information seeking from other effects, such as how familiar users are with topic models, whether they understand the task clearly, and whether they are effective consumers of information.
Some users in the control group also performed very well. For example, user “x5” in the control group obtained a high score. During the initial fifteen-minute exploration phase, this user clicked on topics to review documents 71 times, substantially more than any user in either group. Users such as “x5”, who are topic-model savvy, have better intuitions about how topic models work and how they can be used to help explore a corpus. In the post-session survey, the user reported that the interface, designed to facilitate itm (but disabled for the control group), helped them understand the corpus and answer the questions.
Not all users in the experimental group performed well on the task. One user only refined two topics, and some users failed to improve the topics (failed to reduce the variation of information). Some users complained that they weren’t given enough time to update the topics.
In general, most reported liking the interface. Users from both the experimental group and the control group commented that the topics helped them answer some of the questions. Some users also commented that some of the questions were too detailed, suggesting that perhaps additional methods to search the corpus may be helpful.
This study provides evidence that the itm interface assists users in exploring a large corpus and that topic modeling is helpful for users attempting to understand legislative documents. Users used itm to improve the initial clusters; this is especially promising, as these users had little background knowledge of congressional debates and few had familiarity with topic models.