Easing Legal News Monitoring with Learning to Rank and BERT
While ranking approaches have made rapid advances in Web search, systems that cater to the complex information needs of professional search tasks are not widely developed; solutions to common issues typically rely on dedicated search strategies backed by ad-hoc retrieval models. In this paper we present a legal search problem in which professionals monitor news articles with constant queries on a periodic basis. First, we demonstrate the effectiveness of traditional retrieval models compared with Boolean search that returns documents in chronological order. In an attempt to capture the complex information needs of users, we then adopt a learning to rank approach with user-specified relevance criteria as features. This approach, however, achieves only mediocre results compared to the traditional models. In contrast, we find that fine-tuning a contextualised language model (e.g. BERT) yields significantly improved retrieval performance, providing a flexible solution for satisfying complex information needs without explicit feature engineering.
Keywords: Professional search · Complex information needs · BERT
In information retrieval (IR), there has been a long-standing interest in professional search, as demonstrated by various TREC tracks dedicated to a diverse range of professional domains [4, 8]. Unlike traditional Web search, an important characteristic of professional search is the complex information needs of the professional users. For instance, a professional user may ask for information that falls within a certain time range and is written to a professional standard.
Although there have been ongoing discussions and studies calling for search systems addressing common issues faced by professional search, solutions typically rely on dedicated databases or specialised search strategies that are backed by ad-hoc retrieval models [20, 23]. Meanwhile, although traditional retrieval models as well as learning to rank (L2R) approaches have made rapid advances in Web search, retrieval models that cater to the diverse requirements in professional search tasks are not widely developed.
In this paper, we study a case in the context of legal professional search and investigate how different retrieval approaches can be employed to address the complex needs of professional users. The work task of our users is to monitor a number of legal topics of interest in the news and periodically select articles to be included in a report according to a set of clearly defined criteria, ranging from topical relevance to language quality. As in many other professional search scenarios, our users set up their searches against a news stream with complex Boolean queries, where results are ranked in chronological order, and they deem recall an important metric as they do not want to miss relevant articles.
RQ1. Do traditional IR models help our users in identifying relevant documents more effectively compared to the Boolean search practice?
RQ2. Can we provide better results by adopting an L2R approach to satisfy users’ complex information needs beyond topicality?
RQ3. Can we further improve the quality of the search results by fine-tuning a pre-trained language model on our search task?
Unlike simulation-based studies such as TREC tasks, where the information needs, relevance criteria and judgements are set by different parties rather than the actual users, the complex information needs in our study come from real users, who also define the relevance criteria and provide the judgements. Our study not only reveals the practical challenges for professional search systems, but also demonstrates possible solutions that address these challenges effectively.
We also contribute to the generic solution to professional search (i.e. search with complex information needs). Our experiments show the potential of employing pre-trained contextualised language models to learn relevance criteria without handcrafted features, which leads to a flexible solution that adapts to varying complex needs.
In this case study, our users have three relevance criteria: topical relevance, factual information, and language quality. Specifically, the topic of a retrieved article must be associated with a specific legal area (footnote 1); only factual articles are considered relevant; and articles that are written in technical language and are linguistically accurate are preferred.
We first explore the effectiveness of traditional models in satisfying the users’ needs. We include four models: TF-IDF, BM25, and a unigram Language Model (LM) with either Jelinek-Mercer or Dirichlet smoothing, each applied to three fields of the news articles (title, summary, and content). For the query, we extract the keywords from the complex Boolean queries that our users created and concatenate them into one long query (typically \(\sim\)100 words), ignoring negated terms.
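As an illustration of the two steps described above, the following minimal sketch flattens a Boolean query into keywords and scores a document with the two smoothed unigram language models. The Boolean syntax (AND/OR/NOT, parentheses, quoted phrases), the function names, and the default values of `mu` and `lam` are assumptions made for this example and are not taken from the study. Only the single term directly after NOT is dropped, which is a simplification.

```python
import math
import re
from collections import Counter

BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}  # assumed operator set


def flatten_boolean_query(boolean_query):
    """Turn a Boolean query into a flat keyword list, ignoring negated terms."""
    # Quoted phrases are kept as one token; parentheses are discarded.
    tokens = re.findall(r'"[^"]+"|[^\s()]+', boolean_query)
    keywords, skip_next = [], False
    for token in tokens:
        if token == "NOT":
            skip_next = True           # drop the term that follows NOT
        elif token in BOOLEAN_OPERATORS:
            continue
        elif skip_next:
            skip_next = False
        else:
            keywords.extend(token.strip('"').lower().split())
    return keywords


def dirichlet_score(query_terms, doc_terms, collection_tf, collection_len, mu=2000.0):
    """Query log-likelihood under a Dirichlet-smoothed unigram model."""
    doc_tf, doc_len = Counter(doc_terms), len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0.0:
            continue  # term never occurs in the collection: skip it
        score += math.log((doc_tf[term] + mu * p_coll) / (doc_len + mu))
    return score


def jelinek_mercer_score(query_terms, doc_terms, collection_tf, collection_len, lam=0.5):
    """Query log-likelihood under Jelinek-Mercer smoothing.

    Here lam is the weight of the collection model (conventions differ).
    """
    doc_tf, doc_len = Counter(doc_terms), len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0.0:
            continue
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        score += math.log((1.0 - lam) * p_doc + lam * p_coll)
    return score
```

Both scoring functions take the same flattened query, so the same keyword list can be reused across the title, summary, and content fields.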
In order to estimate the relevance of a document with respect to the combined relevance criteria described above, we employ an L2R approach and encode these criteria as features. We devise 28 features (see Table 1) as follows.
Topical Relevance. We model topical relevance using the outputs of traditional retrieval models, as is usually done in the literature.
Factual Information. We model factual information with three types of features: (1) Subjectivity: it measures the degree of subjectivity of an article, which is directly related to the “factuality” of the article. (2) Modality: it shows the degree of certainty of the statements of an article by looking at the verb tense in which the article is written. (3) Sentiment: it provides the degree of negativity or positivity of the language used; while not directly related to the factuality dimension, there can be entanglement between the subjective and opinionated dimensions. We employ a lexicon-based approach to compute these features [15, 22]. We also include the number of lexicon entries found in an article as a normalisation factor, since articles differ in how much of the lexicon they cover.
Table 1. Features for learning to rank (28 in total).
Retrieval model scores (12): TF-IDF, BM25, LM (J-M), and LM (Dir) applied to an article’s title, summary, and content
Subjectivity (1): Degree of objectivity vs. subjectivity
Modality (1): Degree of certainty of the statements
Sentiment (4): Negative, positive, neutral, and compound scores
Lexicon count (1): # of the lexicon’s vocabulary that appear in the article
Readability (9): Kincaid index, Readability Index, Coleman-Liau index, Flesch index, Gunning Fog index, Läsbarhets index, McLaughlin’s SMOG index, John Anderson’s index, and Dale-Chall index
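To make the layout of the 28 features concrete, the sketch below fixes one possible ordering of the feature vector that would be handed to a learning to rank model. The group sizes are inferred from the feature descriptions (12 + 1 + 1 + 4 + 1 + 9), and all identifiers are hypothetical names chosen for this example.

```python
# Hypothetical feature names; only the groups and their sizes follow the text.
FEATURE_GROUPS = {
    "retrieval_scores": [
        f"{model}_{field}"
        for model in ("tfidf", "bm25", "lm_jm", "lm_dir")
        for field in ("title", "summary", "content")
    ],
    "subjectivity": ["subjectivity"],
    "modality": ["modality"],
    "sentiment": ["negative", "positive", "neutral", "compound"],
    "lexicon_count": ["lexicon_count"],
    "readability": [
        "kincaid", "readability_index", "coleman_liau", "flesch", "gunning_fog",
        "lasbarhets", "smog", "anderson", "dale_chall",
    ],
}

# Flat, fixed ordering used for every query-document pair.
FEATURE_NAMES = [name for group in FEATURE_GROUPS.values() for name in group]


def to_feature_vector(values):
    """Order a {feature name: value} mapping into the fixed layout."""
    missing = [name for name in FEATURE_NAMES if name not in values]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(values[name]) for name in FEATURE_NAMES]
```

Keeping the ordering in one place avoids silent misalignment between the training and test feature matrices.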
Apart from devising task-specific features, we exploit a pre-trained contextualised language model to automatically learn the complex relevance criteria. By fine-tuning the model on our search task we expect it to associate these language features with the relevance judgements. We use BERT, which shows state-of-the-art performance on a wide range of NLP tasks [10, 25]. Inspired by the work of MacAvaney et al., we employ BERT in its regression form (which they refer to as Vanilla BERT). Specifically, the input consists of a query-document pair, and the output is a predicted relevance score. For the document input, we use (i) a combined title and summary field (referred to as BERT on summary), and (ii) the content of the article (referred to as BERT on content).
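A minimal sketch of how a query-document pair is packed into a single BERT input sequence is given below. The `[CLS] query [SEP] document [SEP]` layout and the 512-token limit are standard for BERT; truncating the document to fit the limit is an assumption of this example, since the text does not say how long articles are handled, and pre-split token lists stand in for WordPiece tokenisation.

```python
MAX_SEQ_LEN = 512  # maximum input length of the original BERT models


def pack_pair(query_tokens, doc_tokens, max_len=MAX_SEQ_LEN):
    """Build the token sequence and segment ids for one query-document pair.

    The document is truncated (an assumption of this sketch) so that the
    packed sequence never exceeds max_len.
    """
    budget = max_len - len(query_tokens) - 3  # room left after [CLS] and two [SEP]
    if budget < 0:
        raise ValueError("query alone exceeds the maximum sequence length")
    doc_tokens = doc_tokens[:budget]
    tokens = ["[CLS]", *query_tokens, "[SEP]", *doc_tokens, "[SEP]"]
    # Segment 0 covers [CLS], the query and the first [SEP]; segment 1 the rest.
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, segment_ids
```

In the regression setting, the representation of the `[CLS]` position would then be mapped to a single relevance score by a linear layer. With queries of roughly 100 words, about 400 positions remain for the document under this packing.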
3 Experimental Setup
Dataset. The dataset we use to evaluate our retrieval approaches comes from the interaction data of legal professionals with a news monitoring system over a one-year period. The users monitor a specific legal topic by periodically querying the news stream with that topic, and all the retrieved results are tagged with a relevance judgement for later use. Given this context, we group the data into equal intervals corresponding to the report creation times and evaluate the retrieved results per interval. The initial ranked lists were generated with the Boolean queries created by the users and ranked in chronological order. We apply the alternative ranking approaches as a re-ranking task. In total the dataset consists of 206 queries and 60,512 labelled news articles, among which 2,872 (21%) are marked as relevant. By grouping the searches into equal intervals and removing sessions with no relevant articles, we obtain 1,774 search sessions (i.e. query-results pairs). The average number of relevant articles per session varies from 1.5 to 4.4 depending on the query. We randomly split the dataset into training (80%), validation (10%), and test (10%) sets. The same setup holds for both the traditional retrieval models and the L2R approaches. We use the validation set to tune the models’ parameters.
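The session filtering and the 80/10/10 split can be sketched as follows. Splitting at the level of whole search sessions, the fixed random seed, and the data layout (a mapping from session id to labelled documents) are assumptions of this example rather than details given in the text.

```python
import random


def drop_empty_sessions(sessions):
    """Keep only sessions that contain at least one relevant article.

    sessions maps a session id to a list of (doc_id, label) pairs,
    where label is 1 for relevant and 0 otherwise.
    """
    return {
        sid: docs
        for sid, docs in sessions.items()
        if any(label for _, label in docs)
    }


def split_sessions(session_ids, train=0.8, valid=0.1, seed=13):
    """Randomly split session ids into training, validation and test sets."""
    ids = sorted(set(session_ids))
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_train = int(len(ids) * train)
    n_valid = int(len(ids) * valid)
    return (
        ids[:n_train],
        ids[n_train:n_train + n_valid],
        ids[n_train + n_valid:],  # the remainder forms the test set
    )
```

Splitting by session rather than by individual article keeps all results of one query-interval pair in the same partition, which matters because ranking metrics are computed per session.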
Evaluation Measures. We use Mean Average Precision (MAP) to train and measure the performance of the retrieval models. Since recall is important in this user task, we also use two recall-oriented metrics: R@3 (given the small number of relevant documents per search) and average Search Length (SL), which measures the amount of effort a user needs to find all relevant documents.
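The three measures can be computed per session from a ranked list of binary relevance labels, as in the sketch below. Search length is taken here as the number of irrelevant documents ranked above the last relevant one, which matches how SL is interpreted in the results discussion; the value returned for a list with no relevant document is a convention chosen for this example.

```python
def average_precision(labels):
    """Average precision of one ranked list of 0/1 relevance labels."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / hits if hits else 0.0


def recall_at_k(labels, k=3):
    """Fraction of all relevant documents found in the top k."""
    n_relevant = sum(labels)
    return sum(labels[:k]) / n_relevant if n_relevant else 0.0


def search_length(labels):
    """Irrelevant documents read before the last relevant one is reached."""
    if not any(labels):
        return len(labels)  # convention of this sketch: everything is read
    last_relevant = max(i for i, relevant in enumerate(labels) if relevant)
    return sum(1 for relevant in labels[:last_relevant] if not relevant)


def mean_over_sessions(metric, sessions):
    """Average a per-session metric, e.g. MAP from average_precision."""
    return sum(metric(labels) for labels in sessions) / len(sessions)
```

For the ranking `[0, 1, 0, 0, 1, 0]`, the average precision is (1/2 + 2/5) / 2 = 0.45, R@3 is 0.5, and the search length is 3.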
Features. The topical features take the scores generated by the traditional IR models with their optimal parameter settings. For language usage features, we use an implementation of the CLiPS pattern-en module for subjectivity and modality, and VADER to compute sentiment scores. The 9 readability features are computed using the Python Readability package.
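To give a sense of what a single readability feature computes, the sketch below implements the Automated Readability Index of Senter and Smith, which appears in the reference list. This is a stand-alone illustration with naive sentence and word splitting; it is not the implementation in the Readability package, and treating it as the "Readability Index" feature is an assumption of this example.

```python
import re


def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    # Naive segmentation: sentences end at ., ! or ?; words are alphanumeric runs.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not sentences or not words:
        return 0.0
    characters = sum(len(word) for word in words)
    return (
        4.71 * characters / len(words)
        + 0.5 * len(words) / len(sentences)
        - 21.43
    )
```

Longer words and longer sentences both raise the score, so text written in dense technical language tends to score higher than short, plain sentences.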
Table 2. Traditional models vs. Boolean search. Models are run on Title (T), Summary (S), and Content (C). All differences are statistically significant compared to the Boolean search results (paired t-test with p-value \(<0.01\)).
4 Results and Discussion
Regarding RQ1, Table 2 lists the results of the traditional retrieval models (TF-IDF, BM25, and Language Model (LM) with Jelinek-Mercer smoothing (J-M) and Dirichlet smoothing (Dir)) compared to those of the Boolean search in chronological order, i.e. the working practice of our professional users. We see that all retrieval models significantly outperform the Boolean search results on all measures. This suggests that, without further effort in terms of feature engineering and model fitting, traditional models already improve the ranking quality by capturing topical relevance. Further, we see that the best-performing field differs from model to model, suggesting that combining them in an L2R approach may be beneficial. Hereafter, we choose Dir on content, which has the best MAP score, as the baseline for the remaining experiments.
To address RQ2 and RQ3, Table 3 shows the results of LambdaMART with explicitly encoded features and BERT scores, compared to the Dir baseline.
Table 3. Results of LambdaMART with feature variants. † indicates a statistically significant difference compared to the baseline Dir with p-value < 0.05 by paired t-test (‡ when p-value < 0.01). Rows include LambdaMART (all features), BERT on summary, and BERT on content.
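The significance markers rely on a paired t-test between a system and the baseline. A minimal sketch of the test statistic is given below, assuming the pairing is over per-session scores (for example, average precision per search session); the function name is hypothetical. Obtaining the p-value requires the t distribution, for which a library routine such as `scipy.stats.ttest_rel` could be used instead.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(system_scores, baseline_scores):
    """Return the paired t statistic and its degrees of freedom."""
    if len(system_scores) != len(baseline_scores):
        raise ValueError("both systems must be scored on the same sessions")
    differences = [s - b for s, b in zip(system_scores, baseline_scores)]
    n = len(differences)
    if n < 2:
        raise ValueError("at least two paired observations are needed")
    spread = stdev(differences)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(differences) / (spread / math.sqrt(n))
    return t, n - 1
```

A positive statistic means the system scores higher than the baseline on average across the paired sessions.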
Next, we observe that BERT-based approaches significantly outperform Dir. In particular, in terms of SL, with the baseline a user would need to read on average 5.5 irrelevant documents before finding all relevant documents, while with BERT-based models this is reduced to fewer than 2, potentially improving the user experience. Moreover, compared to explicit feature engineering, fine-tuning BERT seems to have captured the user information needs in an implicit manner. This is encouraging as it not only learns the complex relevance criteria more accurately, but also provides more flexibility, as the model can be fine-tuned for use cases with different criteria without dedicated feature engineering.
The above results show promising performance of different ranking approaches in terms of off-line IR evaluation, compared to the original Boolean setup. From a user perspective, this means users may be able to confidently stop reading results after seeing a certain number of irrelevant results, which would be particularly useful when the result list is long and relevant articles are few. On the other hand, we should also be aware that as model complexity increases, model explainability and user controllability decrease, and these are the properties of Boolean search that professional users appreciate [16, 24]. For future work, we therefore find it crucial to investigate methods that explain and control complex models such as BERT.
We explored different retrieval approaches to address the complex information needs of professional users in a legal search context. We found that, compared to Boolean search, traditional retrieval models are effective in improving the ranking quality and reducing user effort in finding relevant information (e.g. measured by SL). Learning to rank with explicit feature encoding does not seem to improve easily over traditional models. However, fine-tuning a pre-trained language model (BERT) shows strong improvements over both traditional models and L2R models, with the advantage of not requiring dedicated feature encoding. In particular, our study opens up a number of research questions in the context of professional search: (i) what kind of features allow pre-trained LMs to capture the implicit information needs from users’ relevance judgements? (ii) what are the limitations of pre-trained LMs in capturing fine-grained information needs? and (iii) how does the above depend on the number and quality of the relevance judgements, particularly in the case of niche retrieval tasks?
Footnote 1: Information regarding the specific legal domain cannot be disclosed due to a non-disclosure agreement that we have with the legal professionals.
- 2. Readability. https://pypi.org/project/readability/
- 4. Guidelines for the 2011 TREC medical records track (2011). https://www-nlpir.nist.gov/projects/trecmed/2011/. Accessed 26 Aug 2019
- 5. Anderson, J.: Lix and Rix: variations on a little-known readability index. J. Read. 26(6), 490–496 (1983)
- 6. Björnsson, C.H.: Läsbarhet. Liber (1968)
- 8. Cormack, G.V., Grossman, M.R., Hedin, B., Oard, D.W.: Overview of the TREC 2010 legal track. In: Proceedings of TREC, vol. 1 (2010)
- 9. Dale, E., Chall, J.S.: A formula for predicting readability: instructions. Educ. Res. Bull. 27, 37–54 (1948)
- 10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL (2019)
- 14. Gunning, R.: The Technique of Clear Writing. McGraw-Hill, New York (1952)
- 15. Hutto, C.J., Gilbert, E.: VADER: a parsimonious rule-based model for sentiment analysis of social media text. In: Proceedings of AAAI (2014)
- 16. Kim, Y., Seo, J., Croft, W.B.: Automatic Boolean query suggestion for professional search. In: Proceedings of SIGIR, pp. 825–834. ACM (2011)
- 18. MacAvaney, S., Yates, A., Cohan, A., Goharian, N.: CEDR: contextualized embeddings for document ranking. In: Proceedings of SIGIR (2019)
- 19. Mc Laughlin, G.H.: SMOG grading - a new readability formula. J. Read. 12(8), 639–646 (1969)
- 21. Senter, R., Smith, E.A.: Automated readability index. Technical report, Cincinnati Univ. Ohio (1967)
- 22. Smedt, T.D., Daelemans, W.: Pattern for Python. J. Mach. Learn. Res. 13(Jun), 2063–2067 (2012)
- 23. Verberne, S., et al.: First international workshop on professional search. In: Proceedings of SIGIR, vol. 52, pp. 153–162. ACM (2019)
- 24. Verberne, S., He, J., Wiggers, G., Russell-Rose, T., Kruschwitz, U., de Vries, A.P.: Information search in a professional context - exploring a collection of professional search tasks. In: Proceedings of SIGIR, Paris, France, pp. 1–5 (2019)
- 25. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding (2019)