Information Retrieval, Volume 17, Issue 5–6, pp 492–519

Studying machine translation technologies for large-data CLIR tasks: a patent prior-art search case study

Information Retrieval in the Intellectual Property Domain

DOI: 10.1007/s10791-013-9231-6

Cite this article as:
Magdy, W. & Jones, G.J.F. Inf Retrieval (2014) 17: 492. doi:10.1007/s10791-013-9231-6

Abstract

Prior-art search in patent retrieval is concerned with finding all existing patents relevant to a patent application. Since patents often appear in different languages, cross-language information retrieval (CLIR) is an essential component of effective patent search. In recent years, machine translation (MT) has become the dominant approach to translation in CLIR. Standard MT systems focus on generating proper translations that are morphologically and syntactically correct. Developing effective MT systems of this type requires large training resources and high computational power for both training and translation. This is an important issue for patent CLIR, where queries are typically very long, sometimes taking the form of a full patent application, so query translation using MT systems can be very slow. In contrast to MT, however, the focus of information retrieval (IR) is on the conceptual meaning of the search words, regardless of their surface form or the linguistic structure of the output. Thus much of the complexity of MT is not required for effective CLIR. We present an adapted MT technique specifically designed for CLIR. In this method, IR text pre-processing in the form of stop word removal and stemming is applied to the MT training corpus prior to the training phase. Applying this step significantly reduces the computational and training resource requirements of MT. Experimental application of the new approach to the cross-language patent retrieval task from CLEF-IP 2010 shows the new technique to be up to 23 times faster than standard MT for query translation, while maintaining IR effectiveness statistically indistinguishable from that of standard MT when large training resources are used. Furthermore, the new method is significantly better than standard MT when only limited translation training resources are available, which can be a significant issue for translation in specialized domains. The new MT technique also enables patent document translation in a practical amount of time, with a resulting significant improvement in retrieval effectiveness.
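The pre-processing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The stop word list, the suffix-stripping `toy_stem` function (a stand-in for a real stemmer such as Porter or Snowball), and the names `preprocess` and `preprocess_corpus` are all hypothetical. Applying the step to both sides of the parallel corpus is likewise an assumption. The sketch only shows how sentence pairs might be reduced to stemmed content words before being passed to an MT training toolkit.

```python
import re
from typing import Callable, List, Sequence, Tuple

# Hypothetical, deliberately tiny stop word list; a real system would use
# a full list for each language involved.
STOP_WORDS = {"a", "an", "the", "of", "for", "is", "are", "and", "to", "in"}

# Toy suffix stripper standing in for a real stemmer (assumption, not the
# stemmer used in the paper). Longer suffixes are tried first.
SUFFIXES = ("ations", "ation", "ings", "ing", "es", "s")


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(sentence: str) -> List[str]:
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


def preprocess_corpus(
    pairs: Sequence[Tuple[str, str]],
    source_fn: Callable[[str], List[str]] = preprocess,
    target_fn: Callable[[str], List[str]] = preprocess,
) -> List[Tuple[List[str], List[str]]]:
    """Apply IR-style pre-processing to each side of a parallel corpus."""
    return [(source_fn(src), target_fn(tgt)) for src, tgt in pairs]


if __name__ == "__main__":
    print(preprocess("The translation of the patent claims"))
```

Under this scheme the translation model would be trained on shorter sentences with a smaller vocabulary. Queries would presumably need the same pre-processing before translation so that their tokens match the training data.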

Keywords

Cross-language patent retrieval; Prior-art; Patent search; Cross-language information retrieval; Large-data CLIR; Machine translation

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Qatar Computing Research Institute, Qatar Foundation, Doha, Qatar
  2. Centre of Next Generation Localization, School of Computing, Dublin City University, Dublin 9, Ireland
