Information Retrieval, Volume 17, Issue 5–6, pp 492–519

Studying machine translation technologies for large-data CLIR tasks: a patent prior-art search case study

  • Walid Magdy
  • Gareth J. F. Jones
Information Retrieval in the Intellectual Property Domain


Prior-art search in patent retrieval is concerned with finding all existing patents relevant to a patent application. Since patents often appear in different languages, cross-language information retrieval (CLIR) is an essential component of effective patent search. In recent years, machine translation (MT) has become the dominant approach to translation in CLIR. Standard MT systems focus on generating proper translations that are morphologically and syntactically correct. Developing effective MT systems of this type requires large training resources and high computational power for training and translation. This is an important issue for patent CLIR, where queries are typically very long, sometimes taking the form of a full patent application, meaning that query translation using MT systems can be very slow. However, in contrast to MT, the focus for information retrieval (IR) is on the conceptual meaning of the search words, regardless of their surface form or the linguistic structure of the output. Thus much of the complexity of MT is not required for effective CLIR. We present an adapted MT technique specifically designed for CLIR. In this method, IR text pre-processing in the form of stop word removal and stemming is applied to the MT training corpus prior to the training phase. Applying this step leads to a significant decrease in the computational and training resource requirements of MT. Experimental application of the new approach to the cross-language patent retrieval task from CLEF-IP 2010 shows the new technique to be up to 23 times faster than standard MT for query translation, while maintaining IR effectiveness statistically indistinguishable from standard MT when large training resources are used. Furthermore, the new method is significantly better than standard MT when only limited translation training resources are available, which can be a significant issue for translation in specialized domains. The new MT technique also enables patent document translation in a practical amount of time, with a resulting significant improvement in retrieval effectiveness.
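The pre-processing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stop word lists, the suffix-stripping stemmer, the English/French language pair, and every function name below are hypothetical stand-ins chosen for illustration. A real pipeline would use full stop word lists and an established stemmer for each language before passing the corpus to MT training.

```python
# Illustrative sketch only: toy stop word lists and a naive suffix stripper
# stand in for the real IR pre-processing resources (all names are assumptions).
STOP_WORDS = {
    "en": {"the", "a", "an", "of", "is", "for", "and", "to", "in"},
    "fr": {"le", "la", "les", "un", "une", "de", "des", "est", "pour", "et"},
}
SUFFIXES = {
    "en": ("ing", "ed", "es", "s"),
    "fr": ("ements", "ement", "es", "s", "e"),
}


def naive_stem(token, lang):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES[lang]:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(sentence, lang):
    """Lowercase, drop stop words, then stem the remaining tokens."""
    tokens = sentence.lower().split()
    kept = [t for t in tokens if t not in STOP_WORDS[lang]]
    return [naive_stem(t, lang) for t in kept]


def preprocess_parallel(pairs, src_lang, tgt_lang):
    """Apply IR pre-processing to both sides of a sentence-aligned corpus.

    Pairs that become empty on either side are dropped, since they carry
    no content words for the translation model to learn from.
    """
    processed = []
    for src, tgt in pairs:
        src_tokens = preprocess(src, src_lang)
        tgt_tokens = preprocess(tgt, tgt_lang)
        if src_tokens and tgt_tokens:
            processed.append((" ".join(src_tokens), " ".join(tgt_tokens)))
    return processed


if __name__ == "__main__":
    corpus = [
        ("The method is used for coating the surfaces",
         "La methode est utilisee pour les surfaces"),
        ("of the", "de la"),  # only stop words: dropped from the output
    ]
    for src, tgt in preprocess_parallel(corpus, "en", "fr"):
        print(src, "|||", tgt)
```

Because both sides of the corpus shrink to stemmed content words, the vocabulary and sentence lengths seen during MT training are smaller, which is consistent with the reduced resource requirements the abstract reports.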


  • Cross-language patent retrieval
  • Prior-art
  • Patent search
  • Cross-language information retrieval
  • Large-data CLIR
  • Machine translation



This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (CNGL) project.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Qatar Computing Research Institute, Qatar Foundation, Doha, Qatar
  2. Centre of Next Generation Localization, School of Computing, Dublin City University, Dublin 9, Ireland
