On the Usability of Random Indexing in Patent Retrieval

Part of the Lecture Notes in Computer Science book series (LNCS, volume 8577)

Abstract

Statistical semantics methods are fairly controversial in the IR community, mostly because they are unstable and difficult to debug. At the same time, they are extremely tempting, perhaps in the same way that Artificial Intelligence was in the 1960s. Back then, it took a few decades for the hype to pass and for us to learn the real utility and limits of the great technologies developed earlier. This paper takes an exhaustive view of the performance and utility of one particular statistical semantics method, Random Indexing, in the context of difficult texts. After over a year of CPU time spent on experiments, we provide a global view of the behaviour of the method on a particularly challenging test collection based on patent data. Finally, we observe interesting patterns emerging in the semantic space created by the method, which we hypothesize to be the cause of the behaviour observed in the experiments.



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Institute of Software Technology and Interactive Systems, Vienna University of Technology, Wien, Austria
