Information Retrieval, Volume 17, Issue 5–6, pp 520–544

Dealing with temporal variation in patent categorization

  • Eva D’hondt
  • Suzan Verberne
  • Nelleke Oostdijk
  • Jean Beney
  • Cornelius Koster
  • Lou Boves
Information Retrieval in the Intellectual Property Domain

Abstract

In this paper, we quantify the existence of concept drift in patent data and examine its impact on classification accuracy. When developing algorithms for classifying incoming patent applications with respect to their category in the International Patent Classification (IPC) hierarchy, a temporal mismatch between training data and incoming documents may degrade classification results. We measure the effect of this temporal mismatch and aim to tackle it by optimal selection of training data. To illustrate the various aspects of concept drift at the IPC class level, we first perform quantitative analyses on a subset of English abstracts extracted from patent documents in the CLEF-IP 2011 patent corpus. In a series of classification experiments, we then show the impact of temporal variation on the classification accuracy of incoming applications. We further examine which training data selection method, combined with our classification approach, yields the best classifier, and how combining different text representations may improve patent classification. We found that using the most recent data is a better strategy than static sampling, but that extending a set of recent training data with older documents does not harm classification performance. In addition, we confirm previous findings that using 2-skip-2-grams on top of the bag of unigrams consistently improves patent classification. Our work is an important contribution to the research into concept drift for text classification, and to the practice of classifying incoming patent applications.
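As an illustration of the text representation mentioned above, the following is a minimal sketch of combining a bag of unigrams with 2-skip-2-grams. It is not the authors' implementation: the function names `skip_bigrams` and `unigram_skipgram_features`, the whitespace tokenization, and the reading of "2-skip" as "at most two tokens skipped between the pair, adjacent pairs included" are all assumptions made for this example.

```python
from collections import Counter


def skip_bigrams(tokens, max_skip=2):
    """Return ordered token pairs with at most `max_skip` tokens skipped between them.

    Assumption: adjacent pairs (zero skips) are included.
    """
    pairs = []
    for i, left in enumerate(tokens):
        # The right-hand token may sit 1 .. max_skip + 1 positions after the left one.
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


def unigram_skipgram_features(text, max_skip=2):
    """Count unigrams (string keys) and skip-bigrams (tuple keys) in one feature bag."""
    # Hypothetical preprocessing: lowercase and split on whitespace.
    tokens = text.lower().split()
    counts = Counter(tokens)
    counts.update(skip_bigrams(tokens, max_skip))
    return counts


if __name__ == "__main__":
    for feature, count in unigram_skipgram_features("gas turbine rotor blade cooling").items():
        print(feature, count)
```

Keeping the pairs as tuples rather than joined strings avoids collisions with unigrams that happen to contain the joining character.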

Keywords

Concept drift, Patent classification, Text representation

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Eva D’hondt (1)
  • Suzan Verberne (1)
  • Nelleke Oostdijk (1)
  • Jean Beney (2)
  • Cornelius Koster (1)
  • Lou Boves (1)

  1. Radboud University Nijmegen, Nijmegen, The Netherlands
  2. Université de Lyon, Lyon, France
