The Hybrid Method for Accurate Patent Classification

Lobachevskii Journal of Mathematics

Abstract

This article is dedicated to stacking two approaches to patent classification. The first is a linguistically supported k-nearest neighbors algorithm that searches for topically similar documents by comparing vectors of lexical descriptors. The second is fastText, a word-embedding-based method in which the sentence (or document) vector is obtained by averaging n-gram embeddings, and a multinomial logistic regression then uses these vectors as features. We show on Russian and English datasets that the stacking classifier achieves better results than either single classifier.
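The stacking idea in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: word counts stand in for lexical descriptors, one-hot hashed character n-grams stand in for learned fastText embeddings, the task is reduced to two classes, and every name and constant here (`DIM`, `knn_prob`, `doc_vector`, `fit_logreg`, the toy documents) is hypothetical.

```python
import math
import zlib
from collections import Counter

DIM = 64  # number of hash buckets (assumed, illustrative)

def descriptors(text):
    """Word counts as a stand-in for a vector of lexical descriptors."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_prob(text, docs, labels, k=3):
    """Share of similarity mass the k nearest neighbours give to class 1."""
    q = descriptors(text)
    top = sorted(((cosine(q, descriptors(d)), y) for d, y in zip(docs, labels)),
                 reverse=True)[:k]
    total = sum(s for s, _ in top)
    return sum(s for s, y in top if y == 1) / total if total else 0.5

def doc_vector(text, n=3):
    """Average of per-n-gram vectors. Each n-gram vector is a one-hot hash
    bucket here, not a learned fastText embedding."""
    padded = f"<{text.lower()}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    vec = [0.0] * DIM
    for g in grams:
        vec[zlib.crc32(g.encode("utf-8")) % DIM] += 1.0 / len(grams)
    return vec

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_logreg(X, y, epochs=300, lr=0.5):
    """Binary logistic regression trained by stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict_logreg(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy data (invented): class 0 = mechanical, class 1 = biochemical.
base_docs = ["engine piston cylinder fuel", "combustion engine valve",
             "protein enzyme cell culture", "gene protein sequence"]
base_y = [0, 0, 1, 1]
meta_docs = ["fuel valve for an engine", "enzyme acting on a protein"]
meta_y = [0, 1]

# Base classifier 2: logistic regression over averaged n-gram vectors.
embed_model = fit_logreg([doc_vector(d) for d in base_docs], base_y)

def meta_features(text):
    """Outputs of the two base classifiers, used as meta-level features."""
    return [knn_prob(text, base_docs, base_y),
            predict_logreg(embed_model, doc_vector(text))]

# Stacking: the meta-classifier is fit on held-out base-classifier outputs.
meta_model = fit_logreg([meta_features(d) for d in meta_docs], meta_y)

def classify(text):
    """Probability of class 1 according to the stacked model."""
    return predict_logreg(meta_model, meta_features(text))

p = classify("cylinder of a combustion engine")
```

The meta-classifier is trained on documents that the base classifiers did not see, which is the usual way to keep the meta-level from simply trusting whichever base model overfits most. A real patent classifier would be multi-class and would use a proper train/validation split rather than two held-out documents.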

Acknowledgments

We are grateful to the reviewers for careful reading of the manuscript and helpful remarks.

Funding

This article presents the research results of the project “Text mining tools for big data” as a part of the program supporting Technical Leadership Centers of the National Technological Initiative “Center for Big Data Storage and Processing” at Moscow State University (Agreement with Fund supporting the NTI-projects no. 13/1251/2018 of 11.12.2018). The reported study was partially funded by the Russian Foundation for Basic Research (project no. 16-29-12929) and supported by the “RUDN University Program 5–100.”

Author information

Corresponding authors

Correspondence to V. V. Yadrintsev or I. V. Sochenkov.

Additional information

Submitted by Vl. V. Voevodin

About this article

Cite this article

Yadrintsev, V.V., Sochenkov, I.V. The Hybrid Method for Accurate Patent Classification. Lobachevskii J Math 40, 1873–1880 (2019). https://doi.org/10.1134/S1995080219110325

  • DOI: https://doi.org/10.1134/S1995080219110325
