Lobachevskii Journal of Mathematics, Volume 40, Issue 11, pp 1873–1880

The Hybrid Method for Accurate Patent Classification

  • V. V. Yadrintsev
  • I. V. Sochenkov


This article is devoted to stacking two approaches to patent classification. The first is a linguistically supported k-nearest neighbors algorithm that searches for topically similar documents by comparing vectors of lexical descriptors. The second is fastText, a word-embedding-based method in which the sentence (or document) vector is obtained by averaging the n-gram embeddings, and a multinomial logistic regression then uses these vectors as features. We show on Russian and English datasets that the stacking classifier achieves better results than either single classifier.
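The stacking idea described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the four toy documents, the class labels `A61` and `H04`, the use of raw token counts in place of linguistically derived lexical descriptors, the frozen random n-gram embeddings (fastText learns its embeddings during training), the leave-one-out scheme for producing meta-features, and the omission of bias terms.

```python
import math
import random
import zlib
from collections import Counter

CLASSES = ["A61", "H04"]  # hypothetical class labels, for illustration only
TRAIN = [  # toy documents, not taken from the paper's datasets
    ("tablet composition for treating infection", "A61"),
    ("pharmaceutical composition with antibiotic", "A61"),
    ("wireless network signal transmission", "H04"),
    ("antenna for mobile signal reception", "H04"),
]
DIM, BUCKETS = 8, 512  # embedding size and number of n-gram hash buckets

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values()) * sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def knn_scores(text, train, k=3):
    """Base model 1: similarity-weighted class votes of the k most similar documents."""
    q = Counter(text.split())  # assumption: token counts stand in for lexical descriptors
    top = sorted(((cosine(q, Counter(t.split())), c) for t, c in train), reverse=True)[:k]
    total = sum(s for s, _ in top)
    if total == 0:
        return [1 / len(CLASSES)] * len(CLASSES)
    return [sum(s for s, c in top if c == cls) / total for cls in CLASSES]

def ngram_ids(text, n=3):
    """Hash the character n-grams of each padded word into a fixed number of buckets."""
    words = ("<" + t + ">" for t in text.split())
    grams = [w[i:i + n] for w in words for i in range(len(w) - n + 1)]
    return [zlib.crc32(g.encode()) % BUCKETS for g in grams]

def doc_vector(ids, emb):
    """Document vector = average of the embeddings of its n-grams."""
    if not ids:
        return [0.0] * DIM
    return [sum(emb[i][d] for i in ids) / len(ids) for d in range(DIM)]

def linear(x, W):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def train_softmax(X, y, epochs=200, lr=0.5):
    """Multinomial logistic regression fitted by plain stochastic gradient descent."""
    W = [[0.0] * len(X[0]) for _ in CLASSES]
    for _ in range(epochs):
        for x, label in zip(X, y):
            p = softmax(linear(x, W))
            for c, row in enumerate(W):
                g = p[c] - (1.0 if CLASSES[c] == label else 0.0)
                for d in range(len(row)):
                    row[d] -= lr * g * x[d]
    return W

def meta_features(text, train, emb):
    """Concatenate the class scores produced by both base models for one document."""
    X = [doc_vector(ngram_ids(t), emb) for t, _ in train]
    W = train_softmax(X, [c for _, c in train])  # base model 2, trained on `train`
    p_emb = softmax(linear(doc_vector(ngram_ids(text), emb), W))
    return knn_scores(text, train) + p_emb

rng = random.Random(0)
# Assumption: random, frozen embeddings; fastText would learn these.
EMB = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]

# Leave-one-out base predictions, so the meta-classifier is never trained on
# scores from a base model that has already seen the same document.
meta_X = [meta_features(t, TRAIN[:i] + TRAIN[i + 1:], EMB) for i, (t, _) in enumerate(TRAIN)]
meta_W = train_softmax(meta_X, [c for _, c in TRAIN])

def predict(text):
    """Stacked prediction: meta-classifier applied to both base models' scores."""
    probs = softmax(linear(meta_features(text, TRAIN, EMB), meta_W))
    return CLASSES[probs.index(max(probs))], probs
```

Calling `predict("antibiotic tablet composition")` returns a class label together with the meta-classifier's probability vector. In this sketch the meta-classifier sees only the two base models' class scores, so it learns how much to trust each base model per class. It never sees the raw text.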

Keywords and phrases

stacking, similarity search, KNN, word embeddings, fastText, patent classification





We are grateful to the reviewers for careful reading of the manuscript and helpful remarks.


This article presents the research results of the project “Text mining tools for big data” as a part of the program supporting Technical Leadership Centers of the National Technological Initiative “Center for Big Data Storage and Processing” at the Moscow State University (Agreement with Fund supporting the NTI-projects no. 13/1251/2018 11.12.2018). The reported study is partially funded by the Russian Foundation for Basic Research (project no. 16-29-12929) and with the support of the “RUDN University Program 5–100.”



Copyright information

© Pleiades Publishing, Ltd. 2019

Authors and Affiliations

  1. Federal Research Center Computer Science and Control of the Russian Academy of Sciences, Moscow, Russia
  2. Peoples’ Friendship University of Russia (RUDN University), Moscow, Russia
  3. Lomonosov Moscow State University, Moscow, Russia
