Abstract
This article presents a stacking of two approaches to patent classification. The first is a linguistically supported k-nearest neighbors algorithm that searches for topically similar documents by comparing vectors of lexical descriptors. The second is fastText, a word-embedding-based method in which a sentence (or document) vector is obtained by averaging n-gram embeddings, and a multinomial logistic regression then uses these vectors as features. On Russian and English datasets, we show that the stacking classifier outperforms each of the single classifiers.
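The stacking idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything below is an assumption made for illustration: the toy documents and labels, the plain term-frequency cosine kNN (standing in for the linguistically supported descriptor search), the fixed pseudo-random n-gram vectors produced by the hypothetical `ngram_vector` (standing in for learned fastText embeddings), the use of softmax regression as the meta-classifier, and the simple holdout split used to train it.

```python
import math
import zlib
from collections import Counter

DIM = 16  # embedding size, chosen arbitrarily for this sketch

def ngram_vector(ngram):
    # Hypothetical stand-in for a learned n-gram embedding: a fixed,
    # deterministic pseudo-random vector with components in [-1, 1].
    return [(zlib.crc32(f"{ngram}#{i}".encode()) % 2001) / 1000.0 - 1.0
            for i in range(DIM)]

def doc_vector(text, n=3):
    # Document vector = average of the character n-gram vectors of its words.
    grams = []
    for word in text.lower().split():
        padded = f"<{word}>"
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    if not grams:
        return [0.0] * DIM
    vecs = [ngram_vector(g) for g in grams]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict_softmax(W, x):
    # Each row of W holds the weights of one class; the last entry is the bias.
    return softmax([sum(w[j] * x[j] for j in range(len(x))) + w[-1] for w in W])

def train_softmax(X, y, n_classes, epochs=200, lr=0.5):
    # Multinomial logistic regression fitted by stochastic gradient descent.
    d = len(X[0])
    W = [[0.0] * (d + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        for x, label in zip(X, y):
            p = predict_softmax(W, x)
            for c in range(n_classes):
                g = p[c] - (1.0 if c == label else 0.0)
                for j in range(d):
                    W[c][j] -= lr * g * x[j]
                W[c][d] -= lr * g
    return W

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_proba(text, docs, labels, n_classes, k=3):
    # Class scores = summed cosine similarity of the k most similar documents.
    q = Counter(text.lower().split())
    sims = sorted(((cosine(q, Counter(d.lower().split())), l)
                   for d, l in zip(docs, labels)), reverse=True)[:k]
    scores = [1e-6] * n_classes  # small constant avoids an all-zero vector
    for s, l in sims:
        scores[l] += s
    total = sum(scores)
    return [s / total for s in scores]

def meta_features(text, docs, labels, W_ft, n_classes):
    # Stacking input: class probabilities of both base classifiers, concatenated.
    return (knn_proba(text, docs, labels, n_classes)
            + predict_softmax(W_ft, doc_vector(text)))

# Toy data (invented): class 0 = mechanical, class 1 = biomedical.
base_docs = ["engine piston fuel combustion", "turbine rotor blade engine",
             "protein antibody cell therapy", "vaccine antigen immune cell"]
base_labels = [0, 0, 1, 1]
hold_docs = ["fuel injection engine valve", "immune therapy antibody dose",
             "rotor shaft bearing turbine", "cell culture protein assay"]
hold_labels = [0, 1, 0, 1]

# Level 1: fit the embedding-based classifier on the base set.
W_ft = train_softmax([doc_vector(d) for d in base_docs], base_labels, 2)
# Level 2: fit the meta-classifier on base-classifier outputs for held-out docs.
W_meta = train_softmax(
    [meta_features(d, base_docs, base_labels, W_ft, 2) for d in hold_docs],
    hold_labels, 2)

features = meta_features("piston engine cooling", base_docs, base_labels, W_ft, 2)
probs = predict_softmax(W_meta, features)
print(probs)
```

The meta-classifier is trained on documents that the base classifiers did not see, so it learns how much to trust each base output rather than memorizing their training-set behavior. With more data, out-of-fold predictions from cross-validation would be the more usual way to build these meta-features.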
Acknowledgments
We are grateful to the reviewers for careful reading of the manuscript and helpful remarks.
Funding
This article presents the research results of the project “Text mining tools for big data” as part of the program supporting Technical Leadership Centers of the National Technological Initiative “Center for Big Data Storage and Processing” at the Moscow State University (Agreement with the Fund supporting the NTI projects no. 13/1251/2018, dated 11.12.2018). The reported study was partially funded by the Russian Foundation for Basic Research (project no. 16-29-12929) and supported by the “RUDN University Program 5–100.”
Additional information
Submitted by Vl. V. Voevodin
About this article
Cite this article
Yadrintsev, V.V., Sochenkov, I.V. The Hybrid Method for Accurate Patent Classification. Lobachevskii J Math 40, 1873–1880 (2019). https://doi.org/10.1134/S1995080219110325
DOI: https://doi.org/10.1134/S1995080219110325