The Hybrid Method for Accurate Patent Classification
This article presents a stacking of two approaches to patent classification. The first is a linguistically supported k-nearest neighbors algorithm that searches for topically similar documents by comparing vectors of lexical descriptors. The second is fastText, a word-embedding-based method in which a sentence (or document) vector is obtained by averaging n-gram embeddings, and a multinomial logistic regression then uses these vectors as features. On Russian and English datasets, the stacking classifier outperforms each of the single classifiers.
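The following is a minimal sketch of how the outputs of the two base classifiers could be combined into stacking features; it is an illustration under stated assumptions, not the authors' implementation. The toy documents, the class labels `A61` and `H04`, the helper names (`cosine`, `knn_scores`, `stack_features`), and the placeholder probabilities standing in for the fastText-style classifier are all hypothetical. The meta-classifier that would be trained on these features is not shown.

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_scores(query, train, classes, k=2):
    """Similarity-weighted, normalized class scores from the k nearest documents."""
    neighbours = sorted(train, key=lambda d: cosine(query, d[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vec, label in neighbours:
        scores[label] += cosine(query, vec)
    total = sum(scores.values()) or 1.0
    return [scores[c] / total for c in classes]

def stack_features(knn_probs, embed_probs):
    """Meta-features: concatenated per-class outputs of both base classifiers."""
    return knn_probs + embed_probs

# Toy lexical-descriptor vectors (plain term counts) with hypothetical labels.
classes = ["A61", "H04"]
train = [
    (Counter("drug dose tablet".split()), "A61"),
    (Counter("antenna signal radio".split()), "H04"),
    (Counter("tablet coating drug".split()), "A61"),
]
query = Counter("drug tablet release".split())

knn_probs = knn_scores(query, train, classes)
embed_probs = [0.7, 0.3]  # assumed output of the embedding-based classifier
meta = stack_features(knn_probs, embed_probs)
print(meta)
```

In a full stacking setup, such meta-feature vectors would be produced for held-out training documents and used to fit a second-level classifier, so that it learns how much to trust each base model per class.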
Keywords and phrases: stacking, similarity search, KNN, word embeddings, fastText, patent classification
We are grateful to the reviewers for careful reading of the manuscript and helpful remarks.
This article presents the research results of the project “Text mining tools for big data” as a part of the program supporting Technical Leadership Centers of the National Technological Initiative “Center for Big Data Storage and Processing” at the Moscow State University (Agreement with Fund supporting the NTI-projects no. 13/1251/2018 11.12.2018). The reported study is partially funded by the Russian Foundation for Basic Research (project no. 16-29-12929) and with the support of the “RUDN University Program 5–100.”