LearnSec: A Framework for Full Text Analysis
Large corpora of scientific research papers have been available for a long time. However, most of those corpora store only the title and the abstract of each paper. For some domains this information may not be enough to achieve high performance in text mining tasks. This problem has recently been reduced by the growing availability of full text scientific research papers. A full text version provides more detailed information but, on the other hand, a much larger amount of data needs to be processed. A priori, it is difficult to know whether the extra work of full text analysis has a significant impact on the performance of text mining tasks, or whether the effect depends on the scientific domain or the specific corpus under analysis.
The goal of this paper is to present a framework for full text analysis, called LearnSec, which incorporates domain-specific knowledge and information about the content of the document sections to improve the classification process with propositional and relational learning.
To demonstrate the usefulness of the tool, we process a scientific corpus based on OHSUMED, generating an attribute/value dataset in Weka format and a First Order Logic dataset in Inductive Logic Programming (ILP) format. Results show a successful assessment of the framework.
Keywords: Full text analyses, Text preprocessing, Text mining, Use of background knowledge, Inductive Logic Programming
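To make the two output representations mentioned in the abstract more concrete, the following is a minimal sketch, not LearnSec's actual implementation. It assumes each document is already split into named sections and produces (a) an attribute/value table in Weka's ARFF syntax, with one binary attribute per section and term pair, and (b) a list of first-order facts of the kind an ILP system could read as background knowledge. The function names, the `has_term/3` predicate, the document dictionary layout, and the toy documents are all hypothetical and chosen only for illustration.

```python
import re


def tokenize(text):
    """Lowercase and keep alphabetic runs; a stand-in for real preprocessing."""
    return re.findall(r"[a-z]+", text.lower())


def section_terms(doc):
    """Return the set of (section, term) pairs that occur in one document."""
    return {(sec, tok)
            for sec, text in doc["sections"].items()
            for tok in tokenize(text)}


def to_arff(docs, relation="fulltext"):
    """Propositional view: one binary attribute per (section, term) pair."""
    features = sorted(set().union(*(section_terms(d) for d in docs)))
    classes = sorted({d["label"] for d in docs})
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {sec}_{tok} {{0,1}}" for sec, tok in features]
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for d in docs:
        present = section_terms(d)
        row = ["1" if f in present else "0" for f in features]
        lines.append(",".join(row + [d["label"]]))
    return "\n".join(lines)


def to_facts(docs):
    """Relational view: hypothetical has_term(Doc, Section, Term) facts."""
    return [f"has_term({d['id']}, {sec}, {tok})."
            for d in docs
            for sec, tok in sorted(section_terms(d))]


# Toy documents (invented for this example, not taken from any corpus).
docs = [
    {"id": "d1", "label": "pos",
     "sections": {"title": "Protein folding",
                  "abstract": "Folding of a protein"}},
    {"id": "d2", "label": "neg",
     "sections": {"title": "Graph search",
                  "abstract": "Search in a graph"}},
]

arff = to_arff(docs)
facts = to_facts(docs)
print(arff)
print("\n".join(facts))
```

Keeping the section name in both the attribute name and the fact arguments is one simple way to let a learner distinguish a term seen in the title from the same term seen elsewhere in the full text, which is the kind of section information the abstract refers to.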