LearnSec: A Framework for Full Text Analysis

Gonçalves, Carlos; Iglesias, E. L.; Borrajo, L.; Camacho, Rui; Seara Vieira, A.; Gonçalves, Célia Talma

doi:10.1007/978-3-319-92639-1_42


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10870)

Included in the following conference series:

International Conference on Hybrid Artificial Intelligence Systems


Abstract

Large corpora of scientific research papers have been available for a long time. However, most of those corpora store only the title and the abstract of each paper. For some domains this information may not be enough to achieve high performance in text mining tasks. This problem has recently been reduced by the growing availability of full-text scientific research papers. A full-text version provides more detailed information but, on the other hand, requires a large amount of data to be processed. A priori, it is difficult to know whether the extra work of full-text analysis has a significant impact on the performance of text mining tasks, or whether the effect depends on the scientific domain or the specific corpus under analysis.

The goal of this paper is to present a framework for full-text analysis, called LearnSec, which incorporates domain-specific knowledge and information about the content of the document sections to improve the classification process with propositional and relational learning.

To demonstrate the usefulness of the tool, we process a scientific corpus based on OHSUMED, generating an attribute/value dataset in Weka format and a First Order Logic dataset in Inductive Logic Programming (ILP) format. Results show a successful assessment of the framework.
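
As a rough illustration of the two output representations mentioned above, the following Python sketch turns a tiny section-tagged corpus into an ARFF string (the attribute/value format read by Weka) and into Prolog-style ground facts of the kind an ILP system could consume. It is not LearnSec's implementation: the toy documents, the whitespace tokenisation, the section_term attribute naming and the predicate names class/2 and has_term/3 are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy corpus: each document has a label and text split by section.
DOCS = [
    {"id": "d1", "label": "relevant",
     "sections": {"title": "gene expression",
                  "abstract": "gene regulation in yeast"}},
    {"id": "d2", "label": "nonrelevant",
     "sections": {"title": "retrieval evaluation",
                  "abstract": "test collection for retrieval"}},
]


def section_term_counts(doc):
    """Count occurrences of each (section, term) pair, splitting on whitespace."""
    counts = Counter()
    for section, text in doc["sections"].items():
        for term in text.lower().split():
            counts[(section, term)] += 1
    return counts


def to_arff(docs, relation="corpus"):
    """Build an ARFF string with one numeric attribute per (section, term) pair."""
    per_doc = [section_term_counts(d) for d in docs]
    features = sorted(set().union(*per_doc))
    labels = sorted({d["label"] for d in docs})
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {sec}_{term} numeric" for sec, term in features]
    lines.append("@attribute class {" + ",".join(labels) + "}")
    lines += ["", "@data"]
    for doc, counts in zip(docs, per_doc):
        row = [str(counts.get(f, 0)) for f in features]
        lines.append(",".join(row + [doc["label"]]))
    return "\n".join(lines)


def to_facts(docs):
    """Build ground facts class/2 and has_term/3 (illustrative predicate names)."""
    facts = []
    for doc in docs:
        facts.append(f"class({doc['id']}, {doc['label']}).")
        for sec, term in sorted(section_term_counts(doc)):
            facts.append(f"has_term({doc['id']}, {sec}, {term}).")
    return facts


if __name__ == "__main__":
    print(to_arff(DOCS))
    print("\n".join(to_facts(DOCS)))
```

Keeping the section name in both the attribute name and the has_term/3 fact is one simple way to preserve where in the document a term occurred. A propositional learner sees it as a separate column, whereas a relational learner can treat the section as an argument to generalise over.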



Author information

Authors and Affiliations

Higher Technical School of Computer Engineering, University of Vigo, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
Carlos Gonçalves, E. L. Iglesias, L. Borrajo & A. Seara Vieira
FEUP-U.Porto, Rua Dr. Roberto Frias s/n, 4200-465, Porto, Portugal
Rui Camacho
LIAAD/INESC TEC, Porto, Portugal
Carlos Gonçalves & Rui Camacho
ISCAP-P.Porto, Rua Jaime Lopes Amorim, s/n, 4465-004, S. Mamede de Infesta, Portugal
Célia Talma Gonçalves
LIACC, U.Porto, Porto, Portugal
Célia Talma Gonçalves


Corresponding author

Correspondence to Carlos Gonçalves.

Editor information

Editors and Affiliations

Department of Mine Operating and Prospection, University of Oviedo, Oviedo, Spain
Francisco Javier de Cos Juez
Department of Computer Science, University of Oviedo, Oviedo, Spain
José Ramón Villar
Department of Computer Science, University of Oviedo, Oviedo, Spain
Enrique A. de la Cal
Department of Civil Engineering, University of Burgos, Burgos, Spain
Álvaro Herrero
University of A Coruña, A Coruña, Spain
Héctor Quintián
University of Salamanca, Salamanca, Spain
José António Sáez
University of Salamanca, Salamanca, Spain
Emilio Corchado


About this paper

Cite this paper

Gonçalves, C., Iglesias, E.L., Borrajo, L., Camacho, R., Seara Vieira, A., Gonçalves, C.T. (2018). LearnSec: A Framework for Full Text Analysis. In: de Cos Juez, F., et al. Hybrid Artificial Intelligent Systems. HAIS 2018. Lecture Notes in Computer Science, vol 10870. Springer, Cham. https://doi.org/10.1007/978-3-319-92639-1_42


DOI: https://doi.org/10.1007/978-3-319-92639-1_42
Published: 08 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92638-4
Online ISBN: 978-3-319-92639-1
eBook Packages: Computer Science, Computer Science (R0)
