LearnSec: A Framework for Full Text Analysis

  • Carlos Gonçalves
  • E. L. Iglesias
  • L. Borrajo
  • Rui Camacho
  • A. Seara Vieira
  • Célia Talma Gonçalves
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10870)


Large corpora of scientific research papers have been available for a long time. However, most of those corpora store only the title and the abstract of each paper. For some domains this information may not be enough to achieve high performance in text mining tasks. This problem has recently been reduced by the growing availability of full-text scientific research papers. A full-text version provides more detailed information but, on the other hand, requires a large amount of data to be processed. A priori, it is difficult to know whether the extra work of full-text analysis has a significant impact on the performance of text mining tasks, or whether the effect depends on the scientific domain or the specific corpus under analysis.

The goal of this paper is to present a framework for full-text analysis, called LearnSec, which incorporates domain-specific knowledge and information about the content of the document sections to improve the classification process with propositional and relational learning.

To demonstrate the usefulness of the tool, we process a scientific corpus based on OHSUMED, generating an attribute/value dataset in Weka format and a First Order Logic dataset in Inductive Logic Programming (ILP) format. Results show a successful assessment of the framework.
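
As a rough illustration of the two output views mentioned above, the following minimal sketch turns a toy sectioned corpus into an attribute/value table in ARFF syntax (the text format Weka reads) and into Prolog-style ground facts of the kind an ILP system can consume. It is not the LearnSec implementation: the toy documents, the function names to_arff and to_facts, the section_term attribute naming, and the has_term/3 predicate are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy corpus: each document has a class label and named sections.
docs = [
    {"id": "d1", "label": "relevant",
     "sections": {"title": "gene expression", "methods": "gene assay"}},
    {"id": "d2", "label": "irrelevant",
     "sections": {"title": "traffic model", "methods": "road survey"}},
]

def to_arff(docs, relation="toy_corpus"):
    """Attribute/value view: one numeric count per (section, term) pair."""
    counts = []
    for d in docs:
        c = Counter()
        for section, text in d["sections"].items():
            for term in text.split():
                c[f"{section}_{term}"] += 1
        counts.append(c)
    attrs = sorted(set().union(*counts))          # all attributes seen in the corpus
    labels = sorted({d["label"] for d in docs})   # nominal class values
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {a} numeric" for a in attrs]
    lines += [f"@attribute class {{{','.join(labels)}}}", "", "@data"]
    for d, c in zip(docs, counts):
        # Counter returns 0 for attributes absent from this document.
        lines.append(",".join([str(c[a]) for a in attrs] + [d["label"]]))
    return "\n".join(lines)

def to_facts(docs):
    """Relational view: ground facts that keep the section each term came from."""
    facts = []
    for d in docs:
        facts.append(f"{d['label']}({d['id']}).")
        for section, text in d["sections"].items():
            for term in sorted(set(text.split())):
                facts.append(f"has_term({d['id']}, {section}, {term}).")
    return facts

print(to_arff(docs))
print("\n".join(to_facts(docs)))
```

The attribute/value view flattens section information into attribute names, whereas the relational view keeps the document, section, and term as separate arguments, which is what allows a relational learner to state rules over sections.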


Keywords: Full text analyses · Text preprocessing · Text mining · Use of background knowledge · Inductive Logic Programming



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Carlos Gonçalves (1, 3)
  • E. L. Iglesias (1)
  • L. Borrajo (1)
  • Rui Camacho (2, 3)
  • A. Seara Vieira (1)
  • Célia Talma Gonçalves (4, 5)

  1. Higher Technical School of Computer Engineering, University of Vigo, Ourense, Spain
  2. FEUP-U.Porto, Porto, Portugal
  3. LIAAD/INESC TEC, Porto, Portugal
  4. ISCAP-P.Porto, S. Mamede de Infesta, Portugal
  5. LIACC, U.Porto, Porto, Portugal
