MACJa: Metadata and Citations Jailbreaker

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 548)


This paper presents the Metadata And Citations Jailbreaker (MACJa, IPA /ˈmatsja/), a method for processing research papers stored as PDF files in order to extract relevant semantic data and publish them in an RDF triplestore according to the Semantic Publishing And Referencing (SPAR) Ontologies. In particular, MACJa extracts all the information needed to address the queries of the Semantic Publishing Challenge 2015 (Task 2) by combining Natural Language Processing techniques (i.e., Combinatory Categorial Grammar, Discourse Representation Theory, Linguistic Frames), Semantic Web technologies, and good Ontology Design practices (i.e., Content Analysis, Ontology Design Patterns, Discourse Referent Extraction and Linking, Topic Extraction).
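
As a rough illustration of the final publishing step only, the sketch below serialises one paper and its citations as N-Triples using terms from two SPAR ontologies (FaBiO and CiTO) plus Dublin Core. It is not MACJa's implementation: the helper names (`iri`, `literal`, `paper_triples`, `to_ntriples`), the choice of `fabio:ConferencePaper`, and the `example.org` identifiers are assumptions made for the example, and the PDF and NLP extraction stages are not shown.

```python
# Illustrative sketch only: serialise a paper and its citations as N-Triples
# with SPAR terms. Helper names and example.org IRIs are hypothetical.
FABIO = "http://purl.org/spar/fabio/"
CITO = "http://purl.org/spar/cito/"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(value):
    """Wrap an absolute IRI in N-Triples angle brackets."""
    return f"<{value}>"


def literal(text):
    """Render a plain string literal, escaping characters N-Triples reserves."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def paper_triples(paper_iri, title, cited_iris):
    """Describe one paper: its type, its title, and one cito:cites per reference."""
    triples = [
        (iri(paper_iri), iri(RDF_TYPE), iri(FABIO + "ConferencePaper")),
        (iri(paper_iri), iri(DCTERMS + "title"), literal(title)),
    ]
    for cited in cited_iris:
        triples.append((iri(paper_iri), iri(CITO + "cites"), iri(cited)))
    return triples


def to_ntriples(triples):
    """Join (subject, predicate, object) tuples into N-Triples lines."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    demo = paper_triples(
        "http://example.org/paper/1",
        "An example paper",
        ["http://example.org/paper/2", "http://example.org/paper/3"],
    )
    print(to_ntriples(demo))
```

The resulting text could then be loaded into any triplestore that accepts N-Triples; a real pipeline would more likely rely on an RDF library than on hand-built strings.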


Keywords: MACJa, SPAR Ontologies, Semantic Publishing



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Semantic Technology Laboratory, ISTC-CNR, Rome, Italy
  2. Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
