TechMiner: Extracting Technologies from Academic Publications

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10024)


In recent years we have seen the emergence of a variety of scholarly datasets. Typically these capture ‘standard’ scholarly entities and their connections, such as authors, affiliations, venues, publications, and citations. However, as the repositories grow and the technology improves, researchers are adding new entities to these repositories to develop a richer model of the scholarly domain. In this paper, we introduce TechMiner, a new approach that combines NLP, machine learning and semantic technologies to mine technologies from research publications and generate an OWL ontology describing their relationships with other research entities. The resulting knowledge base can support a number of tasks, such as richer semantic search, which can exploit the technology dimension to support better retrieval of publications; richer expert search; monitoring the emergence and impact of new technologies, both within and across scientific fields; and studying the scholarly dynamics associated with the emergence of new technologies. TechMiner was evaluated on a manually annotated gold standard, and the results indicate that it significantly outperforms alternative NLP approaches and that its semantic features significantly improve both recall and precision.


Scholarly data, Ontology learning, Bibliographic data, Scholarly ontologies, Data mining



We thank Elsevier for providing us with access to the Scopus repository of scholarly data. We also acknowledge grant n° 159047 from the Swiss National Foundation.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Knowledge Media Institute, The Open University, Milton Keynes, UK
  2. Department of Computer Science, University of Oxford, Oxford, UK
