Keyword Extraction from Company Websites for the Development of Regional Knowledge Maps

  • Christian Wartena
  • Montserrat Garcia-Alsina
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 454)


Regional Innovation Systems describe the relations between actors, structures and infrastructures in a region in order to stimulate innovation and regional development. For these systems, the collection and organization of information is crucial. In this paper we investigate the possibilities of extracting information from company websites. In particular, we consider faceted classification of companies by keyword extraction using a specialized thesaurus. First, we identify a number of challenges that arise when extracting information about companies from their websites. Then we describe a small-scale experiment in which keywords related to economic sectors and commodities are extracted from the websites of over 200 companies. The experiment shows that the approach is at least feasible for the commodities facet. For the sectors facet, the simple keyword extraction methods used do not perform well. We find that good coverage of the words in the text by the thesaurus is crucial, and hence that the results can be improved by adding more alternative labels to the thesaurus terms. Furthermore, we find that weighting terms according to their relations to other terms on the website gives better results than the classical tf.idf weighting, which relies on inverse document frequency.
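The two weighting ideas contrasted in the abstract can be illustrated with a rough Python sketch. It matches website text against thesaurus labels (preferred and alternative), then scores each concept once with tf.idf and once with a simple relation-based weight. The toy thesaurus, the related-concept links, the sample texts and the relation formula (term frequency times one plus the number of related concepts found on the same site) are all assumptions made for illustration; they are not the data or the exact method used in the paper.

```python
import math
import re
from collections import Counter

# Hypothetical toy thesaurus: concept -> labels (preferred and alternative).
THESAURUS = {
    "furniture": ["furniture", "chairs", "tables"],
    "software": ["software", "apps"],
    "wood": ["wood", "timber"],
}
# Hypothetical "related" links between thesaurus concepts.
RELATED = {
    "furniture": {"wood"},
    "wood": {"furniture"},
    "software": set(),
}
# Reverse index from every label to its concept.
LABEL_TO_CONCEPT = {
    label: concept for concept, labels in THESAURUS.items() for label in labels
}


def concept_counts(text):
    """Count mentions of each concept via any of its single-word labels."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(LABEL_TO_CONCEPT[t] for t in tokens if t in LABEL_TO_CONCEPT)


def tfidf(sites):
    """Classical tf.idf: term frequency times log(N / document frequency)."""
    counts = [concept_counts(s) for s in sites]
    n = len(sites)
    df = Counter(c for cnt in counts for c in cnt)
    return [{c: tf * math.log(n / df[c]) for c, tf in cnt.items()} for cnt in counts]


def relation_weighted(sites):
    """Assumed alternative: boost a concept by related concepts on the same site."""
    scores = []
    for cnt in (concept_counts(s) for s in sites):
        scores.append(
            {c: tf * (1 + sum(1 for r in RELATED[c] if r in cnt)) for c, tf in cnt.items()}
        )
    return scores


# Invented sample "websites", one string per company.
SITES = [
    "We build chairs and tables from local timber. Our furniture is handmade.",
    "We develop software and apps for furniture retailers.",
    "Timber and wood products for construction.",
]

if __name__ == "__main__":
    for site, a, b in zip(SITES, tfidf(SITES), relation_weighted(SITES)):
        print(site)
        print("  tf.idf:   ", a)
        print("  relations:", b)
```

In this toy setting, adding "chairs" and "tables" as alternative labels is what lets the first site register three mentions of the furniture concept rather than one, which mirrors the abstract's point about thesaurus coverage.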


Weighting Scheme; Economic Sector; Inverse Document Frequency; National Innovation System; Regional Innovation System



The research presented in this paper was partially funded by the Spanish Ministry of Education, Culture and Sport (Ref. CAS 12/00155).



Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. University of Applied Sciences and Arts Hannover, Hannover, Germany
  2. Universitat Oberta de Catalunya, Barcelona, Spain
