Classifying Companies by Industry Using Word Embeddings

  • Martin Lamby
  • Daniel Isemann
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10859)


This contribution investigates whether companies cluster together according to their field of industry using word embeddings and in particular word2vec models on general news text. We explore to what extent this can be utilised for identifying company-industry affiliations automatically. We present an experiment in which we test seven different classification methods on four different word2vec models trained on a 600-million-word corpus from the Guardian newspaper. For training and testing our classifiers we obtained company-industry assignments from the Dbpedia knowledge base for those companies occurring in both the news corpus and Dbpedia. The majority of the 28 scrutinized classification paradigms displays F1 scores near 80%, with some exceeding this threshold. We found differences across industries, with some industries appearing to be more distinctly defined, while others are less clearly delineated from neighbouring fields. To test the robustness of our approach we conducted a field test, identifying candidate companies absent from Dbpedia with a named-entity recognizer, establishing ground truth on company and industry status manually through web search. We found classifier performance to be less reliable in the field test and of varying quality across industries. with precision at 25 values ranging from 16% to 88%, depending on industry. In summary, the presented approach showed some promise, but also some limitations and may in its current form be only robust enough for semi-automated classification.


Company-industry affiliation Company classification from unstructured news text Field of industry clustering Word embedding word2vec 


  1. 1.
    Jordan, M.I., Mitchell, T.M.: Machine learning: trends, perspectives, and prospects. Science 349, 255–260 (2015)MathSciNetCrossRefGoogle Scholar
  2. 2.
    Gupta, V., Lehal, G.S.: A survey of text mining techniques and applications. J. Emerg. Technol. Web Intell. 1, 60–76 (2009)Google Scholar
  3. 3.
    Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E.D., Gutierrez, J.B., Kochut, K.: A brief survey of text mining: classification, clustering and extraction techniques (2017). arXiv Preprint: arXiv:1707.02919
  4. 4.
    Kahle, K.M., Walkling, R.A.: The impact of industry classifications on financial research. J. Financ. Quant. Anal. 31, 309 (1996)CrossRefGoogle Scholar
  5. 5.
    Gopikrishnan, P., Rosenow, B., Plerou, V., Stanley, H.E.: Identifying business sectors from stock price fluctuations (2000)Google Scholar
  6. 6.
    Bernstein, A., Clearwater, S., Provost, F.: The relational vector-space model and industry classification. In: Proceedings of IJCAI 2003 Workshop on Learning Statistical Models from Relational Data, pp. 8–18 (2003)Google Scholar
  7. 7.
    Drury, B., Almeida, J.J.: Identification, extraction and population of collective named entities from business news. In: Entity 2010 – Workshop on Resources and Evaluation for Entity Resolution and Entity Management, LREC 2010, pp. 19–22 (2010)Google Scholar
  8. 8.
    Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th conference on Computational linguistics, vol. 2, pp. 539–545 (1992)Google Scholar
  9. 9.
    Etzioni, O., Cafarella, M., Downey, D., Popescu, A.M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165, 91–134 (2005)CrossRefGoogle Scholar
  10. 10.
    Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, vol. 2, pp. 1003–1011 (2009)Google Scholar
  11. 11.
    Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013). arXiv Preprint: arXiv:1301.3781
  12. 12.
    Mikolov, T., Yih, W.-T., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of NAACL-HLT, pp. 746–751 (2013)Google Scholar
  13. 13.
    Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)Google Scholar
  14. 14.
    Fu, R., Guo, J., Qin, B., Che, W., Wang, H., Liu, T.: Learning semantic hierarchies via word embeddings. In: ACL, vol. 1, pp. 1199–1209 (2014)Google Scholar
  15. 15.
    Sugathadasa, K., Ayesha, B., de Silva, N., Perera, A.S., Jayawardana, V., Lakmal, D., Perera, M.: Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity (2017). arXiv Preprint: arXiv:1706.01967
  16. 16.
    Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc., Sebastopol (2009)zbMATHGoogle Scholar
  17. 17.
    Rehurek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Citeseer (2010)Google Scholar
  18. 18.
    Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ISWC/ASWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007). Scholar
  19. 19.
    Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)MathSciNetzbMATHGoogle Scholar
  20. 20.
    Van Der Maaten, L.J.P., Hinton, G.E.: Visualizing high-dimensional data using t-sne. J. Mach. Learn. Res. 9, 2579–2605 (2008)zbMATHGoogle Scholar
  21. 21.
    Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by gibbs sampling. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 363–370. Association for Computational Linguistics (2005)Google Scholar

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. 1.Chair of Media InformaticsUniversity of RegensburgRegensburgGermany

Personalised recommendations