Abstract
This paper investigates how high-quality, vocabulary-based classifiers, useful for competitive intelligence, can be found for relatively small corpora of publicly available documents. Two corpora of recent annual reports are examined and compared, one in English and one in Chinese. The paper tests whether vocabularies of content words and of function words can predict whether firms are relatively innovative. We find that the tested vocabularies do produce effective indicators, or classifiers, and, surprisingly, that function words are especially effective. The paper also provides extensive conceptual and theoretical background to frame the investigation in the context of an EMCUT problematic, that of mapping entities to classification schemes using information derived from text.
Notes
We are aware of the distinction that is made in parts of the relevant literature between assignment and matching, where matching is assignment when the entities on all sides are players in a game. Such, for example, is the case in the two-sided matching problem (Gale and Shapley 1962). We note, however, that although we might substitute assignment for matching in the name of our problem, the resulting acronym is less felicitous, there is little risk of confusion in keeping the present name, and indeed there will often be an element of strategic interaction in the composition of the relevant texts.
References
Andrew JP, Manget J, Michael D, Taylor A, Zablit H (2010) Innovation 2010: a return to prominence—and the emergence of a new world order. Boston Consulting Group, Boston, MA
Bird S, Klein E, Loper E (2009) Natural language processing with Python. O’Reilly, Sebastopol, CA
Blair DC, Kimbrough SO (2002) Exemplary documents: a foundation for information retrieval design. Inf Process Manage 38(3):363–379
Blair DC, Maron ME (1985) An evaluation of retrieval effectiveness for a full-text document-retrieval system. Commun ACM 28(3):289–299
Bowman EH (1973) Corporate social responsibility and the investor. J Contemp Bus 2:21–43
Bowman EH (1984) Content analysis of annual reports for corporate strategy and risk. Interfaces 14(1):61–71
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. CRC Press, Boca Raton, FL
Camiciottoli BC (2010) Discourse connectives in genres of financial disclosure: earnings presentations versus earnings releases. J Pragmat 42(3):650–663
Chang LC, Hsu CH, Chang YY (2012) The construction of Taiwan’s financial early warning system: the text-mining technique-based analysis. Taiwan Bank Q 63(1):182–217. In Chinese. http://www.bot.com.tw/Publications/Quarterly/Documents/63_1/63_1_7.pdf
Chen GT, Kimbrough S, Lee T (2004) A note on automated support for product application discovery. In: Dutta A, Goes P (eds) Proceedings of the fourteenth annual workshop on information technologies and systems (WITS2004), Washington, DC, pp 128–133
Chen R, Sharman R, Rao H, Upadhyaya S (2007) Design principles for critical incident response systems. Inf Syst E-Bus Manage 5:201–227. doi:10.1007/s10257-007-0046-0
Chou CH, Sinha AP, Zhao H (2008) A text mining approach to Internet abuse detection. Inf Syst E-Bus Manage 6(4):419–439
D’Aveni RA, MacMillan IC (1990) Crisis and content of managerial communications: a study of the focus of attention of top managers in surviving and failing firms. Adm Sci Q 35:634–657
den Hertog P, van der Aa W, de Jong MW (2010) Capabilities for managing service innovation: towards a conceptual framework. J Serv Manage 21(4):490–514
Forsman H, Temel S (2011) Innovation and business performance in small enterprises: an enterprise-level analysis. Int J Innov Manage 15(3):641–665
Gale D, Shapley LS (1962) College admissions and the stability of marriage. Am Math Mon 69(1):9–15
Gebauer J, Tang Y, Baimai C (2008) User requirements of mobile technology: results from a content analysis of user reviews. Inf Syst E-Bus Manage 6(4):361–384
Gottschalk LA (1995) Content analysis of verbal behavior: new findings and clinical applications. Lawrence Erlbaum Associates, Hillsdale, NJ
Gottschalk LA, Gleser GC (1969) The measurement of psychological states through the content analysis of verbal behavior. University of California Press, Berkeley, CA
Gottschalk LA, Winget CN, Gleser GC (1969) Manual of instructions for using the Gottschalk-Gleser content analysis scales: anxiety, hostility, and social alienation—personal disorganization. University of California Press, Berkeley, CA
He ZL, Wong PK (2004) Exploration versus exploitation: an empirical test of the ambidexterity hypothesis. Organ Sci 15(4):481–494
Kabanoff B, Keegan J (2007) Studying strategic cognition by content analysis of annual reports: a validation involving firm innovation. In: Chapman R (ed) Proceedings of managing our intellectual and social capital: 21st ANZAM 2007 Conference, Sydney, Australia, pp 1–14
Kimbrough MR, Kimbrough SO, Murphy P (2011) On using text analytics for event studies. In: Proceedings of the 2011 international conference on artificial intelligence and law (ICAIL 2011)
Kimbrough SO, Lee TY, Oktem U (2012) On deriving indicators from texts. In: Dolk D, Granat J (eds) Modeling for decision support in network-based services, Lecture Notes in Business Information Processing, vol 42. Springer, Berlin, pp 196–225
Kimbrough SO, MacMillan I, Ranieri J (2007) Process and system for matching products and markets. United States Patent 7,257,568 http://www.uspto.gov
Kimbrough SO, MacMillan I, Ranieri J, Thompson JD (2011) Categorized document bases. United States Patent 7,917,519 http://www.uspto.gov
Krippendorff K (2004) Content analysis: an introduction to its methodology, 2nd edn. Sage Publications, Thousand Oaks, CA
Li H, Cai Z, Graesser AC, Duan Y (2012) A comparative study on English and Chinese word uses with LIWC. In: Proceedings of the twenty-fifth international Florida artificial intelligence research society conference, Association for the Advancement of Artificial Intelligence, pp 238–243
Loewenstein J, Ocasio W, Jones C (2012) Vocabularies and vocabulary structure: a new approach linking categories, practices, and institutions. The Academy of Management Annals. Available online 13 March 2012. doi:10.1080/19416520.2012.660763
Lukas BA, Ferrell O (2000) The effect of market orientation on product innovation. J Acad Mark Sci 28(2):239–247
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge, UK
March JG (1991) Exploration and exploitation in organizational learning. Organ Sci 2:71–87 http://proxy.library.upenn.edu:2054/login.aspx?direct=true&db=epref&AN=OS.B.GA.MARCH.EEOL&site=ehost-live
Mitchell T (1997) Machine learning. McGraw-Hill, New York, NY
Morris R (1994) Computerized content analysis in management research: a demonstration of advantages and limitations. J Manage 20(4):903–931
Muller E, Zenker A (2001) Business services as actors of knowledge transformation: the role of KIBS in regional and national innovation systems. Res Policy 30(9):1501–1516
Neuendorf KA (2002) The content analysis guidebook. Sage Publications, Thousand Oaks, CA
Newman ML, Pennebaker JW, Berry DS, Richards JM (2003) Lying words: predicting deception from linguistic styles. Pers Soc Psychol Bull 29(5):665–675
OECD-EUROSTAT (1997) Proposed guidelines for collecting and interpreting technological innovation data. Oslo Manual, 2nd edn. OECD-EUROSTAT, Paris
Oliveira MD, Murphy P (2009) The leader as the face of a crisis: Philip Morris’ CEO’s speeches during the 1990s. J Public Relat Res 21(4):361–380
Pennebaker JW (2011) The secret life of pronouns: what our words say about us. Bloomsbury Press, New York, NY
Prester J, Bozac MG (2012) Are innovative organizational concepts enough for fostering innovation? Int J Innov Manage 16(1):1250005
Raisch S, Birkinshaw J (2008) Organizational ambidexterity: antecedents, outcomes, and moderators. J Manage 34(3):375–409
Shadish WR, Cook TD, Campbell DT (2001) Experimental and quasi-experimental designs for generalized causal inference, 2nd edn. Wadsworth Publishing, New York, NY
Tausczik YR, Pennebaker JW (2010) The psychological meaning of words: LIWC and computerized text analysis methods. J Lang Soc Psychol 29(1):24–54
Turney PD, Pantel P (2010) From frequency to meaning: vector space models of semantics. J Artif Intell Res 37:141–188
Uotila J, Maula M, Keil T, Zahra SA (2009) Exploration, exploitation, and financial performance: analysis of S&P 500 corporations. Strateg Manage J 30(2):221–231
Vagnani G (2012) Exploration and long-run organizational performance: the moderating role of technological interdependence. J Manage (forthcoming). Published online 6 December 2012 at http://jom.sagepub.com/content/early/2012/12/04/0149206312466146.
Walter F, Battiston S, Yildirim M, Schweitzer F (2012) Moving recommender systems from on-line commerce to retail stores. Inf Syst E-Bus Manage 10:367–393. doi:10.1007/s10257-011-0170-8
Wang HY, Liao C, Kao CH (2012) A credit assessment mechanism for wireless telecommunication debt collection: an empirical study. Inf Syst E-Bus Manage 1–19. doi:10.1007/s10257-012-0192-x
Weber RP (1990) Basic content analysis, 2nd edn. Sage Publications, Newbury Park, CA
Wei CP, Chen YM, Yang CS, Yang C (2010) Understanding what concerns consumers: a semantic approach to product feature extraction from consumer reviews. Inf Syst E-Bus Manage 8:149–167. doi:10.1007/s10257-009-0113-9
Wei CP, Lin YT, Yang CC (2011) Cross-lingual text categorization: conquering language boundaries in globalized environments. Inf Process Manage 47(5):786–804
Yang HC, Hsiao HW, Lee CH (2011) Multilingual document mining and navigation using self-organizing maps. Inf Process Manage 47(5):647–666
Yen CC, Chi DJ, Lin SJ (2008) A study for detecting enterprise financial statement fraud. Asian J Manag Humanit Sci 3(1–4):15–30. In Chinese. http://www.asia.edu.tw/ajmhs/vol%203/02.pdf
Yen CC, Lo LK, Chi DJ, Huang YJ (2009) The integrated methodology of classification and regression trees and random forest for information disclosure prediction: consideration of corporate governance indicator. In: Sixth conferences on operations research society of Taiwan. In Chinese. http://edoc.ypu.edu.tw:8080/paper/antai/2009%E5%B9%B4--%E7%AC%AC%E5%85%AD%E5%B1%86%E5%8F%B0%E7%81%A3%E4%BD%9C%E6%A5%AD%E7%A0%94%E7%A9%B6%E5%AD%B8%E6%9C%83%E7%90%86%E8%AB%96%E8%88%87%E5%AF%A6%E5%8B%99%E5%AD%B8%E8%A1%93%E7%A0%94%E8%A8%8E%E6%9C%83/(42)%E6%95%B4%E5%90%88%E5%88%86%E9%A1%9E%E8%BF%B4%E6%AD%B8%E6%A8%B9%E8%88%87%E9%9A%A8%E6%A9%9F%E6%A3%AE%E6%9E%97%E6%96%BC%E8%B3%87%E8%A8%8A%E6%8F%AD%E9%9C%B2%E9%A0%90%E6%B8%AC%E4%B9%8B%E7%A0%94%E7%A9%B6.pdf
Acknowledgments
We thank several people at KSRI (the Karlsruhe Service Research Institute), and the institute's general environment, for discussions that helped to clarify the EMCUT concept and its use in applications pertaining to framework validation and to assessments of well-being. In particular we thank Niels Feldman, Margeret Hall, and Marc Kohler. We also thank the anonymous referees and the handling editor for constructive comments that improved the clarity of the paper. Finally, we gratefully acknowledge financial support for this research from the National Science Council of Taiwan (award NSC 101-2410-H-259-079-).
Cite this article
Kimbrough, S.O., Chou, C., Chen, YT. et al. On developing indicators with text analytics: exploring concept vectors applied to English and Chinese texts. Inf Syst E-Bus Manage 12, 385–415 (2014). https://doi.org/10.1007/s10257-013-0228-x
DOI: https://doi.org/10.1007/s10257-013-0228-x