1 Introduction

We are drowning in data. Ninety percent of the world’s data today has been created in the last 2 years alone.Footnote 1 Most of it is unstructured text, images, and videos, which are hard for human beings to categorize, let alone understand.Footnote 2 There are sensor data from (self-driving) cars and from smart home and office equipment, social media data, mobile data, data on Internet and browsing behavior, and digital camera images, to name just a few. This explosion of data is accompanied by tremendous progress in data science methods, which can make sense of all the available information. Those methods are fueled by artificial intelligence (AI). And this may just be the beginning (Taddy 2018). The McKinsey Global Institute recently projected that the adoption of AI by firms may follow an S-curve pattern: a slow start given the investment associated with learning and deploying the technology and then acceleration driven by competition and improvements in complementary capabilities (Bughin et al. 2018). At the macro level, they expect that AI could potentially deliver additional economic output of around US$13 trillion by 2030, boosting global GDP by about 1.2% a year. The increased output from efficiency gains and innovations could be passed to workers in the form of wages and to entrepreneurs and firms in the form of profits.Footnote 3

These rapid and ongoing changes in the economic, political, and social spheres also affect the domain of research. Massively improved AI offers better, i.e., cheaper, prediction (Agrawal et al. 2018). Improved prediction capabilities allow us to work with huge data sets that are representative of entire populations, simply because they contain nearly complete data on that population (see Section 4 for an example). Even more, the statistical methods relying on AI (data science methods) allow us to tackle novel types of questions, such as the following: How can we study the role of geographic and social proximity for entrepreneurial interactions by using huge social media data sets, e.g., from Twitter, instead of traditional case studies? How can we classify the personalities of more than 1000 CEOs, identify the entrepreneurial ones, and study to what extent being entrepreneurial has positive effects on firm performance? To what degree are entrepreneurial skills and personality traits helpful for workers in all kinds of sectors and jobs? Crucially, these questions, if they could be asked at all, could not be seriously studied, let alone answered, by the traditional empirical methods that have been taught in graduate schools in economics and management in the past decades.

In their seminal article, Shane and Venkataraman (2000) defined entrepreneurship as the identification, evaluation, and exploitation of opportunities. Shane (2012) underlined that entrepreneurship is a process, not a one-time event. The questions listed above relate to opportunities initiated by very recent technological progress, which by itself is an ongoing process. Thus, the very object of entrepreneurship research changes along the development of the technological frontier. Today, due to the availability of much more data and computer power, this frontier is shaped strongly by the state of data science techniques. Today, we can analyze and interpret large amounts of complex and unstructured data and make predictions based on correlations and inductive modeling.

Researchers can benefit by understanding and, where appropriate, embracing statistical methods that are driven by AI algorithms. This process has already started and has had disruptive effects on the social sciences, such as economics (Einav and Levin 2014) and management (George et al. 2014). It has created the new field of computational social science, which may reveal new patterns of individual and group behavior and allow researchers to model economic and social interactions more precisely (Lazer et al. 2009).

We contribute to the entrepreneurship literature in two dimensions. First, in the next section, we describe the most prominent data science methods suitable for entrepreneurship research. The goal is to give the interested reader a concise overview of what is technically possible today, with enough input and references to start educating oneself. Section 2 is complemented by the Appendix, where we provide links to literature and Internet resources, delineate key technical terms, and list the most relevant text mining tools and download resources for self-starters. Our second contribution comes in Sections 3 and 4. Section 3 surveys how data science methods have been applied in the entrepreneurship research literature and sketches how they have been used to study important research questions that could not be studied, or not to the same extent, without these techniques. Along these lines, in Section 4, we provide an original analysis of a data set with 7.7 million data points and study the dynamics of demand for entrepreneurial skills in the Dutch population. In Section 5, we conclude by discussing opportunities and risks of data science techniques and relate them to traditional empirical research methods and theory.

2 Data science methods for entrepreneurship research

2.1 Background

In conventional statistical research, you start with the formulation and testing of hypotheses with the help of data, assuming that the data are generated by a given stochastic data model. In data science, by contrast, you churn large volumes of data looking for patterns by using algorithmic models and treating the data mechanism as unknown.Footnote 4 Thus, data science “not only provides new tools, it solves a different problem” (Mullainathan and Spiess 2017, p. 88) and is able to discover complex structures that were not specified in advance (Breiman 2001). In other words, whereas conventional statistics is deductive, data science is inductive: the approaches are complementary.

Data science relies heavily on computational power and computer science to derive knowledge from the unprecedented, exponentially growing, complex, and unstructured data, the so-called big data. By making software autonomous or using iterative feedback to discover associations in data, we can find generalizable patterns and anomalies. Thus, instead of teaching machines to do things, the goal of data science is to design them to “think” for themselves and then allow them access to the mass of available data so they could learn. Moreover, while the human brain can associate two or three dimensions of information with each other, algorithms allow hundreds of dimensions. This leads to a system searching for much more fine-grained associations, clusters, and classifications, extracting meaningful information from the data. As a next step, an understandable structure can be developed to facilitate data-driven decision-making.

Due to the nature of “big data” and the complexity of the algorithms used, data science often requires special ways of data storage, accessibility, and processing. Analyses are often done by using multiple computers and multiple calculation units, the so-called high-performance computing, for instance, Hadoop clusters and Spark Streaming, or parallel virtual environments.Footnote 5 Usually the basic steps for analysis include writing an algorithm, setting up an automated process (script), and linking it with open data protocols and application programming interfaces (APIs). Collecting large amounts of unstructured information often generates a complex information set. With the help of visualization techniques and tools, such as chord charts and network graphs, we can observe clusters within that information and present results of data analyses.

Of course, traditional data sources such as surveys and large administrative data sets (old data) can be analyzed and interpreted with the help of data science techniques, too. The computational power of these techniques allows for a much broader and more varied search of existing data, which may reveal new patterns and insights even in traditional data sources. A notable example is the use of machine learning techniques on the huge United States Patent and Trademark Office (USPTO) database. Various papers have shown that these methods can improve inventor disambiguation in this database and, thereby, contribute to a more accurate understanding of inventor careers (Li et al. 2014; Ventura et al. 2015). Machine learning algorithms can not only match patents more correctly to inventors; they can also include more information from other useful data sources, for example co-authorships, collaboration variables, and geographic location. Based on this information, “large-scale innovation studies across time and space with visualization of inventor mobility across the United States” (Li et al. 2014, p. 941) are possible with much lower error rates than before. Similarly, disambiguation approaches based on machine learning are more consistent across contexts as they can cope better with varying features and detect the best features automatically and more precisely (Ventura et al. 2015).

2.2 Key data science methods

A multitude of different tools and techniques are available, of which we highlight the most interesting ones for entrepreneurship research. In general, Python, currently the fastest growing (general purpose) programming language, features a large range of very effective scripts and open-source libraries for these tasks.

2.3 Machine learning

Within the field of data science, machine learning (ML) is an advanced field of research dealing with the techniques that teach computers to learn without being programmed explicitly (Samuel 1959). ML is not a synonym for AI, though; it is technically a branch of AI. AI, in fact, is a much broader concept, in which machines mimic cognitive functions of learning and problem solving. Therefore, AI algorithms and machines are able to adapt to different situations and to carry out tasks in a way that we would consider “smart” or “intelligent,” that is, with human-like cognitive functions (OECD 2017; Taddy 2018).

Within ML, the two most important categories are supervised learning and unsupervised learning. Supervised ML is the name of a set of advanced algorithms that use information from known results, the so-called labels, to optimize predictions. Technically, in a supervised learning task a computer learns a relation between some observed input (usually a vector of many predictors) and some desired output (one outcome variable of interest) (Hastie et al. 2009). A supervised learning algorithm analyzes the labeled training data and produces an inferred function to map novel (test) data. Supervised learning helps to predict unseen patterns and to understand which input best predicts the outcome to assess the quality of previously tested predictions/inferences. Therefore, it also serves to reduce the “curse of dimensionality,” for example by using an algorithm for dimensionality reduction such as principal component analysis, where variables that are meaningless in explaining a desired target variable or are possibly correlated, are eliminated by the statistical procedure of orthogonal transformation.
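The mapping from labeled training data to predictions on novel test data can be illustrated with a minimal sketch. It uses a one-nearest-neighbor classifier, chosen here only because it is short; the features, labels, and data values are hypothetical and serve purely as an illustration.

```python
import math

def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training observation closest to x."""
    best_label, best_dist = None, math.inf
    for features, label in zip(train_X, train_y):
        dist = math.dist(features, x)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical labeled training data: two numeric predictors per person
train_X = [(1.0, 2.0), (2.0, 1.5), (1.5, 1.0), (6.0, 7.0), (7.0, 6.5), (6.5, 8.0)]
train_y = ["employee", "employee", "employee", "founder", "founder", "founder"]

# Held-out test data whose labels are known but not shown to the learner
test_X = [(1.2, 1.8), (6.8, 7.2), (2.2, 2.0), (5.9, 6.4)]
test_y = ["employee", "founder", "employee", "founder"]

predictions = [nearest_neighbor_predict(train_X, train_y, x) for x in test_X]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy)
```

Comparing predictions with the held-out labels, rather than with the training labels, is what gives an honest estimate of how well the learned relation generalizes.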

Depending on the type of data, one can choose between regression and classification techniques within supervised ML. If one has to predict continuous values, regression techniques are the way to go, while classification techniques are used in discrete settings; they identify which of a set of categories (classes) a new observation belongs to. An easy-to-interpret and widely used classification method is the decision tree. Starting from the root, the training observations are split into two subgroups that are as homogeneous as possible internally and, hence, as different as possible from each other. At each node, the algorithm examines which variable best splits the observations into two new nodes. In this way, the data are split up further and further until a stop criterion is met (for example, fewer than n training observations per node). Depending on the values of the variables, each observation ultimately falls into one class (i.e., a single leaf). The results of a decision tree can be interpreted and graphically displayed relatively easily. However, decision trees are prone to instability: a relatively small change in the data can result in a different tree. As a consequence, a decision tree can have a large “generalization error,” a phenomenon that is also called “overfitting the data,” meaning that it can contain nodes that have been created by specific cases in the training data set, making the model poorly generalizable to other data. The model can then rationalize the outcomes in the given training data almost perfectly but predicts poorly for observations that were not used for training.Footnote 6
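The search performed at a single node of a decision tree can be sketched as follows. The example assumes Gini impurity as the splitting criterion, which is one common choice among several, and uses hypothetical data; a full tree would apply the same search recursively to each subgroup until the stop criterion is met.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (feature index, threshold, weighted impurity) of the best split."""
    best = (None, None, float("inf"))
    n = len(rows)
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            if not left or not right:
                continue  # a split must create two non-empty subgroups
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, threshold, score)
    return best

# Hypothetical observations: (age, has prior start-up experience: 0 or 1)
rows = [(25, 0), (30, 1), (35, 0), (40, 1), (45, 1), (50, 0)]
labels = ["no", "yes", "no", "yes", "yes", "no"]

print(best_split(rows, labels))
```

In this toy data set the second variable separates the two classes perfectly, so the search selects it and reports a weighted impurity of zero.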

Deep learning (DL) is a special class of ML algorithms that is frequently used for feature extraction from complex, multidimensional data such as images. For instance, Google uses DL to automatically suggest the next word(s) of a search term when one has started typing a word. DL uses the so-called (artificial) neural networks, which allow computers to more closely mimic human brains while still being faster, more accurate and less biased. Neural networks are especially suited for deriving patterns from (highly) non-linear processes. Depending on the form of the model underlying the DL algorithm, a neural network falls either within the category of supervised learning or within unsupervised learning, which we will explain below.

In a pioneering example, Tan and Koh (1996) trained a neural network based on information from psychological, demographic, and family characteristics to predict entrepreneurial inclination. Results from a survey administered among 200 business undergraduates served as training and testing data to model entrepreneurial inclination in an individual. Then, the neural network predicted inclination in any other person based on knowledge of the imputed social and psychological correlates. In this early case, the ML algorithm had an accuracy of 80% for predicting entrepreneurial inclination in individuals not encountered before.

Machine learning can also be performed unsupervised. Then it is used to learn and establish baseline profiles for different entities. In unsupervised ML, “natural” groups or clusters of observations are made, whereby observations that are “equal” or “close” to each other, belong to the same group. This allows trends and patterns in data to be properly mapped out, for instance, when customers have to be grouped into different segments based on their characteristics so that services can be tailored individually (Alsayat and El-Sayed 2016). In this type of cluster analysis, it is necessary to optimize the number of clusters and to thoroughly investigate the stability of the clusters. The latter can be done by adding noise or using multiple algorithms to check whether a certain change in data gives rise to a new cluster.
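The grouping of “close” observations and a simple stability check can be sketched as follows. The sketch uses k-means (Lloyd's algorithm) with fixed starting centroids so that it is reproducible; the data points, the noise level, and the choice of two clusters are assumptions made for illustration only.

```python
import math
import random

def k_means(points, centroids, iterations=20):
    """Alternate between assigning points and updating centroids."""
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return assignment, centroids

# Hypothetical observations with two numeric characteristics each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignment, centroids = k_means(points, [points[0], points[3]])

# Stability check: perturb the data slightly and cluster again
rng = random.Random(0)
noisy = [(x + rng.uniform(-0.1, 0.1), y + rng.uniform(-0.1, 0.1)) for x, y in points]
noisy_assignment, _ = k_means(noisy, [noisy[0], noisy[3]])

print(assignment, assignment == noisy_assignment)
```

If small perturbations changed the cluster memberships, this would be a sign that the chosen number of clusters or the clusters themselves are not stable.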

For the clustering, a distance metric is often used as an (in)equality score. This can be the Euclidean distance or another distance function. Whereas most distances can only be used for numerical and complete data, Gower’s distance function can deal with both categorical and missing data. The more data there are, the more computationally expensive it becomes to run an algorithm that minimizes distance. Frequently used algorithms are K-means and K-modes, which measure the distance of each observation to a cluster centroid (or mode) rather than between all pairs of observations and are, thereby, less computationally expensive.
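How a distance can handle mixed and incomplete records may be easier to see in a small sketch of Gower's distance. It assumes the common unweighted form: numeric differences are scaled by the observed range of the variable, categorical variables contribute 0 or 1, and variables missing in either record are skipped. The firm records and ranges below are hypothetical.

```python
def gower_distance(a, b, ranges):
    """Gower distance between two records with mixed, possibly missing values.

    ranges[j] is the observed range of numeric variable j,
    or None if variable j is categorical. Missing values are None.
    """
    total, compared = 0.0, 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue  # skip variables missing in either record
        if r is None:
            total += 0.0 if x == y else 1.0  # categorical: match or mismatch
        else:
            total += abs(x - y) / r  # numeric: difference scaled to [0, 1]
        compared += 1
    if compared == 0:
        raise ValueError("records share no observed variables")
    return total / compared

# Hypothetical firm records: (firm age in years, sector, number of employees)
ranges = (20, None, 100)
firm_a = (5, "retail", 10)
firm_b = (15, "tech", None)   # number of employees is missing
firm_c = (5, "retail", 60)

print(gower_distance(firm_a, firm_b, ranges))
print(gower_distance(firm_a, firm_c, ranges))
```

Because every contribution lies between 0 and 1, the resulting distance is comparable across pairs of records even when they share different numbers of observed variables.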

A very early example of unsupervised learning using a neural network for clustering is Rutherford et al. (2001). This paper uses a so-called self-organizing map (SOM) approach to study the relation of firm size with firm success and survival. Using information from the National Survey of Small Business Finances (NSSBF), Rutherford et al. (2001) classify small firms (having fewer than 500 employees) into multiple groups based on size and ownership as well as firm characteristics. The 4637 small firms in their sample cluster naturally into two distinct groups: a larger group of 3311 very small firms and a smaller group of 1326 larger (but still small) firms. Given that these two groups differ significantly on other background characteristics, this early paper provides evidence that differences in firm size and structure matter for predicting antecedents of firm survival and success.

Yet another category is reinforcement learning (RL), which differs from standard supervised learning because correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected. Thus, in reinforcement learning, there is no labeled correct answer. Instead, the reinforcement agent decides how to perform the given task. The only feedback given to the algorithm comes in the form of rewards and punishments. In the absence of labeled training data, the agent is bound to learn from its own experience. Calvano et al. (2018) use RL to experiment with AI pricing agents interacting repeatedly in a controlled environment (computer-simulated marketplaces). Their algorithmic price-setting experiments show that, when replacing human decision-making, even relatively simple pricing algorithms systematically learn to play sophisticated collusive strategies without communicating with each other at all.
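Learning from rewards alone can be illustrated with a deliberately simple sketch: a single pricing agent that repeatedly picks a price, observes only its profit, and updates a value estimate for each price with an epsilon-greedy rule. This is not the multi-agent setup of Calvano et al. (2018); the demand function, cost, price grid, and learning parameters are all hypothetical.

```python
import random

def profit(price):
    """Hypothetical market: linear demand and a unit cost of 1."""
    demand = max(0.0, 10.0 - 2.0 * price)
    return (price - 1.0) * demand

def learn_price(prices, episodes=5000, alpha=0.1, epsilon=0.1, seed=0):
    """Estimate the value of each price from observed rewards only."""
    rng = random.Random(seed)
    q = [0.0] * len(prices)  # one value estimate per available price
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.randrange(len(prices))  # explore a random price
        else:
            action = max(range(len(prices)), key=lambda a: q[a])  # exploit
        reward = profit(prices[action])
        q[action] += alpha * (reward - q[action])  # move estimate toward reward
    return q

prices = [1.5, 2.0, 3.0, 4.0, 4.5]
q_values = learn_price(prices)
best_price = prices[max(range(len(prices)), key=lambda a: q_values[a])]
print(best_price)
```

The agent is never told which price is optimal; it discovers the most profitable price on the grid purely through trial, error, and the rewards it receives.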

2.4 Text analytics and web data scraping

In addition, data science methods are also suitable for obtaining information from unstructured data, often scraped from the Internet. This is very useful because about 80% of big data is available in unstructured text form, for example in blogs, websites, and social media (Cogburn and Hine 2017). This way, all data sources that relate to natural language can be used, such as open answers, text files, notes from customer contacts, reports or e-mails. There are several useful tools and techniques for handling text, semantic, and social data to extract valuable information from these sources. Here, we describe how and what we can infer from the data and discuss useful techniques for mining and analyzing text data to discover interesting patterns, extract useful knowledge, and support decision-making. Even more information can be found in Section 4.

In addition, Internet log files and the metadata of search engines can provide interesting information about trends over time. Search engines register when and where a search query was performed in their search logs and process this information for the answers provided to subsequent related search queries.Footnote 7 The numbers of searches on certain topics and the presented order of search results often show interesting patterns, which Google Trends makes use of, for instance. In a recent book, Stephens-Davidowitz (2017) presents research that uses different kinds of Internet data: Google Trends, online search data, information on views and clicks, and even patterns of swipes in mobile apps. A famous example is what happened after Facebook introduced the “News Feed” in 2006. With this function, users would get automated updates of the activities of all their friends. It provoked immediate fierce protests from nearly a million users, but Facebook did not remove the News Feed. The company had what Stephens-Davidowitz calls the “digital truth serum” (Stephens-Davidowitz 2017, p. 154): numbers on clicks and visits increased tremendously after the introduction of the News Feed. In his book, the author provides many more examples of how to use Internet data to derive new insights into human nature and behavior, especially for sensitive issues such as sexual orientation, sexism, customers’ revealed preferences, and stereotypes.

Mining, clustering, and analyzing these unstructured data sources requires the use of analytical techniques for natural language. This so-called natural language processing (NLP) can be performed in different programming languages, for example, Python or R, and researchers can use well-established packages and toolboxes. Sentiment analysis, for instance, can extract subjective information from language, while topic modeling can discover the abstract “topics” in a collection of documents. Other techniques, such as named entity recognition (NER) or Part-of-Speech (POS) tagging, recognize entities such as organizations, people, locations, dates, time, or currency (NER) or word types such as verb, noun (POS) in text. Box A2 in the Appendix lists the most common concepts and tools in general and Section 4 exemplifies the steps one has to take when working with text data from online sources.
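To give a flavor of how sentiment analysis turns text into a number, the following sketch scores short texts against a tiny word list. It is a lexicon-based approach in its most basic form; the word lists and example sentences are hypothetical, and the established packages mentioned above use far richer lexicons or trained models.

```python
import re

# Hypothetical miniature sentiment lexicon
POSITIVE = {"great", "love", "excellent", "exciting", "good"}
NEGATIVE = {"bad", "late", "disappointed", "poor", "broken"}

def sentiment_score(text):
    """Return a score in [-1, 1]: share of positive minus negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())  # simple tokenization
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos + neg == 0:
        return 0.0  # no sentiment-bearing words found
    return (pos - neg) / (pos + neg)

comments = [
    "Great idea, love the design!",
    "Shipping was late and the product arrived broken.",
    "Good concept but poor execution",
    "The campaign ends on Friday.",
]
scores = [sentiment_score(c) for c in comments]
print(scores)
```

The result is a continuous measure per text, which can then be aggregated over many comments or tracked over time.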

3 Applying data science to entrepreneurship research

In this section, we highlight some recent papers using data science methods for research questions on various aspects of entrepreneurial characteristics, processes, and entrepreneurship (success). ML and/or text analytics have been applied to issues such as funding (via venture capital and via crowdfunding), (product) innovation, inventors’ disambiguation, and entrepreneurial traits. The fundamental contributions of these studies fall in two categories: the utilization of new sources of information and data, advancing the data frontier, and applying novel techniques to existing data and/or problems, thereby advancing the knowledge frontier. We now take them in order.

A classical question in entrepreneurship research relates to the factors predicting a start-up’s success (Stuart and Abetti 1987; Hisrich et al. 2007). Today, the role of online and social media communication and information for the development, identification, and success of entrepreneurial activities and agents has received a lot of attention. To arrive at deeper, richer, and more fine-grained insights on the entrepreneurial mindset, the so-called digital footprints from social media are increasingly used. Lee et al. (2017) measure overconfidence of CEOs by classifying their messages sent on Twitter. They distinguish “professional CEOs” and “founder CEOs” and find that the latter use more optimistic language on Twitter and during earnings conference calls. Founder CEOs are also more likely to issue earnings forecasts that are too high.

Aggarwal and Singh (2013) show that social media can also be used as a means to an end for entrepreneurial success. They study company blogs across multiple stages of venture capitalists’ decision-making and find that blogging can help managers in getting their products and services selected at the screening stage, but that, beyond that, blogging does not help directly. The authors show that blogs can help indirectly in the last stage of the venture capital process when negotiating a contract with the venture capitalist: blogs (with good coverage) attract the attention of competing venture capitalists, which drives up venture prices, and hence improves the blogger’s outside option.

Since the success of the managerial “upper echelons” perspective, it is rather undisputed that the individual characteristics and values of decision makers have a significant impact on the performance of firms (Andrews 1980; Hambrick and Mason 1984). A key question is how certain managerial characteristics translate into better performance. A popular empirical approach to this question has been to measure actual behavior of decision makers through real-time personal observation (Mintzberg 1973). This time-consuming procedure, however, creates the problem of small sample sizes and suffers from selection issues.

Bandiera et al. (2017) tackle the issue by developing new methodology: First, via daily phone calls with 1114 CEOs or their assistants, they collected 42,233 data points about the decision makers’ diaries. Then they employed an unsupervised learning algorithm (a latent Dirichlet allocation, LDA), which provides them with a complete probabilistic description of time-use patterns, despite the high dimensionality of their data set. The algorithm posits that the actual behavior of each CEO is a mixture of a small number of “pure” behaviors and that the creation of each activity is attributable to one of these pure behaviors. In their case, the algorithm finds two “pure” behaviors and generates a one-dimensional behavior index that represents a CEO as a convex combination of the two pure behaviors. Following Kotter (1999), they classify the first pure behavior as “manager” and the second as “leader:” “manager” refers to more time of the CEO spent in meetings with production-level workers and one-to-one meetings with firm employees or suppliers; “leader” refers to more time spent with top-executives and in interactions with several participants and functions from inside and outside the firm together. Kotter associated “managers” with a focus on monitoring and implementation tasks, whereas “leaders” focus on the creation of organizational alignment and communication across a broad variety of characteristics. Clearly, the characterization of “leaders” is related to the characterization of entrepreneurial skills in the entrepreneurship literature (see Section 4).
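The mixture structure described above can be written compactly. The notation below is our own shorthand and not necessarily the one used by Bandiera et al. (2017): $\beta^{0}$ and $\beta^{1}$ denote the two pure behaviors, each a probability distribution over activity types $x$, and $\theta_i$ denotes the behavior index of CEO $i$.

```latex
\Pr(x \mid \text{CEO } i) \;=\; (1 - \theta_i)\,\beta^{0}_{x} \;+\; \theta_i\,\beta^{1}_{x},
\qquad \theta_i \in [0, 1],
\qquad \sum_{x} \beta^{k}_{x} = 1 \ \text{ for } k \in \{0, 1\}.
```

Under this labeling, $\theta_i = 0$ corresponds to a CEO who follows the first pure behavior (“manager”) only, $\theta_i = 1$ to one who follows the second (“leader”) only, and intermediate values describe a convex combination of the two.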

As a final step, Bandiera et al. (2017) correlate their managerial behavior index with firms’ balance sheet data and find that “leader” CEOs are more likely to be found in larger and more productive firms: an increase of the behavior index by one standard deviation is associated with an increase of 7% in sales, controlling for a battery of factors. This suggests that decision makers with entrepreneurial characteristics can also do well in more established organizations. More important for the study at hand, Bandiera et al. (2017) show an innovative way to use data science techniques to give a more robust answer to an existing question involving personal characteristics of decision makers. There is a lot of scope to apply this approach to a host of questions in the entrepreneurship literature.

Obschonka et al. (2017a) use Twitter data to identify the personality traits of superstar entrepreneurs and compare them to the characteristics of superstar managers, “a hitherto understudied population in entrepreneurship research” (p. 14). To do this, the authors use a sample of 106 Twitter accounts of (superstar) entrepreneurs and managers. They analyze information from these accounts by using a novel language-based personality assessment tool that is capable of dealing with the huge number of observations from social media data.Footnote 8 Up to now, traditional, survey-based methods, such as a standard Big Five questionnaire, have been used to assess an individual’s personality traits. In contrast to these subjective and self-reported measures, digital footprints, where individuals willingly and unwillingly spread (personal) information to a large and diverse audience, can be used to derive objective and accurate information, revealing individuals’ true preferences. Obschonka et al. (2017a) show that this new tool delivers valid results for univariate and multivariate analyses of personality differences between (superstar) entrepreneurs and (superstar) managers and that, surprisingly and contrary to earlier findings, the latter category shows more entrepreneurial characteristics than the former one.Footnote 9

Tata et al. (2017) use Twitter data to arrive at the “psycholinguistics of entrepreneurship” and demonstrate that, even though entrepreneurs are fundamentally different from the general population, the organizational life cycle also matters for the emotions and sentiments attached to entrepreneurship and to the work-life balance in general. The use of language as a means for revealing individuals’ (work-life) concerns, motives, traits, and emotions is not new to the field. For instance, Tausczik and Pennebaker (2010) have shown that language is a robust means for revealing individuals’ work-life concerns and emotions. “Entrepreneurial emotion” is a topic in itself and describes a package of feelings that often come with being an entrepreneur (Cardon et al. 2012), a topic that has gained increased importance through big data and AI as enablers of new self-employed businesses: “Approximately 150 million workers in North America and Western Europe have left the relatively stable confines of organizational life, sometimes by choice, sometimes not, to work as independent contractors” (Petriglieri et al. 2018). By using Twitter data for these analyses, Tata et al. (2017) are able to overcome several limitations of traditional data sources, such as surveys. Social media data can not only avoid response and recall biases; they also offer a real-time window into people’s thoughts over long periods, for more actors than any existing alternative, at any point in time, and across diverse geographical locations. Moreover, content analysis of Twitter data allows collecting information on emotions, constructs, and concerns simultaneously.

Wang et al. (2017) use Twitter data for yet another type of entrepreneurship research. They apply social network analysis to entrepreneurial networks in the USA to identify and locate entrepreneurs jointly with important regional subtleties within the network. They find that although Twitter enables interactions across geographically (and socially) distant locations, the highest intensity can be detected in regional interactions characterized by similar socioeconomic and demographic profiles. This suggests that, even in our digitally connected world, geographic and social proximity are important for entrepreneurial interactions. Hence, earlier results about the important role of social relationships for entrepreneurship are still valid; see the work of Olav Sorenson (Rickne et al. 2018). For instance, Sorenson (2018) shows that both professional and private social relationships are original reasons for industry concentrations in a small number of places, even when firms do not benefit from this clustering. Wang et al. (2017) extend the research on the relevance of networks in various ways: they simultaneously examine the types of actors engaged in digital networks and the specific regions that are active in the Twitter entrepreneurship domain. Moreover, they analyze the regional characteristics that explain the intensity of activity on this social media platform. The use of big data allows for social network analyses on a much larger scale than is possible with data from primary survey collection efforts. Thereby, Wang et al. (2017) offer a seminal example of bringing data science into the entrepreneurial social networks literature, which has mostly been dominated by case studies (Greve and Salaff 2003). They also demonstrate how the incorporation of additional quantitative and qualitative information can mitigate issues of representativeness inherent in social media data.
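Two basic quantities in this kind of social network analysis, how connected each account is and how much interaction stays within a region, can be sketched as follows. The accounts, regions, and interactions are hypothetical and the measures are deliberately simple; they are not the specific metrics used by Wang et al. (2017).

```python
from collections import Counter

# Hypothetical directed interactions (e.g., mentions) between accounts
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d"), ("d", "c"), ("b", "d")]
# Hypothetical home region of each account
region = {"a": "Boston", "b": "Boston", "c": "Austin", "d": "Austin"}

def degree(edges):
    """Number of interactions each account takes part in (sent or received)."""
    counts = Counter()
    for sender, receiver in edges:
        counts[sender] += 1
        counts[receiver] += 1
    return counts

def within_region_share(edges, region):
    """Share of interactions whose sender and receiver are in the same region."""
    same = sum(region[sender] == region[receiver] for sender, receiver in edges)
    return same / len(edges)

print(degree(edges))
print(within_region_share(edges, region))
```

A within-region share well above what random matching across regions would produce is the kind of pattern that points to the continued importance of geographic proximity.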

Data science methods have also been applied to assess the performance of crowdsourcing and crowdfunding platforms. Crowdsourcing taps into a talented crowd to get a large number of business projects or tasks solved by dividing them into microtasks. These microprojects, assigned to a skilled on-demand workforce, provide many business opportunities but also raise interesting research questions. Crowdfunding, on the other hand, is a unique form of entrepreneurial finance that combines elements of private and public equity (Cummings et al. 2019). It taps into the power of the crowd to acquire financial support. Platforms match individuals or entities in need of funding with individuals or groups willing to contribute financially, often in the form of microfunding.Footnote 10 Apart from being interesting sources of big data, these platforms themselves can be viewed as big data and datafication phenomena resulting from the ongoing digitization and automation. Both innovative developments use mass collaboration, mostly via online tools, to accomplish certain goals, for example the funding of an idea or a project. There is serious money involved in crowdfunding;Footnote 11 therefore, reliable predictions on the success rates of these products or projects are important. Obviously, big data and data science methods also play an important role for the internal business processes at a crowdfunding or crowdsourcing platform. With social media promotions, statistics on earlier projects, market dynamics, and other activities, a huge amount of data is generated, which can be used to predict the success of ideas or products based on past analytics results.Footnote 12

Hoornaert et al. (2017) build an ML model to predict the success and failure of business and product ideas generated within the crowd based on the 3C’s of an idea: its content, the contributor proposing it, and the crowd’s feedback on it. A non-linear, supervised algorithm identifies the variables that are most predictive of an idea’s distinctiveness and successful implementation. The authors find that considering immediately available information about the content and contributor improves the ranking performance by around 25% over random idea selection, while adding crowd-related information that accumulates over time further improves performance by up to almost 50%. The last C, crowd feedback, is thus the best predictor, but also the one that needs the most time to develop.

Courtney et al. (2017) use data from Kickstarter, a large and popular crowdfunding portal. They examine the interplay of three signal types obtained from different sources within the platform on the viability of a certain idea: the direct actions a start-up takes regarding a proposed idea and/or product (the content), its characteristics (mainly crowdfunding experience; the contributor), as well as third-party endorsements (sentiments expressed in backer comments; the crowd). For the last type of signal, the authors implement a novel sentiment analysis technique, with which the underlying tone of textual comments by backers can be derived. This allows for a continuous feedback measure of a large and heterogeneous group of individuals commenting on a project, a major improvement on the dichotomous variable that is usually used to measure third-party endorsements (Courtney et al. 2017).

On a higher level, Hartmann et al. (2016) connect data science and entrepreneurship. They derive a taxonomy of business models used by start-up firms that rely on data as a key resource for business, which they call data-driven business models. Their taxonomy consists of six different types of such business models among start-ups and thereby develops a basis for understanding how start-ups build business models that capture value from data as a key resource.

Whereas the above-cited papers use new (big) data sources, Hoberg and Phillips (2016) apply a novel technique, text analysis, to study an existing administrative database.Footnote 13 They use the product descriptions that firms filed with the US Securities and Exchange Commission (SEC) to develop new time-varying industry classifications. These new, more flexible measures of industry membership are better suited to explain differences in key characteristics across industries, such as profitability, sales growth, and market risk. The text-based network classification also helps to identify rival firms. Moreover, these classifications show how industries and their competitors change due to external shocks and how R&D activities and advertising are endogenously adjusted to the behavior of relevant competitors. Hoberg and Phillips (2016) combine two central ideas. First, the product features and bundles a firm offers can be consistently derived from SEC product descriptions, and these descriptions can be used to assign each firm a spatial location, generating a Hotelling-like product location space. Second, text analysis is used to build a network of firms, in which the similarity of each firm to every other firm is calculated by firm-by-firm pairwise word similarity scores using the original product descriptions. Based on these pairwise similarity scores, firms are grouped into industries and the general industry classification can be interpreted as an unrestricted network of firms. There, a firm’s competitors are analogous to a group of friends on social media, with each firm having its own distinctive set of competitors.
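To illustrate the pairwise scoring step, the following minimal Python sketch computes similarity scores between every pair of firms from their product descriptions. The firm names and texts are invented, and the use of cosine similarity on binary word-presence vectors is an assumption made for illustration; it is not a reproduction of the procedure in Hoberg and Phillips (2016).

```python
import math
from itertools import combinations

# Hypothetical product descriptions (invented for illustration only).
descriptions = {
    "firm_a": "cloud software for payroll and accounting",
    "firm_b": "accounting software for small business payroll",
    "firm_c": "offshore drilling equipment and services",
}


def word_set(text):
    """Return the set of distinct lowercase words in a description."""
    return set(text.lower().split())


def cosine_similarity(words_x, words_y):
    """Cosine similarity between two binary word-presence vectors."""
    if not words_x or not words_y:
        return 0.0
    shared = len(words_x & words_y)
    return shared / math.sqrt(len(words_x) * len(words_y))


def pairwise_similarities(texts):
    """Score every unordered pair of firms exactly once."""
    sets = {name: word_set(text) for name, text in texts.items()}
    return {
        (a, b): cosine_similarity(sets[a], sets[b])
        for a, b in combinations(sorted(sets), 2)
    }


scores = pairwise_similarities(descriptions)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

In such a scheme, each firm obtains its own ranked list of most similar firms, which is what makes the resulting classification a network rather than a fixed partition.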

4 A case in point: using NLP to study dynamics of entrepreneurial skill demand in a large population

Complementing the (admittedly selective) broad literature review in the previous section, now we go into some depth. To exemplify the general statements made above, we offer an original analysis of a novel big data set by using various data science methods. Specifically, we study the consequences of the ongoing technological and economic developments on the demand for entrepreneurial skills.

Developing entrepreneurial skills is increasingly seen as important to foster entrepreneurship (Baumol et al. 2007). Several recent articles picked up the call and approached the topic from different disciplinary and methodological angles.Footnote 14 As the purpose of this section is not to review the literature on entrepreneurial skills (for that, see the cited articles) but to exemplify data science methods, we restrict ourselves to two observations. First, there is no generally accepted delineation, let alone definition, of “entrepreneurial skills.” Second, one of the restrictions of existing studies using traditional empirical methods is the small number of available data points: the cited articles report sample sizes of 39, 523, and 1126 subjects, respectively. Consequently, it is hard to draw general, robust lessons that can be applied to contexts other than those studied. An additional characteristic of these articles, which is interrelated with sample size, is that they focus on (would-be) entrepreneurs, which does not allow statements about the importance of and demand for entrepreneurial skills in the general population. Here, we try to alleviate these constraints.

The starting point is that digitization, automation, and the development of new (adaptable) technologies have an increasing impact on the labor market. The boundary between “ICT jobs” and other professions in which ICT-related skills are required is becoming increasingly blurred. Moreover, the specific skills demanded and the tasks that have to be fulfilled in all occupations have changed considerably in recent years (Spitz-Oener 2006).

These changes have led to increased demand for employees with sufficient digital skills in many countries, including the Netherlands (ROA 2017). For employees who, in the longer term, cannot acquire the necessary digital skills through training and retraining, suitable measures and career development paths are required that avoid insufficient qualifications and, eventually, unemployment. In contrast, research has shown that employees can adapt sufficiently to the changes on the labor market and that the negative effects of digitization and automation might be exaggerated, as many jobs may change but also new jobs will be created (Autor 2015; Arntz et al. 2016). Given that innovative capacity is directly related to economic growth, a lack of people with sufficient digital, technical, and ICT skills, in combination with a broader set of the so-called twenty-first century skills, limits innovative capacity (Obschonka et al. 2017b). Alternatively, abundance of these types of employees helps to mitigate the negative effects on innovative capacity and the labor market (Elliott 2017; McAfee and Brynjolfsson 2017).

What are the dynamics in skill demand on the labor market? What are the consequences for different occupations and for employees with different educational backgrounds and different levels of expertise? How do they affect certain types of professions such as managers, ICT professionals, and employees in non-IT/technical jobs? Prüfer et al. (2019) answer these questions by making use of a novel approach of “labor market analytics” in which information from online vacancies, thus from unstructured (big) Internet data, is combined with information from labor market forecasts, that is, with structured data from administrative sources.Footnote 15 Thereby, an innovative and very rich source of information, as well as a unique dataset is created with which the authors analyze the impact of digitization and automation on the labor market in general, on specific economic sectors, on 371 different occupations, and on 3 types of professions. Prüfer et al. (2019) measure the change in skills requirements over time by taking into account digital, technical, and ICT skills compared to general cognitive and non-cognitive skills.

In an original extension to Prüfer et al. (2019), the current section derives insights on the consequences of the ongoing digitization and automation for the dynamics of entrepreneurial skills. We distinguish three types of professions: managers, ICT jobs, and non-ICT jobs. This helps to understand how the requirements have changed over time and among types of professions and, thus, provides insights not only into ongoing skills dynamics but also into the need for additional qualifications and retraining of specific groups.

This approach is not without caveats, either. Vacancy data are not necessarily representative and we do not know who applies for a certain vacancy and who is employed in the end. On the other hand, (online) vacancies give a much more fine-grained and real-time picture of labor market demand. This data source can provide information over a longer period of time, for a larger sample, and across various locations. Moreover, vacancy data are less prone to response and recall bias, which are prevalent in survey data, even more so as it is fairly expensive to place a (clearly visible and widely distributed) vacancy. Finally, vacancy data are much cheaper than other sources of information such as questionnaires within a representative sample or register data that have to be linked from multiple sources.

4.1 Data and methods

4.1.1 Data

We use data from the vacancy database Jobfeed, which is administered by TextKernel, a tech company.Footnote 16 This online job portal contains more than 95% of all vacancies published on the Dutch labor market in the last 10 years. Therefore, it offers a nearly complete—and hence nearly representative—data set of online job ads in the Netherlands. Jobfeed searches the Internet for new vacancies on a daily basis and applies ML algorithms to crawl for vacancies and filter out redundancies. The data mainly contain (unstructured) text, but Jobfeed also extracts structured data such as profession, education, location, and company name.

We use data for a period of 6 years, from January 2012 until December 2017, in total about 7.7 million vacancies. Most of the vacancies are written in Dutch; about 8% are in English. As long as a candidate or job description is available, we use all vacancies in our analyses; this holds for 7.32 million vacancies relating to 371 different occupations. The candidate and job descriptions contain relevant information about required skills, experience, and education. In addition, we use information gathered from multiple sources, including the Occupational Information Network (O*NET), an online database with information about the knowledge, skills, tasks, training, and experience required for a large number of occupations. Another data source is ISCO (International Standard Classification of Occupations; version ISCO-08), a classification of 436 professions supplied by the International Labour Organization (ILO). In the ISCO-08 classification, a profession has a skill level (1 to 4) and is a combination of the nature of the work, the required training, and the required experience. Other sources for skills data we used include the EU skills framework, Stack Overflow, and DBpedia, Wikipedia’s skills database.Footnote 17

4.1.2 Methods

As the collected vacancies from the Internet consist of unstructured text, we apply natural language processing (NLP) techniques. A first step, however, is data pre-processing of the vacancy texts, which helps to improve text mining results. An important pre-processing step is the removal of stop words, such as articles and prepositions. These words often appear in the candidate and job descriptions, but do not describe skills, education, knowledge, or experience. Standard Dutch and English stop word lists exist for this purpose. In addition, we have identified high-frequency words that do not provide information about the profile, for instance “experience” or “knowledge,” and removed all this non-usable information from our dataset.Footnote 18
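The stop word step can be sketched in a few lines of Python. The word lists below are tiny, hand-picked stand-ins (an assumption for illustration); standard Dutch and English stop word lists are much longer, and the list of uninformative high-frequency vacancy words used in the study is not reproduced here.

```python
# Tiny illustrative stop word list (assumption); real lists are far longer.
STOP_WORDS = {
    "the", "a", "an", "of", "in", "with", "and",   # English
    "de", "het", "een", "van", "en", "met",        # Dutch
}

# Frequent vacancy words that carry no skill information (assumed examples).
DOMAIN_NOISE = {"experience", "knowledge", "ervaring", "kennis"}


def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop uninformative tokens."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS and t not in DOMAIN_NOISE]


cleaned = remove_stop_words(
    "Experience with Python and knowledge of the Dutch labor market"
)
print(cleaned)
```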

Moreover, we removed structured fields in the Jobfeed database, such as e-mail addresses, telephone numbers, and links to websites, by using so-called regular expressions, i.e., sequences of characters that define a search pattern. Using, for instance, the popular library re (for regular expression operations) in Python allows us to match or search sequences of characters by checking if a given word or phrase is present in a text.Footnote 19 Hence, it is useful for dictionary-based skill extraction. It is also useful for text cleaning operations by matching the specified sequence of characters, for example website links or e-mail addresses, which we then remove because this type of information is not relevant for our analysis and could even have negative effects (for instance, web links could be incorrectly recognized as HTML skills).
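A minimal sketch of such a cleaning step with Python's re library is shown below. The three patterns are deliberately simple assumptions for illustration and would miss many real-world formats; they are not the expressions used in the study.

```python
import re

# Simplified patterns (assumptions); production patterns need to be more robust.
URL = re.compile(r"(?:https?://|www\.)\S+")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")


def strip_contact_details(text):
    """Replace links, e-mail addresses, and phone numbers with a space."""
    for pattern in (URL, EMAIL, PHONE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


raw = (
    "Apply via jobs@example.com or www.example.com/apply.html, "
    "tel. +31 20 123 4567. HTML skills required."
)
cleaned = strip_contact_details(raw)
print(cleaned)
```

After cleaning, the only remaining occurrence of "html" in the example is the genuine skill mention, so the link no longer risks being counted as an HTML skill.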

A final step is to normalize the text because in unstructured data words can appear in various forms, such as “required,” “require,” and “requiring.” There are also derived words with similar meaning, such as “entrepreneurial,” “entrepreneur,” and “entrepreneurship.” The purpose of text normalization is to reduce inflected and derived forms of a word to a common base form, so that the text arrives at a single canonical form it might not have had before. The form of text normalization that we apply is called stemming, in which the ends of words are chopped off by applying a heuristic process. For the Dutch language, we apply an existing algorithm.Footnote 20
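The idea behind stemming can be illustrated with a toy suffix-stripping function. This is only a sketch: the suffix list is an invented, English-only assumption chosen to cover the examples above, and it is not the existing Dutch stemming algorithm used in the study.

```python
# Toy suffix list (assumption), ordered longest first so that the longest
# matching suffix is removed.
SUFFIXES = sorted(["ship", "ing", "ial", "ed", "e"], key=len, reverse=True)


def toy_stem(word, min_stem=3):
    """Chop off one known suffix if the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for group in (
    ["required", "require", "requiring"],
    ["entrepreneurial", "entrepreneur", "entrepreneurship"],
):
    print({w: toy_stem(w) for w in group})
```

Note that a stem such as "requir" need not be a dictionary word; it only has to be identical for all variants of the same term.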

The specific NLP tool we use for this project is the bag-of-words model. This model helps to retrieve information from an unstructured data source by representing a text as the bag (multiset) of its words, disregarding grammar and word order, but keeping information on the frequency of each word and using it as a feature for training a classifier. To make text suitable for analysis, we transformed it into a vector of numbers that relate to the meaning of each word and how it relates to other words. We then applied a mathematical distance measure to calculate the difference (distance) between all the words in our text fragments.
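As a minimal sketch of the bag-of-words representation, the following Python code turns text fragments into word-count vectors and compares them. The source does not name the distance measure, so cosine distance is used here purely as an assumption, and the example fragments are invented.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as a multiset of its words; word order is discarded."""
    return Counter(text.lower().split())


def cosine_distance(bag_x, bag_y):
    """1 minus the cosine similarity of two word-count vectors."""
    dot = sum(count * bag_y[word] for word, count in bag_x.items())
    norm_x = math.sqrt(sum(c * c for c in bag_x.values()))
    norm_y = math.sqrt(sum(c * c for c in bag_y.values()))
    if norm_x == 0 or norm_y == 0:
        return 1.0
    return 1.0 - dot / (norm_x * norm_y)


ict = bag_of_words("python sql python")
ict_short = bag_of_words("sql python")
care = bag_of_words("nursing care")

print(round(cosine_distance(ict, ict_short), 3))  # small: shared vocabulary
print(cosine_distance(ict, care))                 # no shared words
```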

After the pre-processing steps, we can finally extract all necessary information from our text data. To this end, we categorize the mentioned skills into two unique lists: digital and technical skills, and other skills. Because the vacancies are partly in English, we use both Dutch and English skill labels. We also included as many different forms and expressions of skills as possible based on the frequency of words in the vacancy texts. In addition, to make the extraction process of skills more reliable and robust, the entire list of skills is normalized and divided into two parts: a first list with skills that consist of one character, one word, or an abbreviation, and a second list with skills of more than one word. For both categories, the skills are searched within one vacancy. If the exact skill is found in the text, it is counted; if a skill occurs several times within one vacancy, this counts as one. A unigram model was used for the first category. In this model, the text (candidate and job description) is fragmented word by word. The text is first partly cleaned up, for example by removing brackets and converting everything into lowercase. Also, noise related to line breaks, special characters, and white space is removed, again by using regular expressions. The splitting into words then only needs to happen on a single space, after which the words can be looked up in the list of skills.Footnote 21 For the skills from the second category, skills with more than one word (the so-called bigrams or trigrams), we used regular expressions to match the skills after having done the necessary cleaning.
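The two-list extraction logic described above can be sketched as follows. The skill lists and the cleaning rules are hypothetical examples, not the lists or expressions used in the study.

```python
import re

# Hypothetical skill lists (assumptions for illustration).
SINGLE_WORD_SKILLS = {"python", "sql", "r", "scrum"}
MULTI_WORD_SKILLS = ["machine learning", "project management"]


def clean(text):
    """Lowercase, drop brackets and special characters, normalize whitespace."""
    text = text.lower()
    text = re.sub(r"[()\[\]{}]", " ", text)
    text = re.sub(r"[^\w\s+#]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def extract_skills(vacancy_text):
    """Return the set of skills found; each skill counts once per vacancy."""
    text = clean(vacancy_text)
    # Unigram lookup: split on a single space and match whole tokens only,
    # so the one-character skill "r" is not found inside words like "are".
    found = SINGLE_WORD_SKILLS & set(text.split(" "))
    # Multi-word skills are matched with regular expressions on the cleaned text.
    for skill in MULTI_WORD_SKILLS:
        if re.search(r"\b" + re.escape(skill) + r"\b", text):
            found.add(skill)
    return found


vacancy = (
    "We need Python (and SQL); Python is key.\n"
    "Machine  learning experience and R are a plus."
)
skills = extract_skills(vacancy)
print(sorted(skills))
```

Returning a set rather than a list implements the rule that repeated mentions within one vacancy count only once.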

As mentioned above, there is no generally accepted definition of entrepreneurial skills. Moreover, there are semantic problems, for example, one job ad could mention “solution-oriented,” whereas another one requires applicants to be “capable of solving problems”; often multiple skills fall into the same category. Therefore, within the other skills list, we created 11 broader categories for the entrepreneurial skills reflecting frequently mentioned skills in the framework of twenty-first century skills and in the entrepreneurship literature (see Table 1).

Table 1 Categories of entrepreneurial skills with examples

4.1.3 Results

Figure 1 shows the ranking of entrepreneurial skills in all vacancies that require at least one entrepreneurial skill. This ranking is based on the cumulative fraction of appearance of the skills of a certain category in all vacancies. Thus, it is the total number of skills appearing in the job descriptions normalized by the total number of jobs of that year in that category. The more often the skills from a certain category are demanded in vacancies, the higher the rank of this category on our heat map (and the darker the color). In other words, this shows the (change in) total demand for the skills in the different categories.
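A minimal sketch of how such a cumulative fraction could be computed is given below. The records are invented, and the choice of denominator (all vacancies of the respective year in the data passed in) is an assumption about the definition above; in practice the records would first be filtered to the job type of interest.

```python
from collections import defaultdict

# Hypothetical extraction output: one (year, {(category, skill), ...}) per vacancy.
vacancies = [
    (2016, {("communication", "presenting"), ("communication", "writing")}),
    (2016, {("communication", "writing"), ("leadership", "coaching")}),
    (2016, set()),
    (2017, {("leadership", "coaching")}),
]


def cumulative_fractions(records):
    """Skill mentions per (year, category) divided by the vacancies of that year."""
    skill_counts = defaultdict(int)    # (year, category) -> number of skill mentions
    vacancy_counts = defaultdict(int)  # year -> number of vacancies
    for year, skills in records:
        vacancy_counts[year] += 1
        for category, _skill in skills:
            skill_counts[(year, category)] += 1
    return {
        key: count / vacancy_counts[key[0]]
        for key, count in skill_counts.items()
    }


fractions = cumulative_fractions(vacancies)
for key in sorted(fractions):
    print(key, round(fractions[key], 2))
```

Because one vacancy can mention several skills of the same category, the measure can exceed 1, which is consistent with the cumulative fractions above 1 reported for managers.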

Fig. 1 Ranking of entrepreneurial skills overall and per job type (2015–2017)

Overall, communication skills are in highest demand in the years 2015–2017, followed closely by self-starter skills.Footnote 22 Planning and organization skills, also including the project management skills “agile” and “scrum,” rank highly for managers and ICT professionals.Footnote 23 Other skills categories that are more relevant in these two occupation types than in general are the well-known entrepreneurial skills collaboration and leadership. Surprisingly, creativity and flexibility are less demanded in these occupation types than overall, although the difference is less pronounced for flexibility. In contrast, self-starter skills are ranked first for other professions; flexibility comes third, while planning and organization skills end up in fourth position.

Moving to the dynamic dimension of our study, if we look at the trend in entrepreneurial skills between 2012 and 2017 (Figs. 2 and 3), we observe an increase in the demand for cooperation (related to communication (by factor 1.0) and collaboration (by factor 1.4) skills) and in skills for planning and organization (by factor 1.4), self-starter (by factor 1.0), computational thinking (by factor 1.2), problem solving (by factor 1.2), and active learning (by factor 1.7). Flexibility (by factor 1.1) and leadership skills (by factor 1.0) are also increasingly in demand, while the remaining skills remain more or less stable. Overall, the demand for active learning skills is rising most in this period, indicating an increasing need for employees who are intrinsically interested in achieving higher skill levels and in lifelong learning. This also highlights the repercussions of the ongoing digitization and automation, which lead to faster technological change and, therefore, impose higher demand for a highly skilled, self-managing, and continuously learning labor force.

Fig. 2 Dynamics of top entrepreneurial skills (2012–2017)

Fig. 3 Dynamics of bottom entrepreneurial skills (2012–2017)

Within the class of entrepreneurial skills, we thus find that there is an increase in the demand for communication, collaboration, computational thinking, planning and organizational, self-starter, problem solving, and active learning skills, highlighting the importance of the so-called twenty-first century skills.

Comparing the dynamics in entrepreneurial skills to the dynamics in digital skills and making a distinction between managers and non-managerial professions, we find that demand for entrepreneurial skills has increased by a factor of 1.3 for managers between 2012 and 2017 (Fig. 4), from a cumulative fraction of 3.17 to 4.07. The demand for this type of skills has also increased slightly for non-managerial occupations (combining ICT/technical jobs and non-ICT/technical jobs). For digital skills, we find an increase by a factor of 1.6 for managers (from a cumulative fraction of 0.54 to 0.87) but none for the other occupation types. Prüfer et al. (2019) explain the increase for managers by steeply increasing demand for skills related to “digital transformation” and “big data and analytics.”
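The reported factors follow directly from the cumulative fractions quoted above, as this short Python check illustrates. The function name is a hypothetical helper, and only the four figures stated in the text are used.

```python
def growth_factor(start, end):
    """Ratio of the end-of-period to the start-of-period cumulative fraction."""
    if start <= 0:
        raise ValueError("start value must be positive")
    return end / start


# Cumulative fractions for managers in 2012 and 2017, as quoted in the text.
entrepreneurial = growth_factor(3.17, 4.07)
digital = growth_factor(0.54, 0.87)

print(round(entrepreneurial, 1))
print(round(digital, 1))
```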

Fig. 4 Dynamics of entrepreneurial and digital skills

The relatively larger increase for managers’ digital skills (due to their low baseline demand in 2012) notwithstanding, Fig. 4 shows that the cumulative fraction of entrepreneurial skills demanded of managers is significantly larger than the cumulative fraction of digital skills demanded of them. Moreover, the absolute demand increase for managers’ entrepreneurial skills over the 5-year period studied (0.9 points) is also larger than the absolute increase for their digital skills (0.4 points).

Summarizing, we conclude that both entrepreneurial and digital skills are in increased demand for managerial positions in the Netherlands over the entire period 2012–2017. Given the hugely growing importance of datafication and our finding that, among digital skills, those on “digital transformation” and “big data and analytics” are most valued by employers, one could expect that demand for digital skills would increase most. Our empirical results, however, show the opposite: entrepreneurial skills were significantly more relevant over the six-year period studied. Moreover, the absolute importance of this skill type in managerial job vacancies has increased even more than that of digital skills.

5 Discussion and conclusion: opportunities and risks for researchers

The ongoing datafication, coupled with gigantic technological progress in the domain of AI, is changing all aspects of our lives: work, politics, community interactions, economic transactions, and many more. Agrawal et al. (2018, p.194) summarize:

AI can lead to disruption because incumbent firms often have weaker economic incentives than start-ups to adopt the technology. AI-enabled products are often inferior at first because it takes time to train a prediction machine to perform as well as a hard-coded device that follows human instructions rather than learning on its own. However, once deployed, an AI can continue to learn and improve, leaving its unintelligent competitors’ products behind. It is tempting for established companies to take a wait-and-see approach, standing on the sidelines and observing the progress in AI applied to their industry. That may work for some companies, but others may find it difficult to catch up once their competitors get ahead in the training and deployment of AI tools.

Now, substitute “researchers” for “firms”/“companies” in this quotation and “research projects” for “products.”

The disruption occurring at the economy level is mirrored in the world of research, fueled by developments in data science methods. Distinguishing themselves from traditional statistics and econometrics, these methods use algorithmic models and treat the data mechanism as unknown in order to discover complex structures that were not specified in advance. Where conventional statistics is deductive, data science is inductive. These inductive methods facilitate the automated collection of information, especially on, but not restricted to, the Internet. Via text analysis, computers can learn to understand the meaning of words, relate them to each other, and analyze them at scales that otherwise would require the help of hordes of research assistants. The new techniques and technologies also allow researchers to use many more (unstructured) real-time data sources to conduct analyses that would not have been possible otherwise, for instance by using sensor data from mobile devices (Blumenstock et al. 2015). By making reliance on subjective and self-reported surveys largely unnecessary and substituting these sources with objective data on revealed preferences, they improve the accuracy, robustness and, hence, the value of entrepreneurship research.

Given that these methods are usually freely available and relatively easy to learn, data science techniques thereby contribute to a democratization of empirical research tools, where scholars or students with fewer resources have a higher chance to compete with established researchers from resource-rich countries regarding the types of research questions they can study.

However, given the current state of data science methods, they cannot completely substitute for human creativity and research design skills.Footnote 24 According to Agrawal et al. (2018), AI algorithms are better than humans at factoring in complex interactions among different indicators if enough data are available. If this condition does not hold, however, humans are often better than machines when understanding the data generation process confers a prediction advantage. In the social sciences, data science methods appear to be especially well suited for first, inductive analyses that guide further research efforts. This occurs, for instance, by pointing researchers at relevant correlations and helping them to design better (field) experiments, to make better comparisons between more precise populations of interest, and to reveal behavior that was difficult to detect previously (Monroe et al. 2015). The inductive, data-driven approach can also point theorists at the key variables of interest for a specific question that deserve being modeled. This may alleviate the need for expert interviews or the use of small, unrepresentative surveys to obtain a first understanding of the main influence factors for a given research question. In Section 4, we showed the advantages of this approach, and the details of how to apply it, for a specific question from the domain of entrepreneurship research: the demand dynamics of entrepreneurial skills. Our study, based on a dataset of more than 95% of all job vacancies in the Netherlands over a 6-year period with 7.7 million data points, has shown that with data science methods we can study questions that could not have been studied on smaller, non-representative data sets. It has allowed us to state that demand for both entrepreneurial and digital skills has increased for managerial positions but that entrepreneurial skills were significantly more relevant over the entire period 2012–2017 and that the absolute importance of entrepreneurial skills has increased even more than that of digital skills. This finding may serve as motivation for more research on the role of entrepreneurial skills in the general population, and not only among (would-be) entrepreneurs.

Moreover, data science techniques may also reduce the risk that theorists fall victim to confirmation bias (Mahmoodi et al. 2017). Dreaming ahead, this may lead to a norm for the best theoretical researchers having to motivate their models by the results of big data analyses. Notably, data science methods are no substitute for theoretical research or conventional statistics. They complement those established methodologies. A fruitful avenue for further research is to combine big data and ML with administrative and survey data. In all social sciences, data science techniques have been largely applied to Internet data (often by scraping and analyzing big social media data sets). Entrepreneurship research is no exception, as Section 3 has shown. However, this approach ignores both potential selection effects that are due to differences between online (social media) users and the entire population and measurement errors that are due to the unreliability of social media data as a representative measure of social phenomena. Comparing the results of a (small) representative survey with results of (big) unrepresentative data, of which the representativeness can even be assessed empirically, therefore looks like an ideal way forward for empirical research.Footnote 25

Just as all technologies based on AI, data science methods come with risks. Agrawal et al. (2018) conclude their insightful book on the consequences of AI by focusing on three trade-offs. The first is productivity versus distribution. Bughin et al. (2018) note: “A key challenge is that adoption of AI could widen [performance and outcome] gaps between countries, companies, and workers.” Applied to research, data science methods can increase the number, breadth, and speed of questions we can work on, increasing our productivity. But researchers who neglect technological progress or who miss the train may feel very disadvantaged as some traditional methods may be dominated by data science techniques. Consequently, there may be a watershed moment for every researcher, where she either invests some time to familiarize herself with data science methods (for these the above-described democratization of research tools may kick in), or not (which saves time and effort in the short run but may come at significant risk for the relevance of her research in the long run).

The second trade-off is innovation versus competition. In business, the successes of Google and Facebook, both of which are highly data-driven firms that have embraced AI early, have shown that data-driven markets display first-mover advantages and are prone to market tipping. Importantly, watching the dismal fate of their competitors underlines how important it is not to fall behind.Footnote 26 To some degree, data science methods could introduce a similar spiral, where those researchers who embrace them early could produce higher-quality research, which may have positive feedback effects on their consecutive projects. As long as data sets from one project can be merged and, hence, be partly reused in future projects, the prediction power of those researchers’ models might outcompete latecomers repeatedly, discouraging entry of new researchers in their fields.Footnote 27 The quality of top researchers’ work might be(come) stellar but the competitive supply of answers to important research questions might decrease, giving the top researchers significant opinion leadership.

The third trade-off is performance versus privacy. Using AI successfully depends on huge amounts of data: the power of personalized services, and of inference about an individual’s preferences and characteristics, arises only if sufficient data about other individuals are available.Footnote 28 But the benefits of aggregate data may come at individuals’ cost, especially regarding privacy.Footnote 29 Doing research by analyzing big data sets with data science methods is subject to the same trade-off as running a firm in a data-driven market. Therefore, such research is subject to the same laws. As a direct policy response to datafication and AI, the General Data Protection Regulation (GDPR) became effective in the EU in May 2018, regulating the legal use of privacy-sensitive data, especially those relating to Internet services.Footnote 30 The GDPR is already affecting researchers doing empirical research that uses data from the EU or about EU citizens.Footnote 31

Crucially, the one-to-one translation of the three trade-offs listed by Agrawal et al. (2018) from the business to the research domain is subject to further scrutiny. For instance, it is unclear whether empirical research using data science methods is subject to the same indirect network effects as competition on data-driven markets (which leads to market tipping and one highly dominant firm per market).

By contrast, what is certainly true is that we as researchers need to keep up the standards of verifiability, reliability, and replicability of research results. However, this is particularly difficult when ML algorithms are used because, by definition, the algorithm is learning: it adapts based on feedback.Footnote 32 Therefore, it is harder than with conventional research methods to reproduce predictions (read: results) based on ML. What is necessary, thus, is to make the decision-making processes of algorithms more transparent. This would facilitate trust in the new technologies and replicability would be easier.

One option to achieve this goal is to build algorithms with an internal self-evaluation or calibration stage such that the machine can test its own certainty and report back to the researcher. One attempt in this direction is the Automatic Statistician, which was developed at Cambridge University.Footnote 33 The tool is set up with funding from Google and helps researchers to analyze their datasets while also providing a report in a human-understandable form that explains what it is doing and how certain it is about its predictions. This technology is related to a recent development within ML, Automated Machine Learning (AutoML). This approach tackles the fundamental problems of accountability and verifiability. Here, ML methods and hyper-parameter settings are automatically selected and, thereby, reduce the necessity of handcrafted human interventions. Apart from substantial performance improvements, AutoML can provide evaluations of all tested methods and specifications. Thereby, it can help non-experts to effectively and reliably apply ML techniques.

In all social sciences, including entrepreneurship research, there is a lot of ground to cover.