
Introduction to Rare-Event Predictive Modeling for Inferential Statisticians—A Hands-On Application in the Prediction of Breakthrough Patents

  • Conference paper
Financial Econometrics: Bayesian Analysis, Quantum Uncertainty, and Related Topics (ECONVN 2022)

Part of the book series: Studies in Systems, Decision and Control (SSDC, volume 427)


Abstract

Recent years have seen a substantial development of quantitative methods, largely led by the computer science community with the goal of developing better machine learning applications, mainly focused on predictive modeling. However, research in economics, management, and technology forecasting has so far been hesitant to adopt predictive modeling techniques and workflows. In this paper, we introduce a machine learning (ML) approach to quantitative analysis geared towards optimizing predictive performance, contrasting it with standard practice in inferential statistics, which focuses on producing good parameter estimates. We discuss the potential synergies between the two fields against the backdrop of this, at first glance, incompatibility of targets. We discuss fundamental concepts in predictive modeling, such as out-of-sample model validation, variable and model selection, generalization, and hyperparameter tuning procedures. We provide a hands-on introduction to predictive modeling for a quantitative social science audience, while aiming to demystify computer science jargon. Using the example of high-quality patent identification, we guide the reader through various model classes and procedures for data preprocessing, modeling, and validation. We start with more familiar, easy-to-interpret model classes (logit and elastic nets), continue with less familiar non-parametric approaches (classification trees and random forests), and finally present artificial neural network architectures, first a simple feed-forward network and then a deep autoencoder geared towards anomaly detection. Rather than limiting ourselves to the introduction of standard ML techniques, we also present state-of-the-art yet approachable techniques from artificial neural networks and deep learning for predicting rare phenomena of interest.


Notes

  1. Often, the challenge in adapting ML techniques for social science problems can be attributed to two issues: (1) technical lock-ins and (2) mental lock-ins against the backdrop of paradigmatic contrasts between research traditions. For instance, many ML techniques are initially demonstrated on a collection of standard datasets with specific properties that are well known in the ML and computer science communities. For an applied statistician, particularly in social science, the classification of Netflix movie ratings or the reconstruction of handwritten digits from the MNIST dataset may appear remote or trivial. We address these two problems by contrasting ML techniques with inferential statistics approaches, while using the non-trivial example of patent quality prediction, which should be easy to comprehend for scholars working in social science disciplines such as economics.

  2. We here blatantly draw on stereotypical workflows inherent to the econometrics and ML disciplines. We apologize for offending whoever does not fit neatly into one of these categories.

  3. At the point where our \(R^2\) exceeds a threshold somewhere around 0.1, we commonly stop worrying about it.

  4. As the name already suggests, this simply expresses by how much our prediction is on average off: \({RMSE} ={\sqrt{\frac{\sum _{i=1}^{n}({\hat{y}}_{i}-y_{i})^{2}}{n}}}.\)
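    As a quick illustration with made-up numbers, the RMSE can be computed directly:

    ```python
    import math

    def rmse(y_true, y_pred):
        """Root mean squared error: average magnitude of the prediction errors."""
        n = len(y_true)
        return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)

    # Hypothetical true values and predictions
    y_true = [3.0, 5.0, 2.0, 7.0]
    y_pred = [2.5, 5.5, 2.0, 8.0]
    print(rmse(y_true, y_pred))  # 0.6123724356957945
    ```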

  5. Interestingly, quite a few techniques associated with identification strategies popular among econometricians, such as the use of instrumental variables, endogenous selection models, fixed and random effects panel regressions, or vector autoregressions, are little known in the ML community.

  6. Such k-fold cross-validations can be conveniently done in R with the caret package, and in Python with scikit-learn.
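    A minimal sketch of such a k-fold cross-validation with scikit-learn; the synthetic dataset and logit model here are illustrative stand-ins, not the paper's actual setup:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic binary classification data standing in for patent features
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # 5-fold cross-validation: fit on 4 folds, evaluate on the held-out fold,
    # rotate until every fold has served once as the test set
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean())  # average out-of-sample accuracy across the 5 folds
    ```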

  7. However, one instantly recognizes the similarity to the nowadays common practice among econometricians of bootstrapping standard errors by computing them over different subsets of the data. The difference is that econometricians commonly use this procedure (i) to get more robust parameter estimates rather than to evaluate the model's overall goodness-of-fit, and (ii) compute them on subsets of the same data the model was fitted on.

  8. For exhaustive surveys on regularization approaches in machine learning, particularly focused on high-dimensional data, consider Wainwright (2014) and Pillonetto et al. (2014).

  9. Bootstrapping is a technique most applied econometricians are well acquainted with, yet used for a slightly different purpose. In econometrics, bootstrapping represents a powerful way to circumvent problems arising from selection bias and other sampling issues, where the regression on several subsamples is used to adjust the standard errors of the estimates.
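    The resampling idea can be sketched in a few lines of Python (hypothetical data; the statistic bootstrapped here is simply the sample mean, for which an analytic standard error exists for comparison):

    ```python
    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.normal(loc=10.0, scale=2.0, size=200)  # hypothetical sample

    # Bootstrap the standard error of the mean: resample with replacement,
    # recompute the statistic on each resample, take the spread of estimates
    boot_means = np.array([
        rng.choice(x, size=x.size, replace=True).mean()
        for _ in range(2000)
    ])
    se_boot = boot_means.std(ddof=1)
    se_formula = x.std(ddof=1) / np.sqrt(x.size)  # analytic SE for comparison
    print(se_boot, se_formula)  # the two estimates should be close
    ```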

  10. This is often not the case for typical ML problems, which draw on large numbers of observations and/or large sets of variables. Here, distributed or cloud-based workflows become necessary. We discuss the arising challenges elsewhere (e.g., Hain and Jurowetzki 2020).

  11. For a recent and exhaustive review of patent quality measures, including all those used in this exercise, consider Squicciarini et al. (2013).

  12. While the described process appears rather tedious by hand, specialized ML packages such as caret in R provide efficient workflows to automate the creation of folds as well as the hyperparameter grid search.
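    A hyperparameter grid search of the kind described, sketched with scikit-learn's GridSearchCV on synthetic data (the model class and grid values are illustrative assumptions):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=400, n_features=8, random_state=1)

    # Grid of hyperparameter candidates; every combination is evaluated
    # with 5-fold cross-validation and the best-scoring one is retained
    param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}
    search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)
    ```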

  13. For an exhaustive overview of model and variable selection algorithms, consider Castle et al. (2009).

  14. For an exhaustive discussion of the use of LASSO, consider Belloni et al. (2014). Elastic nets are implemented, among others, in the R package glmnet and in Python's scikit-learn.
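    A minimal elastic net sketch with scikit-learn (synthetic data; the alpha and l1_ratio values are illustrative assumptions, not tuned):

    ```python
    from sklearn.datasets import make_regression
    from sklearn.linear_model import ElasticNet

    # 20 features, only 5 of which actually drive the outcome
    X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                           noise=10.0, random_state=0)

    # l1_ratio mixes the LASSO (1.0) and ridge (0.0) penalties; alpha scales both
    model = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

    # The L1 component can shrink coefficients exactly to zero,
    # effectively performing variable selection
    n_selected = (model.coef_ != 0).sum()
    print(n_selected, "of", X.shape[1], "features retained")
    ```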

  15. There are many packages implementing regression trees in common data science environments, such as rpart, tree, and party for R, and again the machine learning all-rounder scikit-learn in Python. For a more exhaustive introduction to CART models, consider Strobl et al. (2009).
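    A minimal classification tree in scikit-learn, whose learned splits can be printed as human-readable if/else rules (synthetic data, illustrative only):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = make_classification(n_samples=300, n_features=4, random_state=0)

    # A shallow tree keeps the splitting rules easy to read and interpret
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree))  # prints the learned if/else splitting rules
    ```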

  16. Indeed, it is worth mentioning here that many model tuning techniques are based on the idea that adding randomness to the prediction process—somewhat counter-intuitively—increases the robustness and out-of-sample prediction performance of the model.
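    The random forest illustrates this point well: assuming a scikit-learn setup on synthetic data, the model deliberately injects randomness via bootstrap samples of the rows and random feature subsets at each split:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=12, random_state=0)

    # Each tree sees a bootstrap sample of rows and a random subset of
    # features at every split -- two deliberate sources of randomness
    # that decorrelate the trees and stabilize the ensemble prediction
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    random_state=0)
    scores = cross_val_score(forest, X, y, cv=5)
    print(scores.mean())  # out-of-sample accuracy averaged over 5 folds
    ```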

  17. Just to give an example, Mullainathan and Spiess (2017) demonstrate how a LASSO might select very different features in every fold.

  18. It has to be stressed that even though neural networks are indeed inspired by the most basic concept of how a brain works, they are by no means mysterious artificial brains. The analogy goes only as far as the abstraction of a set of neurons interconnected in some architecture. A neuron is represented by some sigmoid function (somewhat like a logistic regression) which decides, based on the inputs received, whether it should be activated and send a signal to connected neurons, which might in turn trigger their activation. That said, calling a neural network an artificial brain is somewhat like calling a paper plane an artificial bird.
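    The abstraction described above can be written out in a few lines of plain Python (the inputs, weights, and bias are made-up numbers):

    ```python
    import math

    def sigmoid(z):
        """Squashes any real input into the (0, 1) activation range."""
        return 1.0 / (1.0 + math.exp(-z))

    def neuron(inputs, weights, bias):
        """One artificial neuron: weighted sum of inputs, then sigmoid."""
        z = sum(w * x for w, x in zip(weights, inputs)) + bias
        return sigmoid(z)

    # Hypothetical inputs and weights; an output near 1 means the neuron "fires"
    activation = neuron([1.0, 0.5], [0.8, -0.4], bias=0.1)
    print(activation)  # about 0.67
    ```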

  19. For the sake of simplicity, we will here not distinguish between the simple perceptron model, sigmoid neurons, and the more recently common rectified linear neurons (Glorot et al. 2011).

  20. This complex algorithm simultaneously adjusts all weights in the network, considering each neuron's individual contribution to the error.

  21. For an overview of other methods using a similar logic, consider Wang (2005), Zhou and Lang (2003), and Shyu et al. (2003).

  22. Variational autoencoders are a slightly more modern and interesting take on this class of models, which also performed well in our experiments. Following the KISS principle, we decided to use the more traditional and simpler autoencoder architecture, which is easier to explain and performed almost equally well.
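    The reconstruction-error logic behind autoencoder-based anomaly detection can be sketched as follows. This is not the architecture used in the paper, just a minimal stand-in using scikit-learn's MLPRegressor trained to reproduce its own input through a narrow bottleneck layer:

    ```python
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)

    # "Normal" observations cluster around 0; train the autoencoder on them only
    X_normal = rng.normal(0.0, 1.0, size=(500, 8))
    X_anomaly = rng.normal(5.0, 1.0, size=(5, 8))  # rare, very different points

    # An autoencoder is a network trained to reproduce its own input
    # through a narrow hidden layer (the bottleneck)
    ae = MLPRegressor(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
    ae.fit(X_normal, X_normal)

    def reconstruction_error(model, X):
        """Mean squared reconstruction error per observation."""
        return np.mean((model.predict(X) - X) ** 2, axis=1)

    # Anomalies were never seen in training, so they reconstruct poorly
    err_normal = reconstruction_error(ae, X_normal).mean()
    err_anomaly = reconstruction_error(ae, X_anomaly).mean()
    print(err_normal < err_anomaly)  # True: anomalies stand out
    ```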

References

  • Aguinis, H., Pierce, C.A., Bosco, F.A., Muslin, I.S.: First decade of organizational research methods: trends in design, measurement, and data-analysis topics. Organ. Res. Methods 12(1), 69–112 (2009)
  • Ahuja, G., Lampert, C.: Entrepreneurship in the large corporation: a longitudinal study of how established firms create breakthrough inventions. Strateg. Manag. J. 22(6–7), 521–543 (2001)
  • An, J., Cho, S.: Variational autoencoder based anomaly detection using reconstruction probability. Tech. Rep., SNU Data Mining Center (2015)
  • Andrews, R.J., Fazio, C., Guzman, J., Stern, S.: The startup cartography project: a map of entrepreneurial quality and quantity in the United States across time and location. MIT Working Paper (2017)
  • Athey, S., Imbens, G.W.: The state of applied econometrics: causality and policy evaluation. J. Econ. Perspect. 31(2), 3–32 (2017)
  • Basberg, B.L.: Patents and the measurement of technological change: a survey of the literature. Res. Policy 16(2–4), 131–141 (1987)
  • Belloni, A., Chernozhukov, V., Hansen, C.: Inference on treatment effects after selection among high-dimensional controls. Rev. Econ. Stud. 81(2), 608–650 (2014)
  • Castle, J.L., Qin, X., Reed, W.R., et al.: How to pick the best regression equation: a review and comparison of model selection algorithms. Working Paper No. 13/2009, Department of Economics and Finance, University of Canterbury (2009)
  • Einav, L., Levin, J.: The data revolution and economic analysis. Innov. Policy Econ. 14(1), 1–24 (2014)
  • Einav, L., Levin, J.: Economics in the age of big data. Science 346(6210), 1243089 (2014)
  • Ernst, H.: Patent applications and subsequent changes of performance: evidence from time-series cross-section analyses on the firm level. Res. Policy 30(1), 143–157 (2001)
  • Fazio, C., Guzman, J., Murray, F., Stern, S.: A new view of the skew: quantitative assessment of the quality of American entrepreneurship. MIT Innovation Initiative Paper (2016)
  • Friedman, J.H., Popescu, B.E.: Predictive learning via rule ensembles. Ann. Appl. Stat. 916–954 (2008)
  • Gebru, T., Krause, J., Wang, Y., Chen, D., Deng, J., Aiden, E.L., Fei-Fei, L.: Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States. Proc. Natl. Acad. Sci. 201700035 (2017)
  • George, G., Osinga, E.C., Lavie, D., Scott, B.A.: From the editors: big data and data science methods for management research. Acad. Manag. J. 59(5), 1493–1507 (2016)
  • Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)
  • Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment classification: a deep learning approach. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 513–520 (2011)
  • Guzman, J., Stern, S.: Where is Silicon Valley? Science 347(6222), 606–609 (2015)
  • Guzman, J., Stern, S.: Nowcasting and placecasting entrepreneurial quality and performance. In: Haltiwanger, J., Hurst, E., Miranda, J., Schoar, A. (eds.) Measuring Entrepreneurial Businesses: Current Knowledge and Challenges, Chapter 2. University of Chicago Press (2017)
  • Hagedoorn, J., Schakenraad, J.: A comparison of private and subsidized R&D partnerships in the European information technology industry. JCMS J. Common Market Stud. 31(3), 373–390 (1993)
  • Hain, D.S., Jurowetzki, R.: The potentials of machine learning and big data in entrepreneurship research: the liaison of econometrics and data science. In: Cowling, M., Saridakis, G. (eds.) Handbook of Quantitative Research Methods in Entrepreneurship. Edward Elgar Publishing (2020)
  • Hain, D.S., Jurowetzki, R., Buchmann, T., Wolf, P.: A text-embedding-based approach to measuring patent-to-patent technological similarity. Technol. Forecast. Soc. Change 177, 121559 (2022). https://doi.org/10.1016/j.techfore.2022.121559
  • Hall, B.H., Harhoff, D.: Recent research on the economics of patents. Annu. Rev. Econ. 4(1), 541–565 (2012)
  • Harhoff, D., Scherer, F.M., Vopel, K.: Citations, family size, opposition and the value of patent rights. Res. Policy 32(8), 1343–1363 (2003)
  • Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
  • Hirschey, M., Richardson, V.J.: Are scientific indicators of patent quality useful to investors? J. Empir. Financ. 11(1), 91–107 (2004)
  • LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
  • Lerner, J.: The importance of patent scope: an empirical analysis. RAND J. Econ. 319–333 (1994)
  • McAfee, A., Brynjolfsson, E., Davenport, T.H., et al.: Big data: the management revolution. Harv. Bus. Rev. 90(10), 60–68 (2012)
  • McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5(4), 115–133 (1943)
  • Mullainathan, S., Spiess, J.: Machine learning: an applied econometric approach. J. Econ. Perspect. 31(2), 87–106 (2017)
  • Narin, F., Hamilton, K.S., Olivastro, D.: The increasing linkage between US technology and public science. Res. Policy 26(3), 317–330 (1997). https://doi.org/10.1016/S0048-7333(97)00013-9
  • Ng, A.Y.: Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 78. ACM (2004)
  • Perlich, C., Provost, F., Simonoff, J.S.: Tree induction vs. logistic regression: a learning-curve analysis. J. Mach. Learn. Res. 4(Jun), 211–255 (2003)
  • Pillonetto, G., Dinuzzo, F., Chen, T., De Nicolao, G., Ljung, L.: Kernel methods in system identification, machine learning and function estimation: a survey. Automatica 50(3), 657–682 (2014)
  • Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?": explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM (2016)
  • Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65(6), 386 (1958)
  • Sakurada, M., Yairi, T.: Anomaly detection using autoencoders with nonlinear dimensionality reduction. In: Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, p. 4. ACM (2014)
  • Sedhain, S., Menon, A.K., Sanner, S., Xie, L.: AutoRec: autoencoders meet collaborative filtering. In: Proceedings of the 24th International Conference on World Wide Web, pp. 111–112. ACM (2015)
  • Shane, S.: Technological opportunities and new firm creation. Manag. Sci. 47(2), 205–220 (2001)
  • Shyu, M.-L., Chen, S.-C., Sarinnapakorn, K., Chang, L.: A novel anomaly detection scheme based on principal component classifier. Technical report, University of Miami, Coral Gables, FL, Department of Electrical and Computer Engineering (2003)
  • Squicciarini, M., Dernis, H., Criscuolo, C.: Measuring patent quality: indicators of technological and economic value (2013)
  • Strobl, C., Malley, J., Tutz, G.: An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychol. Methods 14(4), 323 (2009)
  • Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014)
  • Taleb, N.: The Black Swan: The Impact of the Highly Improbable. Random House Trade Paperbacks (2010)
  • Therneau, T.M., Atkinson, E.J., et al.: An introduction to recursive partitioning using the rpart routines (1997)
  • Tian, F., Gao, B., Cui, Q., Chen, E., Liu, T.-Y.: Learning deep representations for graph clustering. In: AAAI, pp. 1293–1299 (2014)
  • Trajtenberg, M., Henderson, R., Jaffe, A.: University versus corporate patents: a window on the basicness of invention. Econ. Innov. New Technol. 5(1), 19–50 (1997)
  • van der Vegt, G.S., Essens, P., Wahlström, M., George, G.: Managing risk and resilience. Acad. Manag. J. 58(4), 971–980 (2015)
  • Varian, H.R.: Big data: new tricks for econometrics. J. Econ. Perspect. 28(2), 3–27 (2014)
  • Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11(Dec), 3371–3408 (2010)
  • Wainwright, M.: Structured regularizers for high-dimensional problems: statistical and computational issues. Annu. Rev. Stat. Appl. 1, 233–253 (2014)
  • Wang, Y.: A multinomial logistic regression modeling approach for anomaly intrusion detection. Comput. Secur. 24(8), 662–674 (2005)
  • Zhou, M., Lang, S.-D.: Mining frequency content of network traffic for intrusion detection. In: Proceedings of the IASTED International Conference on Communication, Network, and Information Security, pp. 101–107 (2003)


Author information

Corresponding author

Correspondence to Daniel Hain.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Hain, D., Jurowetzki, R. (2022). Introduction to Rare-Event Predictive Modeling for Inferential Statisticians—A Hands-On Application in the Prediction of Breakthrough Patents. In: Ngoc Thach, N., Kreinovich, V., Ha, D.T., Trung, N.D. (eds) Financial Econometrics: Bayesian Analysis, Quantum Uncertainty, and Related Topics. ECONVN 2022. Studies in Systems, Decision and Control, vol 427. Springer, Cham. https://doi.org/10.1007/978-3-030-98689-6_5
