
Introduction to Rare-Event Predictive Modeling for Inferential Statisticians—A Hands-On Application in the Prediction of Breakthrough Patents

  • Conference paper
Financial Econometrics: Bayesian Analysis, Quantum Uncertainty, and Related Topics (ECONVN 2022)

Part of the book series: Studies in Systems, Decision and Control (SSDC, volume 427)


Abstract

Recent years have seen a substantial development of quantitative methods, largely led by the computer science community with the goal of developing better machine learning applications, mainly focused on predictive modeling. However, research in economics, management, and technology forecasting has so far been hesitant to adopt predictive modeling techniques and workflows. In this paper, we introduce a machine learning (ML) approach to quantitative analysis geared towards optimizing predictive performance, contrasting it with standard practice in inferential statistics, which focuses on producing good parameter estimates. We discuss the potential synergies between the two fields against the backdrop of this, at first glance, incompatibility of targets. We discuss fundamental concepts in predictive modeling, such as out-of-sample model validation, variable and model selection, generalization, and hyperparameter tuning procedures. We provide a hands-on introduction to predictive modeling for a quantitative social science audience, while aiming to demystify computer science jargon. Using the example of high-quality patent identification, we guide the reader through various model classes and procedures for data preprocessing, modeling, and validation. We start with more familiar, easy-to-interpret model classes (logit and elastic nets), continue with less familiar non-parametric approaches (classification trees and random forests), and finally present artificial neural network architectures, first a simple feed-forward network and then a deep autoencoder geared towards anomaly detection. Rather than limiting ourselves to the introduction of standard ML techniques, we also present state-of-the-art yet approachable techniques from artificial neural networks and deep learning for predicting rare phenomena of interest.


Notes

  1. Often, the challenge in adapting ML techniques for social science problems can be attributed to two issues: (1) technical lock-ins and (2) mental lock-ins against the backdrop of paradigmatic contrasts between research traditions. For instance, many ML techniques are initially demonstrated on a collection of standard datasets with specific properties that are well known in the ML and computer science communities. For an applied statistician, particularly in social science, the classification of Netflix movie ratings or the reconstruction of handwritten digits from the MNIST dataset may appear remote or trivial. We address these two problems by contrasting ML techniques with inferential statistics approaches, while using the non-trivial example of patent quality prediction, which should be easy to comprehend for scholars working in social science disciplines such as economics.

  2. We here blatantly draw on stereotypical workflows inherent to the econometrics and ML disciplines. We apologize for offending whoever does not fit neatly into one of these categories.

  3. At the point where our \(R^2\) exceeds a threshold somewhere around 0.1, we commonly stop worrying about it.

  4. As the name already suggests, this simply expresses by how much our prediction is on average off: \({RMSE} ={\sqrt{\frac{\sum _{i=1}^{n}({\hat{y}}_{i}-y_{i})^{2}}{n}}}.\)
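    As a quick illustration with made-up numbers, the RMSE can be computed directly:

    ```python
    import math

    def rmse(y_true, y_pred):
        """Root mean squared error: average magnitude of the prediction errors."""
        n = len(y_true)
        return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)

    # Hypothetical true values and predictions
    y_true = [3.0, 5.0, 2.0, 7.0]
    y_pred = [2.5, 5.5, 2.0, 8.0]
    print(rmse(y_true, y_pred))  # 0.6123724356957945
    ```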

  5. Interestingly, quite a few techniques associated with identification strategies popular among econometricians, such as the use of instrumental variables, endogenous selection models, fixed and random effects panel regressions, or vector autoregressions, are little known in the ML community.

  6. Such k-fold cross-validations can be conveniently done in R with the caret package, and in Python with scikit-learn.
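    A minimal sketch of such a k-fold cross-validation with scikit-learn; the synthetic dataset and logit model here are illustrative stand-ins, not the paper's actual setup:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic binary classification data standing in for patent features
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # 5-fold cross-validation: fit on 4 folds, evaluate on the held-out fold,
    # rotate until every fold has served once as the test set
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean())  # average out-of-sample accuracy across the 5 folds
    ```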

  7. However, one instantly recognizes the similarity to the nowadays common practice among econometricians of bootstrapping standard errors by computing them over different subsets of the data. The difference is that econometricians commonly use this procedure (i) to get more robust parameter estimates rather than to evaluate the model's overall goodness-of-fit, and (ii) compute them on subsets of the same data the model was fitted on.

  8. For exhaustive surveys on regularization approaches in machine learning, particularly focused on high-dimensional data, consider Wainwright (2014) and Pillonetto et al. (2014).

  9. Bootstrapping is a technique most applied econometricians are well acquainted with, yet used for a slightly different purpose. In econometrics, bootstrapping represents a powerful way to circumvent problems arising from selection bias and other sampling issues, where the regression on several subsamples is used to adjust the standard errors of the estimates.
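    The resampling idea can be sketched in a few lines of Python (hypothetical data; the statistic bootstrapped here is simply the sample mean, for which an analytic standard error exists for comparison):

    ```python
    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.normal(loc=10.0, scale=2.0, size=200)  # hypothetical sample

    # Bootstrap the standard error of the mean: resample with replacement,
    # recompute the statistic on each resample, take the spread of estimates
    boot_means = np.array([
        rng.choice(x, size=x.size, replace=True).mean()
        for _ in range(2000)
    ])
    se_boot = boot_means.std(ddof=1)
    se_formula = x.std(ddof=1) / np.sqrt(x.size)  # analytic SE for comparison
    print(se_boot, se_formula)  # the two estimates should be close
    ```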

  10. This is often not the case for typical ML problems, which draw on large numbers of observations and/or large sets of variables. Here, distributed or cloud-based workflows become necessary. We discuss the arising challenges elsewhere (e.g., Hain and Jurowetzki 2020).

  11. For a recent and exhaustive review of patent quality measures, including all those used in this exercise, consider Squicciarini et al. (2013).

  12. While the described process appears rather tedious by hand, specialized ML packages such as caret in R provide efficient workflows to automate the creation of folds as well as the hyperparameter grid search.
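    A hyperparameter grid search of the kind described, sketched with scikit-learn's GridSearchCV on synthetic data (the model class and grid values are illustrative assumptions):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=400, n_features=8, random_state=1)

    # Grid of hyperparameter candidates; every combination is evaluated
    # with 5-fold cross-validation and the best-scoring one is retained
    param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}
    search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)
    ```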

  13. For an exhaustive overview of model and variable selection algorithms, consider Castle et al. (2009).

  14. For an exhaustive discussion of the use of LASSO, consider Belloni et al. (2014). Elastic nets are implemented, among others, in the R package glmnet and in Python's scikit-learn.
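    A minimal elastic net sketch with scikit-learn (synthetic data; the alpha and l1_ratio values are illustrative assumptions, not tuned):

    ```python
    from sklearn.datasets import make_regression
    from sklearn.linear_model import ElasticNet

    # 20 features, only 5 of which actually drive the outcome
    X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                           noise=10.0, random_state=0)

    # l1_ratio mixes the LASSO (1.0) and ridge (0.0) penalties; alpha scales both
    model = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

    # The L1 component can shrink coefficients exactly to zero,
    # effectively performing variable selection
    n_selected = (model.coef_ != 0).sum()
    print(n_selected, "of", X.shape[1], "features retained")
    ```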

  15. There are many packages implementing regression trees in common data science environments, such as rpart, tree, and party for R, and again the machine learning all-rounder scikit-learn in Python. For a more exhaustive introduction to CART models, consider Strobl et al. (2009).
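    A minimal classification tree in scikit-learn, whose learned splits can be printed as human-readable if/else rules (synthetic data, illustrative only):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = make_classification(n_samples=300, n_features=4, random_state=0)

    # A shallow tree keeps the splitting rules easy to read and interpret
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree))  # prints the learned if/else splitting rules
    ```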

  16. Indeed, it is worth mentioning here that many model tuning techniques are based on the idea that adding randomness to the prediction process—somewhat counter-intuitively—increases the robustness and out-of-sample prediction performance of the model.
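    The random forest illustrates this point well: assuming a scikit-learn setup on synthetic data, the model deliberately injects randomness via bootstrap samples of the rows and random feature subsets at each split:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=12, random_state=0)

    # Each tree sees a bootstrap sample of rows and a random subset of
    # features at every split -- two deliberate sources of randomness
    # that decorrelate the trees and stabilize the ensemble prediction
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    random_state=0)
    scores = cross_val_score(forest, X, y, cv=5)
    print(scores.mean())  # out-of-sample accuracy averaged over 5 folds
    ```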

  17. Just to give an example, Mullainathan and Spiess (2017) demonstrate how a LASSO might select very different features in every fold.

  18. It has to be stressed that even though neural networks are indeed inspired by the most basic concept of how a brain works, they are by no means mysterious artificial brains. The analogy goes only as far as the abstraction of a set of neurons interconnected in some architecture. A neuron is represented by some sigmoid function (somewhat like a logistic regression) which decides, based on the inputs received, whether it should be activated and send a signal to connected neurons, which might in turn trigger their activation. That said, calling a neural network an artificial brain is somewhat like calling a paper plane an artificial bird.
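    The abstraction described above can be written out in a few lines of plain Python (the inputs, weights, and bias are made-up numbers):

    ```python
    import math

    def sigmoid(z):
        """Squashes any real input into the (0, 1) activation range."""
        return 1.0 / (1.0 + math.exp(-z))

    def neuron(inputs, weights, bias):
        """One artificial neuron: weighted sum of inputs, then sigmoid."""
        z = sum(w * x for w, x in zip(weights, inputs)) + bias
        return sigmoid(z)

    # Hypothetical inputs and weights; an output near 1 means the neuron "fires"
    activation = neuron([1.0, 0.5], [0.8, -0.4], bias=0.1)
    print(activation)  # about 0.67
    ```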

  19. For the sake of simplicity, we will here not distinguish between the simple perceptron model, sigmoid neurons, and the more recently common rectified linear neurons (Glorot et al. 2011).

  20. This complex algorithm simultaneously adjusts all weights in the network, considering each neuron's individual contribution to the error.

  21. For an overview of other methods using a similar logic, consider Wang (2005), Zhou and Lang (2003), and Shyu et al. (2003).

  22. Variational autoencoders are a slightly more modern and interesting take on this class of models, which also performed well in our experiments. Following the KISS principle, we decided to use the more traditional and simpler autoencoder architecture, which is easier to explain and performed almost equally well.
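    The reconstruction-error logic behind autoencoder-based anomaly detection can be sketched as follows. This is not the architecture used in the paper, just a minimal stand-in using scikit-learn's MLPRegressor trained to reproduce its own input through a narrow bottleneck layer:

    ```python
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)

    # "Normal" observations cluster around 0; train the autoencoder on them only
    X_normal = rng.normal(0.0, 1.0, size=(500, 8))
    X_anomaly = rng.normal(5.0, 1.0, size=(5, 8))  # rare, very different points

    # An autoencoder is a network trained to reproduce its own input
    # through a narrow hidden layer (the bottleneck)
    ae = MLPRegressor(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
    ae.fit(X_normal, X_normal)

    def reconstruction_error(model, X):
        """Mean squared reconstruction error per observation."""
        return np.mean((model.predict(X) - X) ** 2, axis=1)

    # Anomalies were never seen in training, so they reconstruct poorly
    err_normal = reconstruction_error(ae, X_normal).mean()
    err_anomaly = reconstruction_error(ae, X_anomaly).mean()
    print(err_normal < err_anomaly)  # True: anomalies stand out
    ```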

References

  • Aguinis, H., Pierce, C.A., Bosco, F.A., Muslin, I.S.: First decade of organizational research methods: trends in design, measurement, and data-analysis topics. Organ. Res. Methods 12(1), 69–112 (2009)
  • Ahuja, G., Lampert, C.: Entrepreneurship in the large corporation: a longitudinal study of how established firms create breakthrough inventions. Strateg. Manag. J. 22(6–7), 521–543 (2001)
  • An, J., Cho, S.: Variational autoencoder based anomaly detection using reconstruction probability. Tech. Rep., SNU Data Mining Center (2015)
  • Andrews, R.J., Fazio, C., Guzman, J., Stern, S.: The startup cartography project: a map of entrepreneurial quality and quantity in the United States across time and location. MIT Working Paper (2017)
  • Athey, S., Imbens, G.W.: The state of applied econometrics: causality and policy evaluation. J. Econ. Perspect. 31(2), 3–32 (2017)
  • Basberg, B.L.: Patents and the measurement of technological change: a survey of the literature. Res. Policy 16(2–4), 131–141 (1987)
  • Belloni, A., Chernozhukov, V., Hansen, C.: Inference on treatment effects after selection among high-dimensional controls. Rev. Econ. Stud. 81(2), 608–650 (2014)
  • Castle, J.L., Qin, X., Reed, W.R., et al.: How to pick the best regression equation: a review and comparison of model selection algorithms. Working Paper No. 13/2009, Department of Economics and Finance, University of Canterbury (2009)
  • Einav, L., Levin, J.: The data revolution and economic analysis. Innov. Policy Econ. 14(1), 1–24 (2014)
  • Einav, L., Levin, J.: Economics in the age of big data. Science 346(6210), 1243089 (2014)
  • Ernst, H.: Patent applications and subsequent changes of performance: evidence from time-series cross-section analyses on the firm level. Res. Policy 30(1), 143–157 (2001)
  • Fazio, C., Guzman, J., Murray, F., Stern, S.: A new view of the skew: quantitative assessment of the quality of American entrepreneurship. MIT Innovation Initiative Paper (2016)
  • Friedman, J.H., Popescu, B.E.: Predictive learning via rule ensembles. Ann. Appl. Stat. 916–954 (2008)
  • Gebru, T., Krause, J., Wang, Y., Chen, D., Deng, J., Aiden, E.L., Fei-Fei, L.: Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States. Proc. Natl. Acad. Sci. 201700035 (2017)
  • George, G., Osinga, E.C., Lavie, D., Scott, B.A.: From the editors: big data and data science methods for management research. Acad. Manag. J. 59(5), 1493–1507 (2016)
  • Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)
  • Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment classification: a deep learning approach. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 513–520 (2011)
  • Guzman, J., Stern, S.: Where is Silicon Valley? Science 347(6222), 606–609 (2015)
  • Guzman, J., Stern, S.: Nowcasting and placecasting entrepreneurial quality and performance. In: Haltiwanger, J., Hurst, E., Miranda, J., Schoar, A. (eds.) Measuring Entrepreneurial Businesses: Current Knowledge and Challenges, Chapter 2. University of Chicago Press (2017)
  • Hagedoorn, J., Schakenraad, J.: A comparison of private and subsidized R&D partnerships in the European information technology industry. JCMS J. Common Market Stud. 31(3), 373–390 (1993)
  • Hain, D.S., Jurowetzki, R.: The potentials of machine learning and big data in entrepreneurship research: the liaison of econometrics and data science. In: Cowling, M., Saridakis, G. (eds.) Handbook of Quantitative Research Methods in Entrepreneurship. Edward Elgar Publishing (2020)
  • Hain, D.S., Jurowetzki, R., Buchmann, T., Wolf, P.: A text-embedding-based approach to measuring patent-to-patent technological similarity. Technol. Forecast. Soc. Change 177, 121559 (2022). https://doi.org/10.1016/j.techfore.2022.121559
  • Hall, B.H., Harhoff, D.: Recent research on the economics of patents. Annu. Rev. Econ. 4(1), 541–565 (2012)
  • Harhoff, D., Scherer, F.M., Vopel, K.: Citations, family size, opposition and the value of patent rights. Res. Policy 32(8), 1343–1363 (2003)
  • Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
  • Hirschey, M., Richardson, V.J.: Are scientific indicators of patent quality useful to investors? J. Empir. Financ. 11(1), 91–107 (2004)
  • LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
  • Lerner, J.: The importance of patent scope: an empirical analysis. RAND J. Econ. 319–333 (1994)
  • McAfee, A., Brynjolfsson, E., Davenport, T.H., et al.: Big data: the management revolution. Harv. Bus. Rev. 90(10), 60–68 (2012)
  • McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5(4), 115–133 (1943)
  • Mullainathan, S., Spiess, J.: Machine learning: an applied econometric approach. J. Econ. Perspect. 31(2), 87–106 (2017)
  • Narin, F., Hamilton, K.S., Olivastro, D.: The increasing linkage between US technology and public science. Res. Policy 26(3), 317–330 (1997). https://doi.org/10.1016/S0048-7333(97)00013-9
  • Ng, A.Y.: Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 78. ACM (2004)
  • Perlich, C., Provost, F., Simonoff, J.S.: Tree induction vs. logistic regression: a learning-curve analysis. J. Mach. Learn. Res. 4(Jun), 211–255 (2003)
  • Pillonetto, G., Dinuzzo, F., Chen, T., De Nicolao, G., Ljung, L.: Kernel methods in system identification, machine learning and function estimation: a survey. Automatica 50(3), 657–682 (2014)
  • Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?": explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM (2016)
  • Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65(6), 386 (1958)
  • Sakurada, M., Yairi, T.: Anomaly detection using autoencoders with nonlinear dimensionality reduction. In: Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, p. 4. ACM (2014)
  • Sedhain, S., Menon, A.K., Sanner, S., Xie, L.: AutoRec: autoencoders meet collaborative filtering. In: Proceedings of the 24th International Conference on World Wide Web, pp. 111–112. ACM (2015)
  • Shane, S.: Technological opportunities and new firm creation. Manag. Sci. 47(2), 205–220 (2001)
  • Shyu, M.-L., Chen, S.-C., Sarinnapakorn, K., Chang, L.: A novel anomaly detection scheme based on principal component classifier. Technical report, University of Miami, Coral Gables, FL, Department of Electrical and Computer Engineering (2003)
  • Squicciarini, M., Dernis, H., Criscuolo, C.: Measuring patent quality: indicators of technological and economic value (2013)
  • Strobl, C., Malley, J., Tutz, G.: An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychol. Methods 14(4), 323 (2009)
  • Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014)
  • Taleb, N.: The Black Swan: The Impact of the Highly Improbable. Random House Trade Paperbacks (2010)
  • Therneau, T.M., Atkinson, E.J., et al.: An introduction to recursive partitioning using the rpart routines (1997)
  • Tian, F., Gao, B., Cui, Q., Chen, E., Liu, T.-Y.: Learning deep representations for graph clustering. In: AAAI, pp. 1293–1299 (2014)
  • Trajtenberg, M., Henderson, R., Jaffe, A.: University versus corporate patents: a window on the basicness of invention. Econ. Innov. New Technol. 5(1), 19–50 (1997)
  • van der Vegt, G.S., Essens, P., Wahlström, M., George, G.: Managing risk and resilience. Acad. Manag. J. 58(4), 971–980 (2015)
  • Varian, H.R.: Big data: new tricks for econometrics. J. Econ. Perspect. 28(2), 3–27 (2014)
  • Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11(Dec), 3371–3408 (2010)
  • Wainwright, M.: Structured regularizers for high-dimensional problems: statistical and computational issues. Annu. Rev. Stat. Appl. 1, 233–253 (2014)
  • Wang, Y.: A multinomial logistic regression modeling approach for anomaly intrusion detection. Comput. Secur. 24(8), 662–674 (2005)
  • Zhou, M., Lang, S.-D.: Mining frequency content of network traffic for intrusion detection. In: Proceedings of the IASTED International Conference on Communication, Network, and Information Security, pp. 101–107 (2003)


Author information

Corresponding author

Correspondence to Daniel Hain.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Hain, D., Jurowetzki, R. (2022). Introduction to Rare-Event Predictive Modeling for Inferential Statisticians—A Hands-On Application in the Prediction of Breakthrough Patents. In: Ngoc Thach, N., Kreinovich, V., Ha, D.T., Trung, N.D. (eds) Financial Econometrics: Bayesian Analysis, Quantum Uncertainty, and Related Topics. ECONVN 2022. Studies in Systems, Decision and Control, vol 427. Springer, Cham. https://doi.org/10.1007/978-3-030-98689-6_5
