Talent Spotting in Crowd Prediction

Abstract

Who is good at prediction? Addressing this question is key to recruiting and cultivating accurate crowds and effectively aggregating their judgments. Recent research on superforecasting has demonstrated the importance of individual, persistent skill in crowd prediction. This chapter takes stock of skill identification measures in probability estimation tasks, and complements the review with original analyses, comparing such measures directly within the same dataset. We classify all measures into five broad categories: (1) accuracy-related measures, such as proper scores, model-based estimates of accuracy and excess volatility scores; (2) intersubjective measures, including proxy, surrogate and similarity scores; (3) forecasting behaviors, including activity, belief updating, extremity, coherence, and linguistic properties of rationales; (4) dispositional measures of fluid intelligence, cognitive reflection, numeracy, personality and thinking styles; and (5) measures of expertise, including demonstrated knowledge, confidence calibration, biographical, and self-rated expertise. Among non-accuracy-related measures, we report a median correlation coefficient with outcomes of r = 0.20. In the absence of accuracy data, we find that intersubjective and behavioral measures are most strongly correlated with forecasting accuracy. These results hold in a LASSO machine-learning model with automated variable selection. Two focal applications provide context for these assessments: long-term, existential risk prediction and corporate forecasting tournaments.

Notes

  1. We refer to measures correlating with skill as predictors or correlates. To avoid confusion, we refer to individuals engaged in forecasting tasks as forecasters.

  2. Normalization alone does not account for question difficulty; it only transforms the distribution. When used as criterion variables, normalized scores are therefore standardized within each question: \( {SMNB}_{f,q}=\frac{MNB_{f,q}-{\overline{MNB}}_q}{SD\left({MNB}_{f,q}\right)} \) (see the sketch following these notes).

  3. The authors were members of the SAGE team in the Hybrid Forecasting Competition. Linguistic properties of rationales were among the features used in aggregation weighting algorithms. The SAGE team achieved the highest accuracy in 2020, the last season of the tournament.

  4. We do not offer complete coverage of intersubjective measures, including surrogate scores and similarity measures, but given our current results, further empirical investigation seems worthwhile.

  5. We have notified Epstein of this. As a result, he shared plans to edit the sentence in future editions of Range.

  6. Readers who have encountered research on forecaster skill identification through general media or popular science outlets may find some of our findings surprising. For example, a recent, admittedly non-scientific poll of 30 Twitter users by one of us (Atanasov) revealed that a plurality (40%) of respondents thought actively open-minded thinking was more strongly correlated with accuracy than update magnitude, fluid intelligence, or subject matter knowledge scores. Fewer than 20% correctly guessed that the closest correlate of accuracy was update magnitude.
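The standardization described in note 2 can be illustrated with a short sketch. This is a minimal example with hypothetical data and column names, not the chapter's code; it standardizes mean normalized Brier (MNB) scores within each question, as in the SMNB formula above.

```python
import pandas as pd

# Minimal sketch (hypothetical data and column names): standardize mean
# normalized Brier scores (MNB) within each question, per the SMNB formula.
df = pd.DataFrame({
    "question":   ["q1", "q1", "q1", "q2", "q2", "q2"],
    "forecaster": ["a", "b", "c", "a", "b", "c"],
    "mnb":        [0.10, 0.40, 0.25, 0.80, 0.60, 1.00],
})
grp = df.groupby("question")["mnb"]
df["smnb"] = (df["mnb"] - grp.transform("mean")) / grp.transform("std")
print(df)
```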

References

  • Arthur, W., Jr., Tubre, T. C., Paul, D. S., & Sanchez-Ku, M. L. (1999). College-sample psychometric and normative data on a short form of the Raven Advanced Progressive Matrices Test. Journal of Psychoeducational Assessment, 17(4), 354–361.

  • Aspinall, W. (2010). A route to more tractable expert advice. Nature, 463(7279), 294–295.

  • Atanasov, P., Rescober, P., Stone, E., Servan-Schreiber, E., Tetlock, P., Ungar, L., & Mellers, B. (2017). Distilling the wisdom of crowds: Prediction markets vs. prediction polls. Management Science, 63(3), 691–706.

  • Atanasov, P., Diamantaras, A., MacPherson, A., Vinarov, E., Benjamin, D. M., Shrier, I., Paul, F., Dirnagl, U., & Kimmelman, J. (2020a). Wisdom of the expert crowd prediction of response for 3 neurology randomized trials. Neurology, 95(5), e488–e498.

  • Atanasov, P., Witkowski, J., Ungar, L., Mellers, B., & Tetlock, P. (2020b). Small steps to accuracy: Incremental belief updaters are better forecasters. Organizational Behavior and Human Decision Processes, 160, 19–35.

  • Atanasov, P., Joseph, R., Feijoo, F., Marshall, M., & Siddiqui, S. (2022a). Human forest vs. random forest in time-sensitive Covid-19 clinical trial prediction. Working Paper.

  • Atanasov, P., Witkowski, J., Mellers, B., & Tetlock, P. (2022b) Crowdsourced prediction systems: Markets, polls, and elite forecasters. Working Paper.

  • Augenblick, N., & Rabin, M. (2021). Belief movement, uncertainty reduction, and rational updating. The Quarterly Journal of Economics, 136(2), 933–985.

  • Bandalos, D. L. (2018). Measurement theory and applications for the social sciences. Guilford Publications.

  • Baron, J. (2000). Thinking and deciding. Cambridge University Press.

  • Baron, J., Scott, S., Fincher, K., & Metz, S. E. (2015). Why does the cognitive reflection test (sometimes) predict utilitarian moral judgment (and other things)? Journal of Applied Research in Memory and Cognition, 4(3), 265–284.

  • Barrick, M. R., & Mount, M. K. (1991). The big five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44(1), 1–26.

  • Beard, S., Rowe, T., & Fox, J. (2020). An analysis and evaluation of methods currently used to quantify the likelihood of existential hazards. Futures, 115, 102469.

  • Benjamin, D., Mandel, D. R., & Kimmelman, J. (2017). Can cancer researchers accurately judge whether preclinical reports will reproduce? PLoS Biology, 15(6), e2002212.

  • Bennett, S., & Steyvers, M. (2022). Leveraging metacognitive ability to improve crowd accuracy via impossible questions. Decision, 9(1), 60–73.

  • Bland, J. M., & Altman, D. G. (2011). Correlation in restricted ranges of data. BMJ: British Medical Journal, 342.

  • Blattberg, R. C., & Hoch, S. J. (1990). Database models and managerial intuition: 50% model + 50% manager. Management Science, 36(8), 887–1009.

  • Bo, Y. E., Budescu, D. V., Lewis, C., Tetlock, P. E., & Mellers, B. (2017). An IRT forecasting model: Linking proper scoring rules to item response theory. Judgment & Decision Making, 12(2), 90–103.

  • Bors, D. A., & Stokes, T. L. (1998). Raven’s advanced progressive matrices: Norms for first-year university students and the development of a short form. Educational and Psychological Measurement, 58(3), 382–398.

  • Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3.

  • Broomell, S. B., & Budescu, D. V. (2009). Why are experts correlated? Decomposing correlations between judges. Psychometrika, 74(3), 531–553.

  • Bruine de Bruin, W., Parker, A. M., & Fischhoff, B. (2007). Individual differences in adult decision-making competence. Journal of Personality and Social Psychology, 92(5), 938–956.

  • Budescu, D. V., Weinberg, S., & Wallsten, T. S. (1988). Decisions based on numerically and verbally expressed uncertainties. Journal of Experimental Psychology: Human Perception and Performance, 14(2), 281–294.

  • Budescu, D. V., & Chen, E. (2015). Identifying expertise to extract the wisdom of crowds. Management Science, 61(2), 267–280.

  • Budescu, D. V., Himmelstein, M., & Ho, E. (2021, October). Boosting the wisdom of crowds with social forecasts and coherence measures. Presented at the annual meeting of the Society of Multivariate Experimental Psychology (SMEP).

  • Burgman, M. A., McBride, M., Ashton, R., Speirs-Bridge, A., Flander, L., Wintle, B., Fidler, F., Rumpff, L., & Twardy, C. (2011). Expert status and performance. PLoS One, 6(7), e22998.

  • Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and Social Psychology, 42(1), 116–131.

  • Chang, W., Atanasov, P., Patil, S., Mellers, B., & Tetlock, P. (2017). Accountability and adaptive performance: The long-term view. Judgment and Decision making, 12(6), 610–626.

  • Chen, E., Budescu, D. V., Lakshmikanth, S. K., Mellers, B. A., & Tetlock, P. E. (2016). Validating the contribution-weighted model: Robustness and cost-benefit analyses. Decision Analysis, 13(2), 128–152.

  • Cokely, E. T., Galesic, M., Schulz, E., Ghazal, S., & Garcia-Retamero, R. (2012). Measuring risk literacy: The Berlin numeracy test. Judgment and Decision making, 7(1), 25–47.

  • Collins, R. N., Mandel, D. R., Karvetski, C. W., Wu, C. M., & Nelson, J. D. (2021). The wisdom of the coherent: Improving correspondence with coherence-weighted aggregation. Preprint available at PsyArXiv. Retrieved from https://psyarxiv.com/fmnty/

  • Collins, R., Mandel, D., & Budescu, D. (2022). Performance-weighted aggregation: Ferreting out wisdom within the crowd. In M. Seifert (Ed.), Judgment in predictive analytics. Springer [Reference to be updated with page numbers].

  • Cooke, R. (1991). Experts in uncertainty: Opinion and subjective probability in science. Oxford University Press.

  • Costa, P. T., Jr., & McCrae, R. R. (2008). The revised neo personality inventory (NEO-PI-R). Sage.

  • Cowgill, B., & Zitzewitz, E. (2015). Corporate prediction markets: Evidence from Google, Ford, and Firm X. The Review of Economic Studies, 82(4), 1309–1341.

  • Dana, J., Atanasov, P., Tetlock, P., & Mellers, B. (2019). Are markets more accurate than polls? The surprising informational value of “just asking”. Judgment and Decision making, 14(2), 135–147.

  • Davis-Stober, C. P., Budescu, D. V., Dana, J., & Broomell, S. B. (2014). When is a crowd wise? Decision, 1(2), 79–101.

  • Dieckmann, N. F., Gregory, R., Peters, E., & Hartman, R. (2017). Seeing what you want to see: How imprecise uncertainty ranges enhance motivated reasoning. Risk Analysis, 37(3), 471–486.

  • Embretson, S. E., & Reise, S. P. (2013). Item response theory. Psychology Press.

  • Epstein, D. (2019). Range: How generalists triumph in a specialized world. Pan Macmillan.

  • Fan, Y., Budescu, D. V., Mandel, D., & Himmelstein, M. (2019). Improving accuracy by coherence weighting of direct and ratio probability judgments. Decision Analysis, 16, 197–217.

  • Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives, 19(4), 25–42.

  • Galton, F. (1907). Vox populi (the wisdom of crowds). Nature, 75(7), 450–451.

  • Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378.

  • Goldstein, D. G., McAfee, R. P., & Suri, S. (2014, June). The wisdom of smaller, smarter crowds. In Proceedings of the Fifteenth ACM Conference on Economics and Computation (pp. 471–488).

  • Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society: Series B (Methodological), 14(1), 107–114.

  • Hanea, A. D., Wilkinson, D., McBride, M., Lyon, A., van Ravenzwaaij, D., Singleton Thorn, F., Gray, C., Mandel, D. R., Willcox, A., Gould, E., Smith, E., Mody, F., Bush, M., Fidler, F., Fraser, H., & Wintle, B. (2021). Mathematically aggregating experts’ predictions of possible futures. PLoS One, 16(9), e0256919. https://doi.org/10.1371/journal.pone.0256919

  • Haran, U., Ritov, I., & Mellers, B. A. (2013). The role of actively open-minded thinking in information acquisition, accuracy, and calibration. Judgment and Decision making, 8(3), 188–201.

  • Hastie, T., Qian, J., & Tay, K. (2021). An introduction to glmnet. CRAN R Repository.

  • Himmelstein, M., Atanasov, P., & Budescu, D. V. (2021). Forecasting forecaster accuracy: Contributions of past performance and individual differences. Judgment & Decision Making, 16(2), 323–362.

  • Himmelstein, M., Budescu, D. V., & Han, Y. (2023a). The wisdom of timely crowds. In M. Seifert (Ed.), Judgment in predictive analytics. Springer.

  • Himmelstein, M., Budescu, D. V., & Ho, E. (2023b). The wisdom of many in few: Finding individuals who are as wise as the crowd. Journal of Experimental Psychology: General. Advance online publication.

  • Ho, E. H. (2020, June). Developing and validating a method of coherence-based judgment aggregation. Unpublished PhD Dissertation. Fordham University, Bronx NY.

  • Horowitz, M., Stewart, B. M., Tingley, D., Bishop, M., Resnick Samotin, L., Roberts, M., Chang, W., Mellers, B., & Tetlock, P. (2019). What makes foreign policy teams tick: Explaining variation in group performance at geopolitical forecasting. The Journal of Politics, 81(4), 1388–1404.

  • Joseph, R., & Atanasov, P. (2019). Predictive training and accuracy: Self-selection and causal factors. Working Paper, Presented at Collective Intelligence 2019.

  • Karger, E., Monrad, J., Mellers, B., & Tetlock, P. (2021). Reciprocal scoring: A method for forecasting unanswerable questions. Retrieved from SSRN

  • Karger, J., Atanasov, P., & Tetlock, P. (2022). Improving judgments of existential risk: Better forecasts, questions, explanations, policies. SSRN Working Paper.

  • Karvetski, C. W., Olson, K. C., Mandel, D. R., & Twardy, C. R. (2013). Probabilistic coherence weighting for optimizing expert forecasts. Decision Analysis, 10(4), 305–326.

  • Karvetski, C. W., Meinel, C., Maxwell, D. T., Lu, Y., Mellers, B. A., & Tetlock, P. E. (2021). What do forecasting rationales reveal about thinking patterns of top geopolitical forecasters? International Journal of Forecasting, 38(2), 688–704.

  • Kurvers, R. H., Herzog, S. M., Hertwig, R., Krause, J., Moussaid, M., Argenziano, G., Zalaudek, I., Carney, P. A., & Wolf, M. (2019). How to detect high-performing individuals and groups: Decision similarity predicts accuracy. Science Advances, 5(11), eaaw9011.

  • Lipkus, I. M., Samsa, G., & Rimer, B. K. (2001). General performance on a numeracy scale among highly educated samples. Medical Decision Making, 21(1), 37–44.

  • Liu, Y., Wang, J., & Chen, Y. (2020, July). Surrogate scoring rules. In Proceedings of the 21st ACM Conference on Economics and Computation (pp. 853–871).

  • Mannes, A. E., Soll, J. B., & Larrick, R. P. (2014). The wisdom of select crowds. Journal of Personality and Social Psychology, 107(2), 276.

  • Matzen, L. E., Benz, Z. O., Dixon, K. R., Posey, J., Kroger, J. K., & Speed, A. E. (2010). Recreating Raven’s: Software for systematically generating large numbers of Raven-like matrix problems with normed properties. Behavior Research Methods, 42(2), 525–541.

  • Mauksch, S., Heiko, A., & Gordon, T. J. (2020). Who is an expert for foresight? A review of identification methods. Technological Forecasting and Social Change, 154, 119982.

  • McAndrew, T., Cambeiro, J., & Besiroglu, T. (2022). Aggregating human judgment probabilistic predictions of the safety, efficacy, and timing of a COVID-19 vaccine. Vaccine, 40(15), 2331–2341.

  • Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., Scott, S. E., Moore, D., Atanasov, P., Swift, S. A., Murray, T., Stone, E., & Tetlock, P. E. (2014). Psychological strategies for winning a geopolitical forecasting tournament. Psychological Science, 25(5), 1106–1115.

  • Mellers, B., Stone, E., Atanasov, P., Rohrbaugh, N., Metz, S. E., Ungar, L., Bishop, M. M., Horowitz, M., Merkle, E., & Tetlock, P. (2015a). The psychology of intelligence analysis: Drivers of prediction accuracy in world politics. Journal of Experimental Psychology: Applied, 21(1), 1.

  • Mellers, B., Stone, E., Murray, T., Minster, A., Rohrbaugh, N., Bishop, M., Chen, E., Baker, J., Hou, Y., Horowitz, M., Ungar, L., & Tetlock, P. (2015b). Identifying and cultivating superforecasters as a method of improving probabilistic predictions. Perspectives on Psychological Science, 10(3), 267–281.

  • Mellers, B. A., Baker, J. D., Chen, E., Mandel, D. R., & Tetlock, P. E. (2017). How generalizable is good judgment? A multitask, multi-benchmark study. Judgment and Decision making, 12(4), 369–381.

  • Merkle, E. C., Steyvers, M., Mellers, B., & Tetlock, P. E. (2016). Item response models of probability judgments: Application to a geopolitical forecasting tournament. Decision, 3(1), 1–19.

  • Milkman, K. L., Gandhi, L., Patel, M. S., Graci, H. N., Gromet, D. M., Ho, H., Kay, J. S., Lee, T. W., Rothschild, J., Bogard, J. E., Brody, I., Chabris, C. F., & Chang, E. (2022). A 680,000-person megastudy of nudges to encourage vaccination in pharmacies. Proceedings of the National Academy of Sciences, 119(6), e2115126119.

  • Miller, N., Resnick, P., & Zeckhauser, R. (2005). Eliciting informative feedback: The peer-prediction method. Management Science, 51(9), 1359–1373.

  • Morstatter, F., Galstyan, A., Satyukov, G., Benjamin, D., Abeliuk, A., Mirtaheri, M., et al. (2019). SAGE: A hybrid geopolitical event forecasting system. IJCAI, 1, 6557–6559.

  • Murphy, A. H., & Winkler, R. L. (1987). A general framework for forecast verification. Monthly Weather Review, 115(7), 1330–1338.

  • Palley, A. B., & Soll, J. B. (2019). Extracting the wisdom of crowds when information is shared. Management Science, 65(5), 2291–2309.

  • Peters, E., Västfjäll, D., Slovic, P., Mertz, C. K., Mazzocco, K., & Dickert, S. (2006). Numeracy and decision making. Psychological Science, 17(5), 407–413.

  • Predd, J. B., Osherson, D. N., Kulkarni, S. R., & Poor, H. V. (2008). Aggregating probabilistic forecasts from incoherent and abstaining experts. Decision Analysis, 5(4), 177–189.

  • Prelec, D. (2004). A Bayesian truth serum for subjective data. Science, 306(5695), 462–466.

  • Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, 34, 1–97.

  • Seifert, M., Siemsen, E., Hadida, A. L., & Eisingerich, A. B. (2015). Effective judgmental forecasting in the context of fashion products. Journal of Operations Management, 36, 33–45.

  • Sell, T. K., Warmbrod, K. L., Watson, C., Trotochaud, M., Martin, E., Ravi, S. J., Balick, M., & Servan-Schreiber, E. (2021). Using prediction polling to harness collective intelligence for disease forecasting. BMC Public Health, 21(1), 1–9.

  • Shipley, W. C., Gruber, C. P., Martin, T. A., & Klein, A. M. (2009). Shipley-2 manual. Western Psychological Services.

  • Stanovich, K. E., & West, R. F. (1997). Reasoning independently of prior belief and individual differences in actively open-minded thinking. Journal of Educational Psychology, 89(2), 342–357.

  • Stewart, T. R., Roebber, P. J., & Bosart, L. F. (1997). The importance of the task in analyzing expert judgment. Organizational Behavior and Human Decision Processes, 69(3), 205–219.

  • Suedfeld, P., & Tetlock, P. (1977). Integrative complexity of communications in international crises. Journal of Conflict Resolution, 21(1), 169–184.

  • Tannenbaum, D., Fox, C. R., & Ülkümen, G. (2017). Judgment extremity and accuracy under epistemic vs. aleatory uncertainty. Management Science, 63(2), 497–518.

  • Tetlock, P. E. (2005). Expert political judgment. Princeton University Press.

  • Tetlock, P. E., & Gardner, D. (2016). Superforecasting: The art and science of prediction. Random House.

  • Toplak, M. E., West, R. F., & Stanovich, K. E. (2014). Assessing miserly information processing: An expansion of the cognitive reflection test. Thinking & Reasoning, 20(2), 147–168.

  • Tsai, J., & Kirlik, A. (2012). Coherence and correspondence competence: Implications for elicitation and aggregation of probabilistic forecasts of world events. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting (Vol. 56, pp. 313–317). Sage.

  • Wallsten, T. S., Budescu, D. V., & Zwick, R. (1993). Comparing the calibration and coherence of numerical and verbal probability judgments. Management Science, 39(2), 176–190.

  • Webster, D. M., & Kruglanski, A. W. (1994). Individual differences in need for cognitive closure. Journal of Personality and Social Psychology, 67(6), 1049–1062.

  • Witkowski, J., & Parkes, D. (2012). A robust bayesian truth serum for small populations. Proceedings of the AAAI Conference on Artificial Intelligence, 26(1), 1492–1498.

  • Witkowski, J., Atanasov, P., Ungar, L., & Krause, A. (2017). Proper proxy scoring rules. Presented at AAAI-17: Thirty-First AAAI Conference on Artificial Intelligence.

  • Zong, S., Ritter, A., & Hovy, E. (2020). Measuring forecasting skill from text. arXiv preprint arXiv:2006.07425.

Acknowledgments

We thank Matthias Seifert, David Budescu, David Mandel, Stefan Herzog and Philip Tetlock for helpful suggestions. All remaining errors are our own. No project-specific funding was used for the completion of this chapter.

Corresponding author

Correspondence to Pavel Atanasov.

Appendix: Methodological Details of Selected Predictors

1.1 Item Response Theory Models

In forecasting, one such confounder is the timing of forecasts. In forecasting tournaments, forecasters make many forecasts about the same problems at various time points. Those who forecast problems closer to their resolution date have an accuracy advantage, which may be important to account for when assessing their talent level (for more detail, see Himmelstein et al., this volume). IRT models can be extended so that their diagnostic properties change relative to the time point at which a forecaster makes their forecast. One such model is given below (Himmelstein et al., 2021; Merkle et al., 2016).

$$ NB_{f,q,d}=b_{0,q}+\left(b_{1,q}-b_{0,q}\right)e^{-b_2 t_{f,q,d}}+\lambda_q\theta_f+\epsilon_{f,q,d} $$

The three b parameters represent how an item’s difficulty changes as time passes: b0,q represents an item’s maximum difficulty (as time to resolution goes to infinity), b1,q an item’s minimum difficulty (immediately prior to resolution), and b2 the shape of the curve between b0,q and b1,q based on how much time remains in the question at the time of the forecast (tf,q,d). The other two parameters represent how well an item discriminates between forecasters of different skill levels (λq) and how skilled the individual forecasters are (θf). As the estimate of forecaster skill, talent spotters will typically be most interested in the θf parameter, which is conventionally scaled to follow a standard normal distribution, θf ~ N(0, 1), with a score of 0 indicating an average forecaster, −1 a forecaster 1 SD below average, and 1 a forecaster 1 SD above average.
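To make the model’s moving parts concrete, here is a minimal sketch in Python. It is an illustration under assumed parameter values, not the authors’ estimation code: it simply evaluates the expected score implied by the equation above for a given lead time and forecaster; estimating λq and θf in practice requires a full IRT fitting routine.

```python
import numpy as np

def expected_score(t, b0_q, b1_q, b2, lambda_q, theta_f):
    """Expected score implied by the time-dependent IRT-style model above.

    t        -- time remaining until question resolution (e.g., in days)
    b0_q     -- maximum item difficulty (as time to resolution grows large)
    b1_q     -- minimum item difficulty (immediately prior to resolution)
    b2       -- decay rate shaping the curve between b0_q and b1_q
    lambda_q -- item discrimination
    theta_f  -- forecaster skill, conventionally scaled as N(0, 1)
    """
    return b0_q + (b1_q - b0_q) * np.exp(-b2 * t) + lambda_q * theta_f

# Hypothetical parameter values: with scores oriented so that lower is better
# (Brier-like), a negative lambda_q means more skilled forecasters (higher
# theta_f) are expected to earn lower, i.e., better, scores.
for theta in (-1.0, 0.0, 1.0):
    print(theta, expected_score(t=30, b0_q=0.5, b1_q=0.1, b2=0.05,
                                lambda_q=-0.1, theta_f=theta))
```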

One potential problem with this model is that, in some cases, the distribution of Brier scores is not well behaved. This typically occurs in datasets with many binary questions, where the Brier score is a direct function of the probability assigned to the correct option. In such cases, the distribution of Brier scores can be multi-modal, because forecasters tend to enter many extreme and round-number probability estimates, such as 0, .5, and 1 (Bo et al., 2017; Budescu et al., 1988; Merkle et al., 2016; Wallsten et al., 1993). To accommodate such multi-modal distributions, one option is to discretize the distribution of Brier scores into bins and reconfigure the model as an ordinal response model. Models of this kind, such as the graded response model (Samejima, 1969), have a long history in the IRT literature.
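As a small illustration of the binning step, the sketch below discretizes per-forecast Brier scores into ordered categories; the cut points are hypothetical, chosen only to show the mechanics, and an ordinal IRT model would then be fit to the resulting categories.

```python
import numpy as np

# Minimal sketch (hypothetical cut points): discretize Brier scores in [0, 2]
# into ordered bins suitable for an ordinal (graded-response-style) IRT model.
brier = np.array([0.00, 0.02, 0.18, 0.60, 1.20, 2.00])  # per-forecast Brier scores
edges = np.array([0.10, 0.50, 1.00, 1.50])              # hypothetical bin boundaries
ordinal = np.digitize(brier, edges)                     # 0 = best bin, 4 = worst bin
print(ordinal)                                          # -> [0 0 1 2 3 4]
```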

Merkle et al. (2016) and Bo et al. (2017) describe examples of ordinal IRT models for forecasting judgment. However, the former found that the continuous and ordinal versions of the model were highly correlated (r = .87) in their assessments of forecaster ability, and that disagreements tended to concentrate on poorly performing forecasters (who tend to make large errors) rather than on high-performing forecasters.

1.2 Contribution Scores

To obtain contribution scores for individual forecasters, it is necessary to first define some aggregation method for all of their judgments on each question. The simplest, and most common, form of aggregation is to take the mean of all probabilities for each event associated with a forecasting problem. The aggregate probability (AP) for each of the C events associated with a forecasting question, taken across all forecasters, is

$$ AP_{q,c}=\frac{\sum_{f=1}^{F}p_{q,c,f}}{F} $$

And the aggregate Brier score (AB) would then be

$$ AB_q=\sum_{c=1}^{C}\left(AP_{q,c}-y_{q,c}\right)^2 $$
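The two formulas can be computed directly from a matrix of forecasts. The sketch below uses a small hypothetical example (three forecasters, one two-event question) and is meant only to mirror the definitions of AP and AB above.

```python
import numpy as np

# Minimal sketch (hypothetical data): aggregate probability and aggregate Brier
# score for a single question. Rows of p are forecasters, columns are events.
p = np.array([[0.7, 0.3],
              [0.6, 0.4],
              [0.9, 0.1]])   # F = 3 forecasters, C = 2 events
y = np.array([1.0, 0.0])     # the first event occurred

AP = p.mean(axis=0)          # unweighted mean probability per event
AB = np.sum((AP - y) ** 2)   # aggregate Brier score for the question
print(AP, AB)                # -> [0.733... 0.266...] 0.142...
```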

Based on this aggregation approach, defining the contribution of individual forecasters to the aggregate is algebraically straightforward. We can define APq,c,−f, the aggregate probability with an individual forecaster’s judgment removed, as

$$ AP_{q,c,-f}=\frac{F\cdot AP_{q,c}-p_{q,c,f}}{F-1} $$

And the aggregate Brier score with an individual forecaster’s judgment removed as

$$ AB_{q,-f}=\sum_{c=1}^{C}\left(AP_{q,c,-f}-y_{q,c}\right)^2 $$

Finally, we define a forecaster’s average contribution to the accuracy of the aggregate crowd forecasts as

$$ C_f=\frac{\sum_{q=1}^{Q}\left(AB_{q,-f}-AB_q\right)}{Q} $$

Cf is a representation of how much information a forecaster brings to the table, on average, that is both unique and beneficial. It is possible that a forecaster ranked very highly on individual accuracy might be ranked lower in terms of contribution, because their forecasts tended to be very similar to those of others, and so they did less to move the needle when averaged into the crowd.
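Putting the pieces together, the sketch below computes leave-one-out contribution scores on a small hypothetical dataset, using the algebraic identity above to remove each forecaster from the mean. It is an illustration of the definitions, not the implementation used by Budescu and Chen (2015) or Chen et al. (2016).

```python
import numpy as np

# Minimal sketch (hypothetical data): leave-one-out contribution scores.
# p[q][f, c] is forecaster f's probability for event c of question q;
# y[q][c] is 1 if event c of question q occurred, else 0.
p = [np.array([[0.7, 0.3], [0.6, 0.4], [0.9, 0.1]]),
     np.array([[0.2, 0.8], [0.4, 0.6], [0.1, 0.9]])]
y = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

F, Q = p[0].shape[0], len(p)
contribution = np.zeros(F)
for pq, yq in zip(p, y):
    AP = pq.mean(axis=0)                          # crowd forecast with everyone
    AB = np.sum((AP - yq) ** 2)
    for f in range(F):
        AP_wo_f = (F * AP - pq[f]) / (F - 1)      # aggregate with forecaster f removed
        AB_wo_f = np.sum((AP_wo_f - yq) ** 2)
        contribution[f] += (AB_wo_f - AB) / Q     # positive: crowd is worse without f

print(contribution)  # forecasters whose removal hurts the crowd score highest
```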

Both weighting members of the crowd by average contribution scores and selecting positive or high-performing contributors have been shown to improve the aggregate crowd judgment (Budescu & Chen, 2015; Chen et al., 2016). The approach is especially appealing because it can be extended into a dynamic model that updates contribution scores for each member of a crowd as more information about their performance becomes available; it requires relatively little information about past performance to reliably identify high-performing contributors; and it is cost effective, in that it can select a relatively small group of high-performing contributors who produce an aggregate judgment that matches or exceeds the judgment of larger crowds in terms of accuracy (Chen et al., 2016).

Contribution assessment was initially designed with a particular goal in mind: to improve the aggregate wisdom of the crowd (Budescu & Chen, 2015; Chen et al., 2016). One might regard this as a slightly narrower goal than pure talent spotting. It is clearly an effective tool for maximizing crowd wisdom, but is it a valid tool for assessing expertise? The answer appears to be yes. Chen et al. (2016) not only studied contribution scores as an aggregation tool but also tested how well contribution scores perform at selecting forecasters known to have a skill advantage through various manipulations known to benefit expertise, such as explicit training and interactive collaboration.

Table 6.6 Correlation matrix for measures in Study 2. Pearson correlation coefficients are reported. Below-diagonal values are assessed in-sample; above-diagonal values are calculated out-of-sample. Diagonal values (highlighted in gray) are cross-sample reliability coefficients. The bottom five measures are not question-specific, so out-of-sample correlation coefficients and cross-sample reliability coefficients are not applicable for them

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Atanasov, P., Himmelstein, M. (2023). Talent Spotting in Crowd Prediction. In: Seifert, M. (eds) Judgment in Predictive Analytics. International Series in Operations Research & Management Science, vol 343. Springer, Cham. https://doi.org/10.1007/978-3-031-30085-1_6
