Learning from automatically labeled data: case study on click fraud prediction

Abstract

In the era of big data, both class labels and covariates may result from proprietary algorithms or ground models. The predictions of these ground models, however, are not the same as the unknown ground truth. Thus, the automatically generated class labels are inherently uncertain, making subsequent supervised learning from such data a challenging task. Fine-tuning a new classifier could mean that, at the extreme, this new classifier will try to replicate the decision heuristics of the ground model. However, few new insights can be expected from a model that merely tries to emulate another one. Here, we study this problem in the context of click fraud prediction from highly skewed data that were automatically labeled by a proprietary detection algorithm. We propose a new approach to generate click profiles for publishers of online advertisements. In a blinded test, our ensemble of random forests achieved an average precision of only 36.2%, meaning that our predictions do not agree very well with those of the ground model. We tried to elucidate this discrepancy and made several interesting observations. Our results suggest that supervised learning from automatically labeled data should be complemented by an interpretation of conflicting predictions between the new classifier and the ground model. If the ground truth is not known, then elucidating such disagreements might be more relevant than improving the performance of the new classifier.
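To make the reported metric concrete, the following is a minimal sketch of average precision for a ranked list of publishers. It uses one common definition (the mean of the precision values at each rank where a positive case occurs); the exact variant used in the blinded test is not stated in this abstract, and the function name, labels, and scores below are illustrative assumptions, not data from the study.

```python
def average_precision(labels, scores):
    """Mean of precision@k over the ranks k at which a positive (label 1) occurs.

    This is one common definition; it is an assumption that it matches
    the variant used in the study.
    """
    # Rank cases by decreasing score; the sort is stable, so ties keep input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precisions = []
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / k)
    # With no positive cases the metric is undefined; return 0.0 by convention here.
    return sum(precisions) / len(precisions) if precisions else 0.0


# Hypothetical example: 1 = labeled fraudulent by the ground model, 0 = not.
labels = [1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
ap = average_precision(labels, scores)
```

In this toy ranking the two positive cases sit at ranks 1 and 4, so the precision values 1/1 and 2/4 are averaged. Because the labels come from the ground model, a low value measures disagreement with that model rather than error against the unknown ground truth.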

Notes

  1. International Workshop on Fraud Detection in Mobile Advertising (FDMA 2012), in conjunction with the 4th Asian Conference on Machine Learning, 4 November 2012, Singapore; http://palanteer.sis.smu.edu.sg/fdma2012/.

  2. The common tenfold cross-validation is not advisable in this setting because too few cases of the minority class would be selected for each validation set.

     Fig. 2: Fourfold stratified cross-validation to build an ensemble of random forests for blinded testing.

  3. This information is not shown in Fig. 4 but can be easily verified from the raw data.

     Fig. 4: (a) The top five publishers and (b) the bottom five fraudulent publishers from Fig. 3, shown in close-up (only the first eight intervals are shown). The provided status is plausible for cases #4 and #5; for the remaining eight cases, the status is questionable.
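The point made in note 2 about fold size can be illustrated with a small sketch of stratified fold assignment. The class counts below (20 fraudulent among 1000 publishers) are hypothetical and chosen only to show the effect; the function names are illustrative, and training the random forests themselves is omitted.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Assign case indices to k folds, keeping class proportions roughly equal."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# Hypothetical skewed data set: 20 fraudulent publishers (1) among 1000.
labels = [1] * 20 + [0] * 980


def minority_per_fold(k):
    """Number of minority-class cases that land in each of the k validation folds."""
    return [sum(labels[i] for i in fold) for fold in stratified_folds(labels, k)]


four = minority_per_fold(4)   # five minority cases per validation fold
ten = minority_per_fold(10)   # only two minority cases per validation fold
```

With these assumed counts, each of four validation folds holds five minority cases, whereas each of ten folds holds only two, which makes per-fold performance estimates for the minority class much less stable.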

Acknowledgments

I thank the anonymous reviewers for their constructive comments, which helped me improve this manuscript.

Author information

Corresponding author

Correspondence to Daniel Berrar.

About this article

Cite this article

Berrar, D. Learning from automatically labeled data: case study on click fraud prediction. Knowl Inf Syst 46, 477–490 (2016). https://doi.org/10.1007/s10115-015-0827-6

Keywords

  • Classification
  • Click fraud prediction
  • Big data
  • Random forest
  • Ensemble learning