Computationally efficient scoring of activity using demographics and connectivity of entities

Abstract

Consider a collection of entities, each of which may have demographic properties, and which may be linked in a network structure, possibly a social one. Some of these entities are “of interest”; we call them active. What is the relative likelihood that each of the other entities is active? AFDL (Activity from Demographics and Links) is an algorithm designed to answer this question in a computationally efficient manner. AFDL can work with demographic data, link data (including noisy links), or both, and it can process very large datasets quickly. This paper describes AFDL’s feature extraction and classification algorithms, gives timing and accuracy results obtained for several datasets, and offers suggestions for its use in real-world situations.

Notes

  1.

    AFDL and NetKit were run on a machine with dual AMD Opteron 242 CPUs (1,600 MHz) and 8 GB of RAM under CentOS 4 x86_64, except for the NetKit IMDB runs, which were executed on a faster machine with more memory: quad AMD Opteron 844 CPUs (1,800 MHz) with 32 GB of RAM. We obtained NetKit from http://www.research.rutgers.edu/~sofmac/NetKit.html and ran it without modification, using its default parameter settings for this setup: local classifier = null, relational classifier = wvRN [9], collective inference = relaxation labeling [14].
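    To make the baseline configuration in this note concrete, the following is a minimal sketch of a weighted-vote relational neighbor (wvRN) scorer combined with a relaxation-labeling style update. It is not NetKit's code and not AFDL. The function name `wvrn_relaxation`, the edge and label formats, and the parameter values (`prior`, `iters`, `k`, `alpha`) are all assumptions chosen for illustration. With a null local classifier, unlabeled entities simply start at a class prior.

```python
from collections import defaultdict


def wvrn_relaxation(edges, known, prior=0.5, iters=50, k=1.0, alpha=0.99):
    """Score nodes with a wvRN-style weighted neighbor vote (illustrative only).

    edges: iterable of (u, v, weight) undirected links
    known: dict mapping node -> 1.0 (active) or 0.0 (not active)
    Returns a dict mapping node -> estimated probability of being active.
    All parameter defaults are assumptions, not NetKit's documented values.
    """
    nbrs = defaultdict(list)
    for u, v, w in edges:
        nbrs[u].append((v, w))
        nbrs[v].append((u, w))

    # Null local classifier: every unlabeled node starts at the class prior.
    score = {n: known.get(n, prior) for n in nbrs}

    beta = k
    for _ in range(iters):
        new = dict(score)
        for n in nbrs:
            if n in known:
                continue  # labeled nodes keep their known value
            total = sum(w for _, w in nbrs[n])
            if total == 0:
                continue  # no usable link weight; keep the current score
            # wvRN: weighted average of the neighbors' current scores.
            vote = sum(w * score[m] for m, w in nbrs[n]) / total
            # Relaxation step: blend the new vote with the previous estimate.
            new[n] = beta * vote + (1 - beta) * score[n]
        score = new  # synchronous update of all nodes
        beta *= alpha  # gradually give more weight to earlier estimates
    return score


if __name__ == "__main__":
    links = [("a", "b", 1.0), ("b", "c", 1.0), ("c", "d", 1.0)]
    labels = {"a": 1.0, "d": 0.0}
    for node, s in sorted(wvrn_relaxation(links, labels).items()):
        print(node, round(s, 3))
```

    In this toy chain, the unlabeled node next to the active entity ends up with a higher score than the one next to the inactive entity, which is the guilt-by-association behavior a link-only baseline of this kind relies on.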

References

  1.

    Getoor L, Diehl CP (2005) Link mining: a survey. SIGKDD Explorations 7(2):3–12

  2.

    Domingos P (2003) Prospects and challenges for multi-relational data mining. SIGKDD Explorations 5(1):80–83

  3.

    Fawcett T, Provost F (2003) Adaptive fraud detection. Data Min Knowl Disc 3:291–316

  4.

    Cortes C, Pregibon D, Volinsky C (2004) Communities of interest. In: Proceedings of intelligent data analysis (IDA)

  5.

    Neville J, Simsek O, Jensen D, Komoroske J, Palmer K, Goldberg H (2005) Using relational knowledge discovery to prevent securities fraud. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery and data mining (KDD-05)

  6.

    Kubica J, Moore A, Cohn D, Schneider J (2003) A fast graph-based method for link analysis and queries. In: Proceedings of the 2003 IJCAI text-mining & link-analysis workshop

  7.

    Kubica J, Moore A, Schneider J, Yang Y (2002) Stochastic link and group detection. In: Proceedings of the eighteenth national conference on artificial intelligence

  8.

    Macskassy SA, Provost F (2006) A brief survey of machine learning methods for classification in networked data and an application to suspicion scoring. In: Workshop on statistical network learning at the 23rd international conference on machine learning (ICML 2006), Pittsburgh, PA, USA, June 2006

  9.

    Macskassy SA, Provost F (2003) A simple relational classifier. In: Proceedings of the multi-relational data mining workshop (MRDM) at the ninth ACM SIGKDD international conference on knowledge discovery and data mining

  10.

    Macskassy SA, Provost F (2005) Suspicion scoring based on guilt-by-association, collective inference, and focused data access. In: International conference on intelligence analysis

  11.

    Macskassy SA, Provost F (2006) Classification in networked data: a toolkit and a univariate case study. J Mach Learn Res (forthcoming)

  12.

    Komarek P (2004) Logistic regression for data mining and high-dimensional classification. PhD thesis, Carnegie Mellon University

  13.

    Dubrawski A (1997) Stochastic validation for automated tuning of neural network’s hyper-parameters. J Rob Auton Syst 21(1):89–93 Elsevier Science Publishers

  14.

    Chakrabarti S, Dom B, Indyk P (1998) Enhanced hypertext categorization using hyperlinks. In: ACM SIGMOD international conference on management of data

  15.

    Box GEP, Draper NR (1987) Empirical model building and response surfaces. Wiley

  16.

    Moore A, Schneider J (1995) Memory based stochastic optimization. In: Advances in neural information processing systems (NIPS 8)

Author information

Corresponding author

Correspondence to Artur W. Dubrawski.

About this article

Cite this article

Dubrawski, A.W., Ostlund, J.K., Chen, L. et al. Computationally efficient scoring of activity using demographics and connectivity of entities. Inf Technol Manag 11, 77–89 (2010). https://doi.org/10.1007/s10799-010-0069-y

Keywords

  • Link analysis
  • Suspicion scoring
  • Multi-relational data mining
  • Graph algorithms