Knowledge and Information Systems, Volume 39, Issue 2, pp 351–381

Analyzing topics and authors in chat logs for crime investigation

  • Abdur Rahman M. A. Basher
  • Benjamin C. M. Fung
Regular Paper

Abstract

Cybercriminals use the Internet to carry out illegitimate activities and to execute catastrophic attacks. Computer-mediated communication, such as online chat, provides an anonymous channel for predators to exploit victims. To prosecute criminals in a court of law, an investigator often needs to extract evidence from a large volume of chat messages. Most existing search tools are keyword-based, so the quality of the retrieved results depends on the search terms the investigator provides. Given the large volume of chat messages and the large number of participants in public chat rooms, the process is often time-consuming and error-prone. This paper presents a topic search model that analyzes archives of chat logs to separate crime-relevant logs from the rest. Specifically, we propose an extension of the Latent Dirichlet Allocation-based model to extract topics, compute the contribution of authors to these topics, and study the transitions of these topics over time. In addition, we present a special model for characterizing author-topic relationships over time. This is crucial for investigation because it provides a view of how authors' involvement in certain topics changes over time. Experiments on two real-life datasets suggest that the proposed approach can discover hidden criminal topics and the distribution of authors over these topics.
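To make the underlying technique concrete, the following is a minimal sketch of collapsed Gibbs sampling for plain Latent Dirichlet Allocation. It is not the extended author-topic or topics-over-time model proposed in the paper. The function name `lda_gibbs`, the hyperparameter defaults, and the toy corpus are all assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA (illustrative sketch only).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (theta, phi): per-document topic mixtures and per-topic word
    distributions, both estimated from the final sample.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # count of topic k in doc d
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                               # total words in topic k
    assign = []                                                  # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi

# Hypothetical toy corpus: word ids 0-1 and 2-3 tend to co-occur.
toy_docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 1, 0], [3, 2, 2, 3]]
theta, phi = lda_gibbs(toy_docs, num_topics=2, vocab_size=4)
for d, mixture in enumerate(theta):
    print(d, [round(p, 2) for p in mixture])
```

In a chat-log setting, each document could be one chat session or one author's messages, and `theta` would then give a rough view of which topics each session or author contributes to. Modeling authors and time stamps explicitly, as the abstract describes, requires additional latent variables that this sketch omits.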

Keywords

Latent Dirichlet Allocation (LDA); Topic modeling; Gibbs sampling; Topic evolution; Author-topics over time; Cybercrime

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Abdur Rahman M. A. Basher (1)
  • Benjamin C. M. Fung (1)
  1. Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Canada
