Analyzing topics and authors in chat logs for crime investigation

Knowledge and Information Systems

Cybercriminals use the Internet to carry out illegitimate activities and to execute catastrophic attacks. Computer-mediated communication, such as online chat, provides an anonymous channel for predators to exploit victims. To prosecute criminals in a court of law, an investigator often needs to extract evidence from a large volume of chat messages. Most existing search tools are keyword-based, so the quality of the retrieved results depends on the search terms the investigator provides. Given the large volume of chat messages and the large number of participants in public chat rooms, this process is often time-consuming and error-prone. This paper presents a topic search model that analyzes archives of chat logs to separate crime-relevant logs from the rest. Specifically, we propose an extension of a Latent Dirichlet Allocation (LDA)-based model to extract topics, compute the contribution of authors to these topics, and study the transitions of these topics over time. In addition, we present a special model for characterizing author-topic relationships over time. This is crucial for investigation because it shows which authors are involved in which topics over time. Experiments on two real-life datasets suggest that the proposed approach can discover hidden criminal topics and the distribution of authors over these topics.
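
To make the three quantities named in the abstract concrete (topics, author contributions to topics, and topic shares over time), here is a minimal sketch. It is not the paper's extended model: it runs plain LDA with collapsed Gibbs sampling and then simply aggregates the sampled topic assignments by author and by time period. The toy messages, the function names `gibbs_lda` and `topic_shares`, and all hyperparameter values are assumptions made for illustration only.

```python
import random
from collections import defaultdict

# Hypothetical toy chat log: (author, period, tokens). Not data from the paper.
MESSAGES = [
    ("alice", 0, ["price", "gram", "deal", "cash"]),
    ("bob",   0, ["game", "score", "team", "win"]),
    ("alice", 1, ["deal", "cash", "gram", "meet"]),
    ("carol", 1, ["team", "win", "game", "score"]),
    ("bob",   2, ["score", "game", "team", "cash"]),
    ("carol", 2, ["meet", "deal", "price", "gram"]),
]


def gibbs_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Plain LDA via collapsed Gibbs sampling.

    docs is a list of token lists. Returns z, where z[d][i] is the topic
    sampled for token i of document d after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]           # n(d, k)
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                          # n(k)

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return z


def topic_shares(keys, z, num_topics):
    """Fraction of tokens assigned to each topic, grouped by key.

    keys[d] is the group (author or period) of document d.
    """
    counts = defaultdict(lambda: [0] * num_topics)
    for key, assignments in zip(keys, z):
        for k in assignments:
            counts[key][k] += 1
    return {key: [c / sum(row) for c in row] for key, row in counts.items()}


NUM_TOPICS = 2
z = gibbs_lda([m[2] for m in MESSAGES], NUM_TOPICS)
by_author = topic_shares([m[0] for m in MESSAGES], z, NUM_TOPICS)
by_period = topic_shares([m[1] for m in MESSAGES], z, NUM_TOPICS)
print("topic share per author:", by_author)
print("topic share per period:", by_period)
```

Aggregating after sampling keeps the sketch short, but it treats authorship and time as labels applied to a finished LDA run. A model that characterizes authors and topics over time jointly, as the abstract describes, would build those variables into the generative process itself.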



The authors would like to thank the anonymous reviewers for their constructive comments that greatly helped improve this paper. The research is supported in part by research grants from Le Fonds québécois de la recherche sur la nature et les technologies (FQRNT) new researchers start-up program, Concordia ENCS seed funding program, and the National Cyber-Forensics and Training Alliance Canada (NCFTA Canada).

Author information

Corresponding author

Correspondence to Benjamin C. M. Fung.

About this article

Cite this article

M. A. Basher, A.R., Fung, B.C.M. Analyzing topics and authors in chat logs for crime investigation. Knowl Inf Syst 39, 351–381 (2014).
