New/s/leak 2.0 – Multilingual Information Extraction and Visualization for Investigative Journalism

  • Gregor Wiedemann
  • Seid Muhie Yimam
  • Chris Biemann
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11186)


In recent years, investigative journalism has faced two major challenges: (1) vast amounts of unstructured data originating from large text collections such as leaks or answers to Freedom of Information requests, and (2) multilingual data resulting from intensified global cooperation and communication in politics, business and civil society. Faced with these challenges, journalists are increasingly cooperating in international networks. To support such collaborations, we present new/s/leak 2.0, the new version of our open-source software for content-based searching of leaks. It adds three main features: (1) automatic language detection and language-dependent information extraction for 40 languages, (2) entity and keyword visualization for efficient exploration, and (3) decentralized deployment for the analysis of confidential data in various formats. We illustrate the new analysis capabilities with a case study.


Information extraction, Investigative journalism, Data journalism, Named entity recognition, Keyterm extraction



The work was funded by the Volkswagen Foundation under grant no. 90 847.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Gregor Wiedemann (1)
  • Seid Muhie Yimam (1)
  • Chris Biemann (1)

  1. Language Technology Group, Department of Informatics, MIN Faculty, Universität Hamburg, Hamburg, Germany
