Person-Centric Mining of Historical Newspaper Collections

  • Mariona Coll Ardanuy
  • Jürgen Knauth
  • Andrei Beliankou
  • Maarten van den Bos
  • Caroline Sporleder
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9819)


We present a text mining environment that supports entity-centric mining of terascale historical newspaper collections. Information about entities and their relations to each other is often crucial for historical research. However, most text mining tools provide only very basic support for entities, typically limited to entity tagging. Historians, on the other hand, are usually interested in the relations between entities and the contexts in which they are mentioned. In this paper, we focus on person entities. We provide an overview of the tool and describe how person-centric mining can be integrated into a general-purpose text mining environment. We also discuss our approach to automatically extracting person networks from newspaper archives. It includes a novel method for person name disambiguation that is particularly suited to the newspaper domain and obtains state-of-the-art disambiguation results.


Multilingual text mining; Historical text mining; Person name disambiguation; Semantic search



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Mariona Coll Ardanuy (1), email author
  • Jürgen Knauth (1)
  • Andrei Beliankou (2)
  • Maarten van den Bos (3)
  • Caroline Sporleder (1), email author

  1. Göttingen Centre for Digital Humanities, Göttingen University, Göttingen, Germany
  2. Department of Computational Linguistics and Digital Humanities, Trier University, Trier, Germany
  3. Department of History and Art History, Utrecht University, Utrecht, The Netherlands
