Automatically Geotagging Articles in the Welsh Newspapers Online Collection
The National Library of Wales’ Welsh Newspapers Online collection comprises over 16 million articles from historic newspapers. It is stored in NLW’s institutional repository and is a rich source of historic text. The text of the articles has been extracted from the digitised images using OCR. This project investigates methods for determining which articles can be automatically assigned to places within Wales. We use machine learning and text mining, with OpenStreetMap data serving as a gazetteer.
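The gazetteer idea can be illustrated with a minimal sketch. This is not the project's pipeline: the `GAZETTEER` dictionary, its approximate coordinates, and the `find_places` helper are all hypothetical stand-ins for place names drawn from OpenStreetMap. The sketch matches single-word names only and does no disambiguation or OCR error correction.

```python
import re

# Hypothetical mini-gazetteer: lower-cased place name -> (latitude, longitude).
# Coordinates are approximate and for illustration only; a real gazetteer
# would be built from OpenStreetMap data.
GAZETTEER = {
    "aberystwyth": (52.415, -4.083),
    "cardiff": (51.481, -3.179),
    "bangor": (53.228, -4.129),
}


def find_places(text, gazetteer=GAZETTEER):
    """Return (name, coords) pairs for gazetteer names found in the text,
    in order of first appearance, without duplicates."""
    # \w+ is Unicode-aware in Python 3, so accented Welsh letters stay in one token.
    tokens = re.findall(r"\w+", text.lower())
    seen = []
    for token in tokens:
        if token in gazetteer and token not in seen:
            seen.append(token)
    return [(name, gazetteer[name]) for name in seen]


article = "A meeting was held at Aberystwyth, attended by delegates from Cardiff."
matches = find_places(article)
for name, (lat, lon) in matches:
    print(f"{name}: {lat}, {lon}")
```

A plain lookup like this cannot tell Bangor in Gwynedd from other places called Bangor, and it misses names that OCR has misread, which is where text mining and machine learning would come in.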