Linking Archives Using Document Enrichment and Term Selection

  • Marc Bron
  • Bouke Huurnink
  • Maarten de Rijke
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6966)


News, multimedia and cultural heritage archives are increasingly offering opportunities to create connections between their collections. We consider the task of linking archives: connecting an item in one archive to one or more items in other, often complementary archives. We focus on a specific instance of the task: linking items with a rich textual representation in a news archive to items with sparse annotations in a multimedia archive, where items should be linked if they describe the same or a related event. We find that the difference in textual richness of annotations presents a challenge and investigate two approaches: (i) to enrich sparsely annotated items with textually rich content; and (ii) to reduce rich news archive items using term selection. We demonstrate the positive impact of both approaches on linking to same events and linking to related events.
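As an illustration only, the two approaches can be sketched in a few lines of Python. The abstract does not say which weighting scheme or retrieval model is used, so everything here is an assumption: the function names `select_terms` and `enrich` are hypothetical, term selection is approximated with a smoothed TF-IDF score, and enrichment with simple term overlap against an external collection.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; a real system would use a proper tokenizer.
    return text.lower().split()


def select_terms(article, background, k=5):
    """Approach (ii), sketched: keep the k highest-scoring terms of a rich article.

    TF-IDF against a background collection is an assumed scoring function,
    not necessarily the one used in the paper.
    """
    docs = [set(tokenize(d)) for d in background]
    n = len(docs)
    tf = Counter(tokenize(article))
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for d in docs if term in d)
        idf = math.log((n + 1) / (df + 1)) + 1.0  # smoothed inverse document frequency
        scores[term] = freq * idf
    # Highest score first; ties are broken alphabetically so the output is deterministic.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def enrich(sparse_item, external, top_n=1):
    """Approach (i), sketched: append text from the best-matching external documents.

    Matching by the number of shared terms is an assumption made for brevity.
    """
    item_terms = set(tokenize(sparse_item))
    ranked = sorted(external, key=lambda d: -len(item_terms & set(tokenize(d))))
    return " ".join([sparse_item] + ranked[:top_n])
```

In this sketch, the output of `select_terms` would serve as a short query built from a news article, and the output of `enrich` as an expanded representation of a sparsely annotated multimedia item, so that both sides of a link are compared at a similar level of textual richness.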


Target Item, Term Selection, News Article, Mean Average Precision, Source Archive
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Marc Bron (1)
  • Bouke Huurnink (1)
  • Maarten de Rijke (1)
  1. ISLA, University of Amsterdam, Amsterdam, The Netherlands
