Simple Semantics in Topic Detection and Tracking

Abstract

Topic Detection and Tracking (TDT) is a research initiative that develops techniques for organizing news documents by the news events they report. We propose a method that incorporates simple semantics into TDT by splitting the term space into groups of terms whose meanings are of the same type. Each such group can be associated with an external ontology, which is used to determine the similarity of two terms within the group. We extract proper names, locations, temporal expressions and normal terms into distinct sub-vectors of the document representation. The similarity of two documents is measured by comparing their corresponding sub-vectors one pair at a time. We use a simple perceptron to optimize the relative emphasis of each semantic class in the tracking and detection decisions. The results suggest that the spatial and temporal similarity measures need to be improved; in particular, the vagueness of spatial and temporal terms needs to be addressed.
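The class-wise comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class keys, the function names, the learning rate and the toy documents are all hypothetical, and plain cosine similarity stands in for the ontology-based term similarity, which the abstract does not specify.

```python
import math

# The four semantic classes named in the abstract; the key strings are illustrative.
CLASSES = ("names", "locations", "temporals", "terms")


def cosine(a, b):
    """Cosine similarity of two sparse term-weight dicts (assumed stand-in measure)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def class_similarities(doc_a, doc_b):
    """Compare corresponding sub-vectors one pair at a time, one score per class."""
    return [cosine(doc_a.get(c, {}), doc_b.get(c, {})) for c in CLASSES]


def train_perceptron(examples, epochs=20, lr=0.1):
    """Learn one weight per semantic class from (similarities, label) pairs.

    Labels are +1 (same topic) or -1 (different topic).
    """
    weights = [0.0] * len(CLASSES)
    bias = 0.0
    for _ in range(epochs):
        for sims, label in examples:
            score = sum(w * s for w, s in zip(weights, sims)) + bias
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # Standard perceptron update on a misclassified example.
                weights = [w + lr * label * s for w, s in zip(weights, sims)]
                bias += lr * label
    return weights, bias


def same_topic(doc_a, doc_b, weights, bias):
    """Decision: weighted sum of the class-wise similarities against zero."""
    sims = class_similarities(doc_a, doc_b)
    return sum(w * s for w, s in zip(weights, sims)) + bias > 0


# Toy documents, each split into sub-vectors by semantic class (invented data).
doc_a = {"names": {"clinton": 1.0}, "terms": {"election": 2.0}}
doc_b = {"names": {"clinton": 1.0}, "terms": {"election": 2.0}}
doc_c = {"locations": {"helsinki": 1.0}, "terms": {"weather": 1.0}}

training = [
    (class_similarities(doc_a, doc_b), 1),
    (class_similarities(doc_a, doc_c), -1),
]
weights, bias = train_perceptron(training)
```

With weights learned this way, `same_topic(doc_a, doc_b, weights, bias)` yields the tracking decision for a document pair. Keeping one similarity score per class is what allows the perceptron to raise or lower the emphasis of, say, locations relative to ordinary terms.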

Makkonen, J., Ahonen-Myka, H. & Salmenkivi, M. Simple Semantics in Topic Detection and Tracking. Information Retrieval 7, 347–368 (2004). https://doi.org/10.1023/B:INRT.0000011210.12953.86

Keywords

  • topic detection and tracking
  • retrieval model
  • information extraction
  • temporal expression
  • geographical ontology