Abstract
Topic Detection and Tracking (TDT) is a research initiative that aims to develop techniques for organizing news documents by news event. We propose a method that incorporates simple semantics into TDT by splitting the term space into groups of terms that share the same type of meaning. Each such group can be associated with an external ontology, which is used to determine the similarity of two terms in the group. We extract proper names, locations, temporal expressions and normal terms into distinct sub-vectors of the document representation. The similarity of two documents is measured by comparing their corresponding sub-vectors one pair at a time. We use a simple perceptron to optimize the relative emphasis of each semantic class in the tracking and detection decisions. The results suggest that the spatial and temporal similarity measures need to be improved; in particular, the vagueness of spatial and temporal terms needs to be addressed.
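The class-wise comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class labels, function names and toy documents are hypothetical, and plain cosine similarity stands in for every class, whereas the paper associates the spatial and temporal classes with ontology-based measures. A basic perceptron then learns how much weight each class-wise similarity should receive.

```python
import math
from collections import Counter

# Hypothetical labels for the four semantic classes named in the abstract.
SEMANTIC_CLASSES = ("names", "locations", "temporals", "terms")

def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def class_similarities(doc_a, doc_b):
    """Compare two documents one pair of corresponding sub-vectors at a time.

    Assumption: cosine is a stand-in for all classes; an ontology-based
    measure could replace it for locations and temporal expressions.
    """
    return [cosine(doc_a.get(c, Counter()), doc_b.get(c, Counter()))
            for c in SEMANTIC_CLASSES]

def train_perceptron(examples, epochs=20, rate=0.1):
    """Learn one weight per semantic class from (similarities, label) pairs.

    Labels are +1 (same event) or -1 (different event).
    """
    weights = [0.0] * len(SEMANTIC_CLASSES)
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            score = sum(w * x for w, x in zip(weights, features)) + bias
            predicted = 1 if score > 0 else -1
            if predicted != label:
                weights = [w + rate * label * x
                           for w, x in zip(weights, features)]
                bias += rate * label
    return weights, bias

def same_event(doc_a, doc_b, weights, bias):
    """Decide whether two documents discuss the same event."""
    features = class_similarities(doc_a, doc_b)
    return sum(w * x for w, x in zip(weights, features)) + bias > 0

# Toy documents, invented purely for illustration.
doc1 = {"names": Counter({"Clinton": 2}), "locations": Counter({"Washington": 1}),
        "temporals": Counter({"2003-09-19": 1}),
        "terms": Counter({"election": 3, "vote": 1})}
doc2 = {"names": Counter({"Clinton": 1}), "locations": Counter({"Washington": 1}),
        "temporals": Counter({"2003-09-19": 1}),
        "terms": Counter({"election": 1, "vote": 2})}
doc3 = {"names": Counter({"Nokia": 1}), "locations": Counter({"Helsinki": 1}),
        "temporals": Counter({"2003-01-01": 1}),
        "terms": Counter({"phone": 2, "market": 1})}

training = [(class_similarities(doc1, doc2), 1),
            (class_similarities(doc1, doc3), -1)]
weights, bias = train_perceptron(training)
```

Keeping one similarity score per class, rather than a single score over the whole vocabulary, is what lets the learned weights express the relative emphasis of names, locations, temporal expressions and ordinary terms.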
Cite this article
Makkonen, J., Ahonen-Myka, H. & Salmenkivi, M. Simple Semantics in Topic Detection and Tracking. Information Retrieval 7, 347–368 (2004). https://doi.org/10.1023/B:INRT.0000011210.12953.86