Using temporal bursts for query modeling

Abstract

We present an approach to query modeling that leverages the temporal distribution of documents in an initially retrieved set of documents. In news-related document collections such distributions tend to exhibit bursts. Here, we define a burst to be a time period where unusually many documents are published. In our approach we detect bursts in result lists returned for a query. We then model the term distributions of the bursts using a reduced result list and select its most descriptive terms. Finally, we merge the sets of terms obtained in this manner so as to arrive at a reformulation of the original query. For query sets that consist of both temporal and non-temporal queries, our query modeling approach incorporates an effective term selection method. We consistently and significantly improve over various baselines, such as relevance models, on both news collections and a collection of blog posts.
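
The pipeline summarized above (detect bursts, select descriptive terms per burst, merge them into a reformulated query) can be illustrated with a minimal sketch. This is not the method evaluated in the article. It assumes that each retrieved document is a (day index, token list) pair, that a day is bursty when its document count exceeds the mean plus `threshold` standard deviations, and that raw term frequency stands in for term descriptiveness. The names `detect_bursts`, `burst_terms`, and `expand_query` are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def detect_bursts(days, threshold=1.0):
    """Return (start, end) day ranges with unusually many documents.

    Assumption: "unusually many" means count > mean + threshold * std,
    computed over every day between the first and last document.
    """
    if not days:
        return []
    counts = Counter(days)
    lo, hi = min(days), max(days)
    series = [counts.get(d, 0) for d in range(lo, hi + 1)]
    cutoff = mean(series) + threshold * pstdev(series)

    bursts, start = [], None
    for offset, count in enumerate(series):
        day = lo + offset
        if count > cutoff:
            if start is None:
                start = day  # a new burst opens on this day
        elif start is not None:
            bursts.append((start, day - 1))  # burst closed on the previous day
            start = None
    if start is not None:
        bursts.append((start, hi))
    return bursts


def burst_terms(docs, burst, top_n=3):
    """Most frequent terms among documents published inside one burst."""
    start, end = burst
    tf = Counter()
    for day, tokens in docs:
        if start <= day <= end:
            tf.update(tokens)
    return [term for term, _ in tf.most_common(top_n)]


def expand_query(query_terms, docs, threshold=1.0, top_n=3):
    """Merge the per-burst term sets into the original query terms."""
    expanded = list(query_terms)
    for burst in detect_bursts([day for day, _ in docs], threshold):
        for term in burst_terms(docs, burst, top_n):
            if term not in expanded:
                expanded.append(term)
    return expanded
```

The sketch merges term sets without weights; a weighted query model would additionally need a probability for each term, which is outside what this example attempts.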

Notes

  1. We assume that R(D) takes values between 0 and 1.

  2. Burst \(B_1\) is maximal if there is no burst \(B_2\) such that \(B_1 \subseteq B_2\) and \(B_1 \neq B_2\).

  3. We use the following indicators: number of pronouns, amount of punctuation, number of emoticons used, amount of shouting, whether capitalization was used, the length of the post, and correctness of spelling.

  4. See http://odur.let.rug.nl/%7Evannoord/TextCat/.

  5. http://en.wikipedia.org/wiki/Category:Events.

  6. The classes are occurrence, perception, reporting, aspectual, state, i_state, and i_action.

  7. We used the standard settings of GibbsLDA++ (http://gibbslda.sourceforge.net/), with 10 clusters.

  8. We considered the following ranges: \(\gamma \in \{-1, -0.9,\ldots,-0.1, -0.09, \ldots, -0.01, \ldots, -0.001, \ldots, -0.0001\}, k \in \{2, 4, 6, 8, 10, 20, 30, 50\}, \) and \(\alpha \in \{0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5\}.\)

  9. Numb3rs was an American crime drama television series that ran in the US between 2005 and 2010.
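
The maximality condition in Note 2 can be made concrete with a small sketch. It assumes, for illustration only, that each burst is represented as a set of day indices; the helper name `maximal_bursts` is hypothetical.

```python
def maximal_bursts(bursts):
    """Keep only bursts that are not strictly contained in another burst.

    Assumption: every burst is an iterable of hashable time points,
    for example integer day indices.
    """
    sets = [frozenset(b) for b in bursts]
    unique = list(dict.fromkeys(sets))  # drop exact duplicates, keep order
    # For frozensets, `a < b` tests whether a is a proper subset of b.
    return [b for b in unique if not any(b < other for other in unique)]
```

Duplicates are removed first because two identical bursts do not violate the condition \(B_1 \neq B_2\) and would otherwise both be reported.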

References

  • Alonso, O., Strötgen, J., Baeza-Yates, R., & Gertz, M. (2011). Temporal information retrieval: Challenges and opportunities. In Proceedings of the 1st international temporal web analytics workshop (TWAW 2011), pp. 1–8.

  • Amodeo, G., Amati, G., & Gambosi, G. (2011). On relevance, time and query expansion. In CIKM ’11: Proceedings of the 20th ACM international conference on Information and knowledge management (pp. 1973–1976). New York, NY: ACM.

  • Balog, K., Weerkamp, W., & de Rijke, M. (2008). A few examples go a long way: Constructing query models from elaborate query formulations. In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval (pp. 371–378). New York, NY: ACM. ISBN 978-1-60558-164-4.

  • Balog, K., Bron, M., & de Rijke, M. (2010). Category-based query modeling for entity search. In ECIR 2010: 32nd European conference on information retrieval, pp. 319–331.

  • Berberich, K., Bedathur, S., Alonso, O., & Weikum, G. (2010). A language modeling approach for temporal information needs. In ECIR 2010: 32nd European conference on information retrieval. Berlin: Springer.

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(4-5), 993–1022.

  • Bron, M., Balog, K., & de Rijke, M. (2010). Ranking related entities: Components and analyses. In CIKM ’10: 19th ACM international conference on information and knowledge management, Toronto: ACM.

  • Chien, S., & Immorlica, N. (2005). Semantic similarity between search engine queries using temporal correlation. In Proceedings of the 14th international conference on World Wide Web (WWW ’05), (pp. 2–11). New York, NY: ACM.

  • Corso, G. M. D., Gullí, A., & Romani, F. (2005). Ranking a stream of news. In Proceedings of the 14th international conference on the World Wide Web (WWW ’05).

  • Cover, T. M., & Hart, P. E. (1967). Nearest neighbour pattern classification. IEEE Transactions on Information Theory, 13, 21–27.

  • Dakka, W., Gravano, L., & Ipeirotis, P. G. (2012). Answering general time-sensitive queries. IEEE Transactions on Knowledge and Data Engineering, 24(2), 220–235.

  • Diaz, F. & Metzler, D. (2006). Improving the estimation of relevance models using large external corpora. In SIGIR ’06: 29th annual international ACM SIGIR conference on research & development on information retrieval, pp. 154–161.

  • Dong, A., Zhang, R., Kolari, P., Bai, J., Diaz, F., Chang, Y., Zheng, Z., & Zha, H. (2010). Time is of the essence: Improving recency ranking using Twitter data. In Proceedings of the 19th international conference on World Wide Web (WWW ’10) (pp. 331–340). New York, NY: ACM.

  • Efron, M. (2010). Linear time series models for term weighting in information retrieval. Journal of the American Society for Information Science and Technology, 61(7), 1299–1312.

  • Efron, M., & Golovchinsky, G. (2011). Estimation methods for ranking recent information. In SIGIR ’11: 34th annual international ACM SIGIR conference on research & development on information retrieval, pp. 495–504.

  • Hamilton, J. D. (1994). Time-series analysis, 1st edn. Princeton, NJ: Princeton University Press.

  • Hofmann, K., & Weerkamp, W. (2008). Content extraction for information retrieval in blogs and intranets. Technical report, University of Amsterdam.

  • Jaleel, N. A., Allan, J., Croft, W. B., Diaz, F., Larkey, L. S., Li, X., Smucker, M. D., & Wade, C. (2004). UMass at TREC 2004: Novelty and hard. In TREC 2004.

  • Java, A., Kolari, P., Finin, T., Joshi, A., & Martineau, J. (2006). The BlogVox opinion retrieval system. In TREC 2006.

  • Jones, R., & Diaz, F. (2007). Temporal profiles of queries. ACM Transactions on Information Systems, 25.

  • Kamps, J. (2004). Improving retrieval effectiveness by reranking documents based on controlled vocabulary. In Advances in information retrieval: 26th European conference on IR research (ECIR 2004), (pp. 283–295). Heidelberg: Springer.

  • Keikha, M., Gerani, S., & Crestani, F. (2011a). Time-based relevance models. In SIGIR ’11: Proceedings of the 34th international ACM SIGIR conference on research and development in Information (pp. 1087–1088). New York, NY: ACM.

  • Keikha, M., Gerani, S., & Crestani, F. (2011b). Temper: a temporal relevance feedback method. In ECIR 2011: 33rd European conference on information retrieval.

  • Kleinberg, J. M. (2002). Bursty and hierarchical structure in streams. In KDD ’02: The eighth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp. 91–101.

  • Kulkarni, A., Teevan, J., Svore, K. M., & Dumais, S. T. (2011). Understanding temporal query dynamics. In WSDM ’11: The fourth ACM international conference on Web search and data mining. ACM.

  • Lavrenko, V., & Croft, W. B. (2001). Relevance based language models. In SIGIR ’01: Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval, (pp. 120–127). New York, NY: ACM.

  • Li, X., & Croft, W. B. (2003). Time-based language models. In CIKM ’03: International conference on information and knowledge management.

  • Macdonald, C., & Ounis, I. (2006). The TREC blogs06 collection: Creating and analyzing a blog test collection. Technical report TR-2006-224, U. Glasgow.

  • Manning, C., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.

  • Martins, B., Manguinhas, H., & Borbinha, J. (2008). Extracting and exploring the geo-temporal semantics of textual resources. In Proceedings of the 2008 IEEE international conference on semantic computing, (pp. 1–9). Washington, DC: IEEE Computer Society.

  • Massoudi, K., Tsagkias, E., de Rijke, M., & Weerkamp, W. (2011). Incorporating query expansion and quality indicators in searching microblog posts. In ECIR 2011: 33rd European conference on information retrieval.

  • Meij, E., & de Rijke, M. (2010). Supervised query modeling using Wikipedia. In SIGIR ’10: Proceedings of the 33rd annual international ACM SIGIR conference on research and development in information retrieval, ACM.

  • Meij, E., Trieschnigg, D., de Rijke, M., & Kraaij, W. (2010). Conceptual language models for domain-specific retrieval. Information Processing and Management, 46(4), 448–469.

  • Odijk, D., de Rooij, O., Peetz, M.-H., Pieters, T., de Rijke, M., & Snelders, S. (2012). Semantic document selection. Historical research on collections that span multiple centuries. In Research and advanced technology for digital libraries: International conference on theory and practice of digital libraries, TPDL 2012, Cyprus.

  • Ounis, I., de Rijke, M., Macdonald, C., Mishne, G., & Soboroff, I. (2006). Overview of the TREC-2006 blog track. In TREC 2006, Gaithersburg.

  • Peetz, M.-H., & de Rijke, M. (2013). Cognitive temporal document priors. In 34th European conference on information retrieval (ECIR’13).

  • Peetz, M.-H., Meij, E., de Rijke, M., & Weerkamp, W. (2012). Adaptive temporal query modeling. In ECIR 2012: 34th European conference on information retrieval.

  • Ponte, J. M., & Croft, W. B. (1998). A language modeling approach to information retrieval. In SIGIR ’98: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, pp. 275–281.

  • Pustejovsky, J., Castaño, J. M., Ingria, R., Sauri, R., Gaizauskas, R. J., Setzer, A., Katz, G., & Radev, D. R. (2003). TimeML: Robust specification of event and temporal expressions in text. In New directions in question answering, pp. 28–34.

  • Qiu, Y., & Frei, H.-P. (1993). Concept based query expansion. In SIGIR ’93: Proceedings of the 16th annual international ACM-SIGIR conference on research and development in information retrieval, ACM, pp. 160–169.

  • Rocchio, J. J. (1971). Relevance feedback in information retrieval. In G. Salton (Ed.), The SMART retrieval system—experiments in automatic document processing, (pp. 313–323). Prentice Hall, Englewood Cliffs, NJ.

  • Seki, K., Kino, Y., Sato, S., & Uehara, K. (2007). TREC 2007 blog track experiments at Kobe University. In TREC 2007.

  • Tsagkias, M., Weerkamp, W., & de Rijke, M. (2010). News comments: Exploring, modeling, and online prediction. In C. Gurrin, Y. He, G. Kazai, U. Kruschwitz, S. Little, T. Roelleke, S. Rüger, & K. van Rijsbergen (Eds.), Advances in information retrieval. Lecture notes in computer science (Vol. 5993, pp. 191–203). Berlin, Heidelberg: Springer.

  • Vendler, Z. (1957). Verbs and times. The Philosophical Review, 66(2).

  • Verhagen, M., & Pustejovsky, J. (2008). Temporal processing with the TARSQI toolkit. In 22nd international conference on computational linguistics: Demonstration papers, COLING ’08 (pp. 189–192). Stroudsburg, PA: Association for Computational Linguistics.

  • Wang, X., Zhai, C., Hu, X., & Sproat, R. (2007). Mining correlated bursty topic patterns from coordinated text streams. In KDD ’07: The 13th ACM SIGKDD international conference on knowledge discovery and data mining.

  • Weerkamp, W., & de Rijke, M. (2008). Credibility improves topical blog post retrieval. In Proceedings of ACL-08: HLT, (pp. 923–931). Columbus, OH: ACL.

  • Weerkamp, W., & de Rijke, M. (2012). Credibility-inspired ranking for blog post retrieval. Information Retrieval Journal, 15(3–4), 243–277.

  • Weerkamp, W., Balog, K., & de Rijke, M. (2009). A generative blog post retrieval model that uses query expansion based on external collections. In Joint conference of the 47th annual meeting of the association for computational linguistics and the 4th international joint conference on natural language processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2009), pp. 1057–1065.

  • Weerkamp, W., Balog, K., & de Rijke, M. (2012). Exploiting external collections for query expansion. ACM Transactions on the Web, 6(4), Article 18.

  • Zhai, C., & Lafferty, J. (2001). Model-based feedback in the language modeling approach to information retrieval. In CIKM ’01: Tenth international conference on information and knowledge management, pp. 403–410.

  • Zhai, C., & Lafferty, J. (2004). A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2), 179–214.

  • Zhang, W., & Yu, C. (2006). UIC at TREC 2006 blog track. In TREC 2006.

Acknowledgments

We are grateful to our reviewers for providing valuable feedback and suggestions. This research was partially supported by the European Union’s ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme, CIP ICT-PSP under grant agreement nr 250430, the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreements nr 258191 (PROMISE Network of Excellence) and 288024 (LiMoSINe project), the Netherlands Organisation for Scientific Research (NWO) under project nrs 612.061.814, 612.061.815, 640.004.802, 727.011.005, 612.001.116, HOR-11-10, the Center for Creation, Content and Technology (CCCT), the BILAND project funded by the CLARIN-nl program, the Dutch national program COMMIT, the ESF Research Network Program ELIAS, the Elite Network Shifts project funded by the Royal Dutch Academy of Sciences (KNAW), and the Netherlands eScience Center under project number 027.012.105.

Author information

Corresponding author

Correspondence to Maria-Hendrike Peetz.

Additional information

An earlier version of this article appeared as Peetz et al. (2012). This substantially extended version adds a novel, non-uniform burst prior and evaluates it carefully. We extend the query models presented in Peetz et al. (2012) with a new method to estimate the temporal distribution, incorporate this method into the query modeling approach from Peetz et al. (2012), and compare it with algorithms for temporal information retrieval. We also evaluate the influence of different test collections.

Appendices

Appendix 1: Query sets used

Recent-1

The query set used by Li and Croft (2003), named recent-1 in this work:

  • TREC-7, 8 test set: 346, 400, 301, 356, 311, 337, 389, 307, 326, 329, 316, 376, 357, 387, 320, 347;

  • TREC-7, 8 training set: 302, 304, 306, 319, 321, 330, 333, 334, 340, 345, 351, 352, 355, 370, 378, 382, 385, 391, 395, 396.

Recent-2

The query set used by Efron and Golovchinsky (2011), named recent-2 in this work:

  • TREC-2: 104, 116, 117, 122, 132, 133, 137, 139, 140, 148, 154, 164, 174, 175, 188, 192, 195, 196, 199, 200;

  • TREC-6, training set: 06, 307, 311, 316, 319, 320, 321, 324, 326, 329, 331, 334, 337, 339, 340, 345, 346;

  • TREC-7/TREC-8, test set: 351, 352, 357, 373, 376, 378, 387, 389, 391, 401, 404, 409, 410, 414, 416, 421, 428, 434, 437, 443, 445, 446, 449, 450.

Temporal

The query set used by Dakka et al. (2012), named temporal-t in this work:

  • TREC-6, training set: 301, 302, 306, 307, 311, 313, 315, 316, 318, 319, 320, 321, 322, 323, 324, 326, 329, 330, 331, 332, 333, 334, 337, 340, 341, 343, 345, 346, 347, 349, 350;

  • TREC-7, test set: 352, 354, 357, 358, 359, 360, 366, 368, 372, 374, 375, 376, 378, 383, 385, 388, 389, 390, 391, 392, 393, 395, 398, 399, 400;

  • TREC-8, test set: 401, 402, 404, 407, 408, 409, 410, 411, 412, 418, 420, 421, 422, 424, 425, 427, 428, 431, 432, 434, 435, 436, 437, 438, 439, 442, 443, 446, 448, 450.

Manually selected queries with an underlying temporal information need for TREC-Blog, named temporal-b in this work:

  • Blog06: 947, 943, 938, 937, 936, 933, 928, 925, 924, 923, 920, 919, 918, 917, 915, 914, 913, 907, 906, 905, 904, 903, 899, 897, 896, 895, 892, 891, 890, 888, 887, 886, 882, 881, 879, 875, 874, 871, 870, 869, 867, 865, 864, 862, 861, 860, 859, 858, 857, 856, 855, 854, 853, 851, 1050, 1043, 1040, 1034, 1032, 1030, 1029, 1028, 1026, 1024, 1021, 1020, 1019, 1017, 1016, 1015, 1014, 1012, 1011, 1009.

Appendix 2: Vendler classes of the queries

The classes are based on the verb classes introduced by Vendler (1957).

TREC-2

  • State: 101, 102, 103, 106, 107, 109, 112, 113, 116, 117, 118, 120, 124, 126, 132, 133, 134, 135, 143, 147, 151, 153, 157, 158, 160, 161, 163, 166, 169, 171, 177, 179, 184, 185, 186, 189, 193, 194

  • Action: 104, 108, 115, 119, 123, 125, 136, 138, 139, 150, 152, 164, 165, 168, 173, 176

  • Achievement: 105, 114, 121, 122, 128, 130, 137, 141, 142, 145, 146, 155, 156, 159, 162, 167, 170, 172, 174, 180, 182, 183, 187, 188, 191, 192, 196, 197, 198

  • Accomplishment: 110, 111, 127, 129, 131, 140, 144, 148, 149, 154, 175, 178, 181, 190, 195, 199, 200

TREC-6

  • State: 302, 304, 305, 307, 308, 310, 313, 315, 316, 318, 320, 321, 333, 334, 335, 338, 339, 341, 344, 346, 348, 349, 350

  • Action: 301, 312, 314, 319, 324, 325, 327, 330, 331, 340, 345, 347

  • Achievement: 303, 306, 309, 317, 329, 332, 337

  • Accomplishment: 311, 322, 323, 326, 328, 336, 342, 343

TREC-{7, 8}

  • State: 356, 359, 360, 361, 366, 368, 369, 370, 371, 372, 373, 377, 378, 379, 380, 383, 385, 387, 391, 392, 396, 401, 403, 413, 414, 415, 416, 417, 419, 420, 421, 423, 426, 427, 428, 432, 433, 434, 438, 441, 443, 444, 445, 446, 449

  • Action: 351, 353, 357, 381, 382, 386, 388, 394, 399, 400, 402, 406, 407, 409, 411, 412, 418, 435, 437, 440, 448, 450

  • Achievement: 352, 355, 365, 376, 384, 390, 395, 398, 410, 425, 442

  • Accomplishment: 354, 358, 362, 363, 364, 367, 374, 375, 389, 393, 397, 404, 405, 408, 422, 424, 429, 430, 431, 436, 439, 447

Blog06

  • State: 851, 854, 855, 862, 863, 866, 872, 873, 877, 879, 880, 882, 883, 885, 888, 889, 891, 893, 894, 896, 897, 898, 899, 900, 901, 902, 903, 904, 908, 909, 910, 911, 912, 915, 916, 917, 918, 919, 920, 924, 926, 929, 930, 931, 934, 935, 937, 939, 940, 941, 944, 945, 946, 947, 948, 949, 950, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1014, 1016, 1017, 1019, 1020, 1022, 1023, 1024, 1025, 1026, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1038, 1039, 1040, 1041, 1043, 1044, 1046, 1047, 1049, 1050

  • Action: 852, 853, 857, 858, 859, 860, 861, 864, 868, 869, 870, 871, 874, 875, 876, 881, 884, 886, 887, 890, 892, 895, 905, 906, 907, 913, 914, 921, 922, 925, 927, 928, 933, 936, 938, 942, 1001, 1018, 1021, 1036, 1037, 1045, 1048

  • Accomplishment: 865, 878, 932, 943, 1013, 1015, 1027

  • Achievement: 856, 867, 923, 1028, 1042

Appendix 3: Additional graphs

See Fig. 9.

About this article

Cite this article

Peetz, MH., Meij, E. & de Rijke, M. Using temporal bursts for query modeling. Inf Retrieval 17, 74–108 (2014). https://doi.org/10.1007/s10791-013-9227-2

  • DOI: https://doi.org/10.1007/s10791-013-9227-2
