Semistructured Data Search

  • Krisztian Balog

Abstract

This paper presents a selection of methods for searching in heterogeneous data collections where some amount of structure is available. We start with a general retrieval framework, based on generative probabilistic modeling, for ranking unstructured document representations. Then, we consider structure at two different levels: documents and queries. For documents, the internal structure is captured through the use of multiple document fields, and various approaches to setting field weights are discussed. For queries, the focus is on effectively utilizing additional input data that the user might provide along with the keyword query, such as target categories or example documents. We place a particular emphasis on methods that are robust with respect to the availability of structured data and are able to deal with inconsistent or incomplete information.

Keywords

Semistructured data, generative probabilistic models, document modeling, query modeling

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Krisztian Balog, University of Stavanger, Stavanger, Norway
