The VLDB Journal, Volume 20, Issue 4, pp 617–640

Efficient fuzzy full-text type-ahead search

Regular Paper

Abstract

Traditional information systems return answers after a user submits a complete query. Users often feel “left in the dark” when they have limited knowledge about the underlying data and have to use a try-and-see approach for finding information. A recent trend of supporting autocomplete in these systems is a first step toward solving this problem. In this paper, we study a new information-access paradigm, called “type-ahead search,” in which the system searches the underlying data “on the fly” as the user types in query keywords. It extends autocomplete interfaces by allowing keywords to appear at different places in the underlying data. This framework allows users to explore data as they type, even in the presence of minor errors. We study research challenges in this framework for large amounts of data. Since each keystroke of the user could invoke a query on the backend, we need efficient algorithms to process each query within milliseconds. We develop various incremental-search algorithms for both single-keyword queries and multi-keyword queries, using previously computed and cached results in order to achieve a high interactive speed. We develop novel techniques to support fuzzy search by allowing mismatches between query keywords and answers. We have deployed several real prototypes using these techniques. One of them supports type-ahead search on the UC Irvine people directory; it has been used regularly and has been well received by users for its friendly interface and high efficiency.
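The abstract does not spell out the algorithms, so the following is only a minimal sketch of the general idea it describes: a keyword matches the query typed so far if some prefix of the keyword is within a small edit distance of it, and each keystroke reuses the result cached from the previous keystroke. The class name `FuzzyTypeAhead`, the `max_errors` parameter, and the flat keyword list are assumptions made for illustration. The sketch uses a plain dynamic-programming row per keyword and handles a single query keyword only; it is not the paper's method.

```python
from typing import Dict, List


class FuzzyTypeAhead:
    """Toy fuzzy type-ahead matcher over a fixed keyword list (illustrative only)."""

    def __init__(self, keywords: List[str], max_errors: int = 1) -> None:
        self.keywords = keywords
        self.max_errors = max_errors  # assumed edit-distance threshold
        self.query = ""
        # Cached DP row per keyword: rows[w][j] is the edit distance between
        # the query typed so far and the first j characters of w.
        self.rows: Dict[str, List[int]] = {
            w: list(range(len(w) + 1)) for w in keywords
        }

    def type_char(self, ch: str) -> List[str]:
        """Extend the query by one character and return the matching keywords."""
        self.query += ch
        for w in list(self.rows):
            prev = self.rows[w]
            # Matching the longer query against the empty prefix costs one more.
            cur = [prev[0] + 1]
            for j in range(1, len(w) + 1):
                cost = 0 if w[j - 1] == ch else 1
                cur.append(min(prev[j] + 1,          # drop the new query char
                               cur[j - 1] + 1,       # skip a keyword char
                               prev[j - 1] + cost))  # match or substitute
            self.rows[w] = cur  # only the latest row is kept per keyword
        return self.matches()

    def matches(self) -> List[str]:
        """Keywords having some prefix within max_errors edits of the query."""
        return [w for w in self.keywords
                if min(self.rows[w]) <= self.max_errors]


if __name__ == "__main__":
    search = FuzzyTypeAhead(["university", "irvine", "people", "directory"])
    for ch in "irvn":  # a mistyped prefix of "irvine"
        hits = search.type_char(ch)
    print(hits)
```

Each keystroke costs time proportional to the total length of all keywords, which is acceptable for a toy list but not for large data sets. Note also that while the query is no longer than `max_errors`, every keyword matches, because the empty prefix is already within the threshold.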

Keywords

Autocomplete, Full-text search, Type-ahead search, Fuzzy search

Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  • Guoliang Li (1)
  • Shengyue Ji (2)
  • Chen Li (2)
  • Jianhua Feng (1)
  1. Department of Computer Science, Tsinghua University, Beijing, China
  2. Department of Computer Science, University of California, Irvine, USA
