The VLDB Journal, Volume 24, Issue 1, pp 143–167

A unified framework for approximate dictionary-based entity extraction

  • Dong Deng
  • Guoliang Li
  • Jianhua Feng
  • Yi Duan
  • Zhiguo Gong
Regular Paper

Abstract

Dictionary-based entity extraction identifies predefined entities (e.g., person names or locations) in documents. A recent trend for improving extraction recall is to support approximate entity extraction, which finds all substrings of documents that approximately match entities in a given dictionary. Existing methods for this problem support either token-based similarity (e.g., Jaccard similarity) or character-based dissimilarity (e.g., edit distance). This calls for a unified method that supports various similarity/dissimilarity functions, since a unified method can reduce programming effort, hardware requirements, and manpower. In this paper, we propose a unified framework that supports various similarity/dissimilarity functions, such as Jaccard similarity, cosine similarity, Dice similarity, edit similarity, and edit distance. Since many real-world applications have high-performance requirements for approximate entity extraction on data streams (e.g., Twitter), we focus on devising efficient algorithms. We observe that many substrings in documents overlap, and that computation can be shared across the overlaps to avoid redundant work. To this end, we propose efficient filtering algorithms and develop effective pruning techniques. Experimental results show that our method achieves high performance and significantly outperforms state-of-the-art methods.
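
To make the problem statement concrete, the following is a minimal Python sketch of the five similarity/dissimilarity functions named above, together with a brute-force extractor that enumerates token windows of a document. It is not the filtering or pruning method proposed in the paper; it only illustrates the task that such methods accelerate. The function names, the whitespace tokenizer, the `max_extra_tokens` parameter, and the definition of edit similarity as 1 − ed(s, t) / max(|s|, |t|) are assumptions made for illustration.

```python
from math import sqrt


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets: |a ∩ b| / |a ∪ b|."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def cosine(a: set, b: set) -> float:
    """Cosine similarity of two token sets: |a ∩ b| / sqrt(|a| * |b|)."""
    denom = sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 1.0


def dice(a: set, b: set) -> float:
    """Dice similarity of two token sets: 2 * |a ∩ b| / (|a| + |b|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # delete cs
                            curr[j - 1] + 1,             # insert ct
                            prev[j - 1] + (cs != ct)))   # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(s: str, t: str) -> float:
    """Assumed definition: 1 - ed(s, t) / max(|s|, |t|)."""
    longest = max(len(s), len(t))
    return 1 - edit_distance(s, t) / longest if longest else 1.0


def extract_naive(document, dictionary, sim, threshold, max_extra_tokens=1):
    """Brute-force baseline (hypothetical helper, not the paper's algorithm).

    Compares every token window of the document against every entity and
    keeps the pairs whose token-set similarity reaches the threshold.
    """
    tokens = document.lower().split()
    results = []
    for entity in dictionary:
        entity_tokens = set(entity.lower().split())
        max_len = len(entity_tokens) + max_extra_tokens
        for start in range(len(tokens)):
            last = min(start + max_len, len(tokens))
            for end in range(start + 1, last + 1):
                window = tokens[start:end]
                score = sim(set(window), entity_tokens)
                if score >= threshold:
                    results.append((" ".join(window), entity, score))
    return results


if __name__ == "__main__":
    matches = extract_naive("a talk by dong deng at vldb", ["dong deng"],
                            jaccard, threshold=0.6)
    for substring, entity, score in matches:
        print(f"{substring!r} ~ {entity!r} (score {score:.2f})")
```

Adjacent windows in this baseline share most of their tokens, so the same intersections are recomputed many times; this is the redundancy across overlapping substrings that the abstract refers to.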

Keywords

Approximate entity extraction · Unified framework · Filtering algorithms · Pruning techniques

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Dong Deng (1)
  • Guoliang Li (1)
  • Jianhua Feng (1)
  • Yi Duan (2)
  • Zhiguo Gong (3)

  1. Department of Computer Science and Technology, Tsinghua University, Beijing, China
  2. School of Software, Beihang University, Beijing, China
  3. University of Macau, Macau, China
