Skip to main content
Log in

A unified framework for approximate dictionary-based entity extraction

  • Regular Paper
  • Published:
The VLDB Journal Aims and scope Submit manuscript

Abstract

Dictionary-based entity extraction identifies predefined entities (e.g., person names or locations) from documents. A recent trend for improving extraction recall is to support approximate entity extraction, which finds all substrings from documents that approximately match entities in a given dictionary. Existing methods to address this problem support either token-based similarity (e.g., Jaccard Similarity) or character-based dissimilarity (e.g., Edit Distance). It calls for a unified method to support various similarity/dissimilarity functions, since a unified method can reduce the programing efforts, the hardware requirements, and the manpower. In this paper, we propose a unified framework to support various similarity/dissimilarity functions, such as jaccard similarity, cosine similarity, dice similarity, edit similarity, and edit distance. Since many real-world applications have high-performance requirement for approximate entity extraction on data streams (e.g., Twitter), we focus on devising efficient algorithms to achieve high performance. We find that many substrings in documents have overlaps, and we can utilize the shared computation across the overlaps to avoid unnecessary redundant computation. To this end, we propose efficient filtering algorithms and develop effective pruning techniques. Experimental results show our method achieves high performance and outperforms state-of-the-art studies significantly.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20
Fig. 21
Fig. 22
Fig. 23
Fig. 24
Fig. 25
Fig. 26
Fig. 27
Fig. 28
Fig. 29
Fig. 30

Similar content being viewed by others

Notes

  1. In this paper, we omit the proof due to space constraints.

  2. In this paper, we take \(e\) and \(s\) as multisets, since there may exist duplicate tokens in entities and substrings of the document. Even if they are taken as sets, we can also use our method for extraction.

  3. For ease of presentation, we use a loser tree to represent a heap structure in our examples.

  4. Note that, we can get entity \(e\)’s token number \(|e|\) using a hash map, which keeps the pair of an entity and its token number, thus we can get the token number of an entity in \(\mathcal {O}(1)\) time complexity.

  5. As \(D[p_i\cdots p_j]\) may contain duplicate tokens, \(|P_{e}[i \cdots j]| \ge |e\cap s|\) and \(|P_{e}[i \cdots j]|\) may also be larger than \(|e|\).

  6. Notice that we do not consider finding candidates from candidate windows as such cost is same for any strategy.

  7. http://www.cse.unsw.edu.au/~weiw/project/simjoin.html.

  8. http://www.informatik.uni-trier.de/~ley/db.

  9. http://www.ncbi.nlm.nih.gov/pubmed.

  10. http://portal.acm.org.

  11. http://www.ncbi.nlm.nih.gov/genome.

References

  1. Agrawal, S., Chakrabarti, K., Chaudhuri, S., Ganti, V.: Scalable ad-hoc entity extraction from text collections. PVLDB 1(1), 945–957 (2008)

    Google Scholar 

  2. Arasu, A., Ganti, V., Kaushik, R.: Efficient exact setsimilarity joins. In: VLDB, pp. 918–929 (2006)

  3. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In WWW, pp. 131–140 (2007)

  4. Chakrabarti, K., Chaudhuri, S., Ganti, V., Xin, D.: An efficient filter for approximate membership checking. In: SIGMOD Conference, pp. 805–818 (2008)

  5. Chandel, A., Nagesh, P. C., Sarawagi, S.: Efficient batch top-k search for dictionary-based entity recognition. In: ICDE, pp. 28 (2006)

  6. Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: SIGMOD Conference, pp. 313–324 (2003)

  7. Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In ICDE, pp. 5–16 (2006)

  8. Chaudhuri, S., Ganti, V., Motwani, R.: Robust identification of fuzzy duplicates. In: ICDE, pp. 865–876 (2005)

  9. Chaudhuri, S., Ganti, V., Xin, D.: Mining document collections to facilitate accurate approximate entity matching. PVLDB 2(1), 395–406 (2009)

    Google Scholar 

  10. Deng, D., Li, G., Feng, J.: An efficient trie-based method for approximate entity extraction with editdistance constraints. In: ICDE, pp. 762–773 (2012)

  11. Deng, D., Li, G., Feng, J., Li, W.-S.: Top-k string similarity search with edit-distance constraints. In: ICDE, pp. 925–936 (2013)

  12. Feng, J., Wang, J., Li, G.: Trie-join: a trie-based method for efficient string similarity joins. VLDB J. 21(4), 437–461 (2012)

    Article  Google Scholar 

  13. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001)

  14. Hadjieleftheriou, M., Chandel, A., Koudas, N., Srivastava, D.: Fast indexes and algorithms for set similarity selection queries. In: ICDE, pp. 267–276 (2008)

  15. Hadjieleftheriou, M., Koudas, N., Srivastava, D.: Incremental maintenance of length normalized indexes for approximate string matching. In: SIGMOD Conference, pp. 429–440 (2009)

  16. Hadjieleftheriou, M., Yu, X., Koudas, N., Srivastava, D.: Hashed samples: selectivity estimators for set similarity selection queries. PVLDB 1(1), 201–212 (2008)

    Google Scholar 

  17. Kim, M.-S., Whang, K.-Y., Lee, J.-G., Lee, M.-J.: ngram/ 2l: a space and time efficient two-level n-gram inverted index structure. In: VLDB, pp. 325–336 (2005)

  18. Koudas, N., Li, C., Tung, A.K.H., Vernica, R.: Relaxing join and selection queries. In: VLDB, pp. 199–210 (2006)

  19. Lee, H., Ng, R.T., Shim, K.: Extending q-grams to estimate selectivity of string matching with low edit distance. In: VLDB, pp. 195–206 (2007)

  20. Lee, H., Ng, R.T., Shim, K.: Power-law based estimation of set similarity join size. PVLDB 2(1), 658–669 (2009)

    Google Scholar 

  21. Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE, pp. 257–266 (2008)

  22. Li, C., Wang, B., Yang, X.: Vgram: Improving performance of approximate queries on string collections using variable-length grams. In: VLDB, pp. 303–314 (2007)

  23. Li, G., Deng, D., Feng, J.: Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction. In: SIGMOD Conference, pp. 529–540 (2011)

  24. Li, G., Deng, D., Feng, J.: A partition-based method for string similarity joins with edit-distance constraints. ACM Trans. Database Syst. 38(2), 9 (2013)

    Article  MathSciNet  Google Scholar 

  25. Li, G., Deng, D., Wang, J., Feng, J.: Pass-join: a partition-based method for similarity joins. PVLDB 5(3), 253–264 (2011)

    Google Scholar 

  26. Lu, J., Han, J., Meng, X.: Efficient algorithms for approximate member extraction using signature-based inverted lists. In: CIKM, pp. 315–324 (2009)

  27. Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: SIGMOD Conference, pp. 743–754 (2004)

  28. Wang, J., Li, G., Feng, J.: Trie-join: efficient trie-based string similarity joins with edit-distance constraints. PVLDB 3(1), 1219–1230 (2010)

    Google Scholar 

  29. Wang, J., Li, G., Feng, J.: Fast-join: an efficient method for fuzzy token matching based string similarity join. In: ICDE, pp. 458–469 (2011)

  30. Wang, J., Li, G., Feng, J.: Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In: SIGMOD conference, pp. 85–96 (2012)

  31. Wang, W., Xiao, C., Lin, X., Zhang, C.: Efficient approximate entity extraction with edit distance constraints. In: SIGMOD Conference (2009)

  32. Xiao, C., Wang, W., Lin, X.: Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 1(1), 933–944 (2008)

    Google Scholar 

  33. Xiao, C., Wang, W., Lin, X., Shang, H.: Top-k set similarity joins. In: ICDE, pp. 916–927 (2009)

  34. Xiao, C., Wang, W., Lin, X. and Yu, J.X.: Efficient similarity joins for near duplicate detection. In: WWW (2008)

Download references

Acknowledgments

This work was partly supported by the National Natural Science Foundation of China under Grant No. 61272090 and 61373024, National Grand Fundamental Research 973 Program of China under Grant No. 2011CB302206, Beijing Higher Education Young Elite Teacher Project under Grant No. YETP0105, a project of Tsinghua University under Grant No. 20111081073, Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology, the “NExT Research Center” funded by MDA, Singapore, under Grant No. WBS:R-252-300-001-490, and the FDCT/106/2012/A3.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dong Deng.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Deng, D., Li, G., Feng, J. et al. A unified framework for approximate dictionary-based entity extraction. The VLDB Journal 24, 143–167 (2015). https://doi.org/10.1007/s00778-014-0367-9

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-014-0367-9

Keywords

Navigation