A unified framework for approximate dictionary-based entity extraction

Deng, Dong; Li, Guoliang; Feng, Jianhua; Duan, Yi; Gong, Zhiguo

doi:10.1007/s00778-014-0367-9

Regular Paper
Published: 05 August 2014

Volume 24, pages 143–167, (2015)
The VLDB Journal

Dong Deng¹,
Guoliang Li¹,
Jianhua Feng¹,
Yi Duan² &
…
Zhiguo Gong³

Abstract

Dictionary-based entity extraction identifies predefined entities (e.g., person names or locations) from documents. A recent trend for improving extraction recall is to support approximate entity extraction, which finds all substrings in documents that approximately match entities in a given dictionary. Existing methods for this problem support either token-based similarity (e.g., Jaccard similarity) or character-based dissimilarity (e.g., edit distance). This calls for a unified method that supports various similarity/dissimilarity functions, since a unified method can reduce the programming effort, the hardware requirements, and the manpower. In this paper, we propose a unified framework to support various similarity/dissimilarity functions, such as Jaccard similarity, cosine similarity, Dice similarity, edit similarity, and edit distance. Since many real-world applications have high-performance requirements for approximate entity extraction on data streams (e.g., Twitter), we focus on devising efficient algorithms to achieve high performance. We find that many substrings in documents overlap, and we can share computation across the overlaps to avoid redundant computation. To this end, we propose efficient filtering algorithms and develop effective pruning techniques. Experimental results show that our method achieves high performance and significantly outperforms state-of-the-art methods.
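For readers unfamiliar with the five functions named in the abstract, the following is a minimal Python sketch of their commonly used definitions. It is not the paper's implementation: the function names are hypothetical, the token-based measures are assumed to operate on non-empty token lists treated as multisets, cosine similarity is assumed to be the unweighted form |e ∩ s| / sqrt(|e| · |s|), and edit similarity is assumed to be normalized by the longer string.

```python
from collections import Counter
from math import sqrt


def overlap(e, s):
    """Size of the multiset intersection of two token lists."""
    return sum((Counter(e) & Counter(s)).values())


def jaccard(e, s):
    """|e ∩ s| / |e ∪ s| on multisets (assumes non-empty inputs)."""
    o = overlap(e, s)
    return o / (len(e) + len(s) - o)


def cosine(e, s):
    """|e ∩ s| / sqrt(|e| * |s|), the unweighted form (an assumption)."""
    return overlap(e, s) / sqrt(len(e) * len(s))


def dice(e, s):
    """2 * |e ∩ s| / (|e| + |s|)."""
    return 2 * overlap(e, s) / (len(e) + len(s))


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """1 - ed(a, b) / max(|a|, |b|); normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest
```

The three token-based measures depend only on |e|, |s|, and |e ∩ s|, which is one reason a single framework can plausibly cover all of them; the two character-based measures share the same edit-distance computation.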

Notes

In this paper, we omit the proof due to space constraints.
In this paper, we take \(e\) and \(s\) as multisets, since there may exist duplicate tokens in entities and substrings of the document. Even if they are taken as sets, we can also use our method for extraction.
For ease of presentation, we use a loser tree to represent a heap structure in our examples.
Note that we can obtain entity \(e\)’s token number \(|e|\) from a hash map that stores each entity together with its token number, so the token number of an entity can be retrieved in \(\mathcal {O}(1)\) time.
As \(D[p_i\cdots p_j]\) may contain duplicate tokens, \(|P_{e}[i \cdots j]| \ge |e\cap s|\) and \(|P_{e}[i \cdots j]|\) may also be larger than \(|e|\).
Notice that we do not consider the cost of finding candidates from candidate windows, as this cost is the same for any strategy.
http://www.cse.unsw.edu.au/~weiw/project/simjoin.html.
http://www.informatik.uni-trier.de/~ley/db.
http://www.ncbi.nlm.nih.gov/pubmed.
http://portal.acm.org.
http://www.ncbi.nlm.nih.gov/genome.
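
Two of the notes above, on treating \(e\) and \(s\) as multisets and on looking up \(|e|\) in a hash map, can be illustrated with a short Python sketch. The dictionary entries, the whitespace tokenizer, and all names below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import Counter

# Hypothetical dictionary of entities, tokenized on whitespace (an assumption).
dictionary = ["new york", "new new york"]

# Hash map from entity to its token number |e|, giving O(1) expected lookup.
token_count = {entity: len(entity.split()) for entity in dictionary}


def multiset_overlap(e_tokens, s_tokens):
    """|e ∩ s| when duplicate tokens are counted (multiset semantics)."""
    return sum((Counter(e_tokens) & Counter(s_tokens)).values())


def set_overlap(e_tokens, s_tokens):
    """|e ∩ s| when duplicate tokens are collapsed (set semantics)."""
    return len(set(e_tokens) & set(s_tokens))


e = "new new york".split()
s = "new york new".split()
print(token_count["new new york"])  # 3
print(multiset_overlap(e, s))       # 3
print(set_overlap(e, s))            # 2
```

With duplicate tokens, the multiset and set views give different overlap sizes for the same pair, which is why the choice has to be fixed before any similarity threshold is applied.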

Acknowledgments

This work was partly supported by the National Natural Science Foundation of China under Grant No. 61272090 and 61373024, National Grand Fundamental Research 973 Program of China under Grant No. 2011CB302206, Beijing Higher Education Young Elite Teacher Project under Grant No. YETP0105, a project of Tsinghua University under Grant No. 20111081073, Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology, the “NExT Research Center” funded by MDA, Singapore, under Grant No. WBS:R-252-300-001-490, and the FDCT/106/2012/A3.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Dong Deng, Guoliang Li & Jianhua Feng
School of Software, Beihang University, Beijing, China
Yi Duan
University of Macau, Macau, China
Zhiguo Gong

Corresponding author

Correspondence to Dong Deng.

About this article

Cite this article

Deng, D., Li, G., Feng, J. et al. A unified framework for approximate dictionary-based entity extraction. The VLDB Journal 24, 143–167 (2015). https://doi.org/10.1007/s00778-014-0367-9

Received: 12 November 2013
Revised: 28 April 2014
Accepted: 11 July 2014
Published: 05 August 2014
Issue Date: February 2015
DOI: https://doi.org/10.1007/s00778-014-0367-9
