Entity Recognition for Duplicate Filtering

Cordero Cruz, J. A.; Garza, Sara E.; Schaeffer, S. E.

doi:10.1007/978-3-319-11988-5_24

Conference paper


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8821)

Abstract

We propose a system for automatic detection of duplicate entries in a repository of semi-structured text documents. The proposed system employs text-entity recognition to extract information regarding time, location, names of persons and organizations, as well as events described within the document content. With structured representations of the content, called “metamodels”, we group the entries into clusters based on the similarity of the contents. Then we apply machine-learning algorithms to the clusters to carry out duplicate detection. We present results regarding precision, recall, and F-value of the proposed system.
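
The pipeline described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state which similarity measure, clustering method, or learning algorithm the system uses, so the per-field Jaccard similarity, the fixed threshold that stands in for the clustering and learning steps, the field names, and the sample documents below are all assumptions made for illustration. The last function computes precision, recall, and F-value over predicted duplicate pairs using the standard definitions.

```python
from itertools import combinations

# Hypothetical field names, one per kind of entity listed in the abstract.
FIELDS = ("time", "location", "persons", "organizations", "events")


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def metamodel_similarity(m1, m2):
    """Mean per-field Jaccard similarity (an assumed measure, not the paper's)."""
    total = sum(jaccard(m1.get(f, set()), m2.get(f, set())) for f in FIELDS)
    return total / len(FIELDS)


def candidate_pairs(metamodels, threshold=0.5):
    """Index pairs (i, j), i < j, whose similarity meets the assumed threshold.

    A plain threshold replaces the clustering and machine-learning stages here.
    """
    return {
        (i, j)
        for (i, m1), (j, m2) in combinations(enumerate(metamodels), 2)
        if metamodel_similarity(m1, m2) >= threshold
    }


def precision_recall_f(predicted, actual):
    """Precision, recall, and F-value for sets of predicted and true pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented sample metamodels: the first two describe the same event.
docs = [
    {"time": {"2014-03-01"}, "location": {"Monterrey"}, "persons": {"Ana Lopez"},
     "organizations": {"UANL"}, "events": {"concert"}},
    {"time": {"2014-03-01"}, "location": {"Monterrey"}, "persons": {"Ana Lopez"},
     "organizations": {"UANL"}, "events": {"concert", "festival"}},
    {"time": {"2014-05-10"}, "location": {"Helsinki"}, "persons": set(),
     "organizations": {"HIIT"}, "events": {"workshop"}},
]

predicted = candidate_pairs(docs)
scores = precision_recall_f(predicted, actual={(0, 1)})
```

With these sample documents only the first two entries share enough entities to be flagged as a candidate duplicate pair. In a fuller system the threshold would be replaced by whatever clustering and classification steps the evaluation calls for.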



Author information

Authors and Affiliations

FIME, UANL, Cd. Universitaria, San Nicolás de los Garza, NL, Mexico
J. A. Cordero Cruz, Sara E. Garza & S. E. Schaeffer
CIIDIT, UANL, PIIT Monterrey, NL, Mexico
S. E. Schaeffer
HIIT, University of Helsinki, Kumpula Campus, Helsinki, Finland
S. E. Schaeffer


Editor information

Editors and Affiliations

Department of Computer Science, University of São Paulo, São Carlos, Brazil
Agma Juci Machado Traina
University of Sao Paulo at Sao Carlos - USP, Av. do Trabalhador Saocarlense 400, 13566-590, Sao Carlos, Brazil
Caetano Traina Jr.
University of Sao Paulo at Sao Carlos - USP, Av. do Trabalhador Saocarlense 400, 13566-590, Sao Carlos, Brazil
Robson Leonardo Ferreira Cordeiro


About this paper

Cite this paper

Cordero Cruz, J.A., Garza, S.E., Schaeffer, S.E. (2014). Entity Recognition for Duplicate Filtering. In: Traina, A.J.M., Traina, C., Cordeiro, R.L.F. (eds) Similarity Search and Applications. SISAP 2014. Lecture Notes in Computer Science, vol 8821. Springer, Cham. https://doi.org/10.1007/978-3-319-11988-5_24


DOI: https://doi.org/10.1007/978-3-319-11988-5_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11987-8
Online ISBN: 978-3-319-11988-5
eBook Packages: Computer Science, Computer Science (R0)
