Abstract
We propose a system for automatic detection of duplicate entries in a repository of semi-structured text documents. The proposed system employs text-entity recognition to extract information regarding time, location, names of persons and organizations, as well as events described within the document content. With structured representations of the content, called “metamodels”, we group the entries into clusters based on the similarity of the contents. Then we apply machine-learning algorithms to the clusters to carry out duplicate detection. We present results regarding precision, recall, and F-value of the proposed system.
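The abstract does not give the metamodel fields, the similarity measure, or the learning algorithm. The following Python sketch only illustrates the general idea under stated assumptions: the `Metamodel` class, the per-field Jaccard similarity, and the unweighted mean are hypothetical choices made for illustration and are not the authors' method. The evaluation helper computes precision, recall, and F-value in the standard way over sets of document pairs flagged as duplicates.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Metamodel:
    """Hypothetical structured view of one document: sets of extracted entities."""
    times: frozenset = field(default_factory=frozenset)
    locations: frozenset = field(default_factory=frozenset)
    names: frozenset = field(default_factory=frozenset)
    events: frozenset = field(default_factory=frozenset)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(m1, m2):
    """Unweighted mean of per-field Jaccard similarities (an assumed measure)."""
    fields = ("times", "locations", "names", "events")
    return sum(jaccard(getattr(m1, f), getattr(m2, f)) for f in fields) / len(fields)


def precision_recall_f(predicted, actual):
    """Precision, recall, and F-value (harmonic mean) over sets of duplicate pairs."""
    tp = len(predicted & actual)  # pairs correctly flagged as duplicates
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

In a pipeline of the kind the abstract outlines, a similarity score like this could drive the clustering step, and the pairs within each cluster would then be passed to a trained classifier. Which features and classifier are used is not stated here, so that part is left out of the sketch.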
© 2014 Springer International Publishing Switzerland
Cordero Cruz, J.A., Garza, S.E., Schaeffer, S.E. (2014). Entity Recognition for Duplicate Filtering. In: Traina, A.J.M., Traina, C., Cordeiro, R.L.F. (eds) Similarity Search and Applications. SISAP 2014. Lecture Notes in Computer Science, vol 8821. Springer, Cham. https://doi.org/10.1007/978-3-319-11988-5_24
DOI: https://doi.org/10.1007/978-3-319-11988-5_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11987-8
Online ISBN: 978-3-319-11988-5
eBook Packages: Computer Science (R0)