Self-tuning in Graph-Based Reference Disambiguation

Nuray-Turan, Rabia; Kalashnikov, Dmitri V.; Mehrotra, Sharad

doi:10.1007/978-3-540-71703-4_29

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4443)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Many data mining and analysis applications now use graph analysis techniques for decision making. Many of these techniques are based on the importance of relationships among the interacting units. A number of models and measures that analyze relationship importance (link structure) have been proposed (e.g., centrality, importance, and PageRank), but they are generally based on intuition: the analyst chooses a model that seems reasonable for the underlying data. In this paper, we address the problem of learning such models directly from training data. Specifically, we study a way to calibrate a connection strength measure from training data in the context of the reference disambiguation problem. Experimental evaluation demonstrates that the proposed model surpasses the best model previously used for reference disambiguation, leading to better disambiguation quality.

This material is based upon work supported by the National Science Foundation under Award Numbers 0331707 and 0331690. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Author information

Authors and Affiliations

Computer Science Department, University of California, Irvine,
Rabia Nuray-Turan, Dmitri V. Kalashnikov & Sharad Mehrotra

Editor information

Ramamohanarao Kotagiri, P. Radha Krishna, Mukesh Mohania, Ekawit Nantajeewarawat

About this paper

Cite this paper

Nuray-Turan, R., Kalashnikov, D.V., Mehrotra, S. (2007). Self-tuning in Graph-Based Reference Disambiguation. In: Kotagiri, R., Krishna, P.R., Mohania, M., Nantajeewarawat, E. (eds) Advances in Databases: Concepts, Systems and Applications. DASFAA 2007. Lecture Notes in Computer Science, vol 4443. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71703-4_29

DOI: https://doi.org/10.1007/978-3-540-71703-4_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71702-7
Online ISBN: 978-3-540-71703-4
eBook Packages: Computer Science, Computer Science (R0)

Publish with us

Policies and ethics