Abstract
Many data mining and analysis applications now rely on graph analysis techniques for decision making. Many of these techniques are based on the importance of relationships among the interacting units. A number of models and measures that analyze relationship importance (link structure) have been proposed (e.g., centrality, importance, and PageRank). They are generally based on intuition: the analyst decides which model reasonably fits the underlying data. In this paper, we address the problem of learning such models directly from training data. Specifically, we study a way to calibrate a connection strength measure from training data in the context of the reference disambiguation problem. Experimental evaluation demonstrates that the proposed model surpasses the best model previously used for reference disambiguation, leading to higher-quality disambiguation.
This material is based upon work supported by the National Science Foundation under Award Numbers 0331707 and 0331690. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nuray-Turan, R., Kalashnikov, D.V., Mehrotra, S. (2007). Self-tuning in Graph-Based Reference Disambiguation. In: Kotagiri, R., Krishna, P.R., Mohania, M., Nantajeewarawat, E. (eds) Advances in Databases: Concepts, Systems and Applications. DASFAA 2007. Lecture Notes in Computer Science, vol 4443. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71703-4_29
DOI: https://doi.org/10.1007/978-3-540-71703-4_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71702-7
Online ISBN: 978-3-540-71703-4
eBook Packages: Computer Science, Computer Science (R0)