Self-tuning in Graph-Based Reference Disambiguation

  • Conference paper
Advances in Databases: Concepts, Systems and Applications (DASFAA 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4443)

Abstract

Many data mining and analysis applications now use graph analysis techniques for decision making. Many of these techniques rely on the importance of relationships among the interacting units. A number of models and measures for analyzing relationship importance (link structure) have been proposed (e.g., centrality, importance, and PageRank), but they are generally based on intuition: the analyst decides which model reasonably fits the underlying data. In this paper, we address the problem of learning such models directly from training data. Specifically, we study a way to calibrate a connection strength measure from training data in the context of the reference disambiguation problem. Experimental evaluation demonstrates that the proposed model surpasses the best model previously used for reference disambiguation, leading to higher-quality disambiguation.
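
As a rough illustration of what calibrating a connection strength measure from training data can look like, the sketch below scores each candidate entity by a weighted count of the path types that connect it to the context of a reference, and adjusts the weights whenever a known-correct candidate fails to outscore a wrong one. This is not the procedure from the paper, which the abstract does not specify; the path-type names, the perceptron-style update, and all function names are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical representation: path type -> number of such paths linking
# a candidate entity to the context of the reference being disambiguated.
PathCounts = Dict[str, int]


def connection_strength(counts: PathCounts, weights: Dict[str, float]) -> float:
    """Weighted sum of path counts (an assumed, simplified measure)."""
    return sum(weights.get(t, 0.0) * n for t, n in counts.items())


def calibrate(
    training: List[Tuple[PathCounts, List[PathCounts]]],
    path_types: Sequence[str],
    epochs: int = 50,
    lr: float = 0.1,
) -> Dict[str, float]:
    """Learn non-negative path-type weights from labeled references.

    Each training item is (counts for the correct candidate,
    list of counts for the wrong candidates). Perceptron-style update,
    chosen for brevity; not the paper's method.
    """
    weights = {t: 1.0 for t in path_types}
    for _ in range(epochs):
        for correct, wrongs in training:
            for wrong in wrongs:
                # Update only when the correct candidate does not win.
                if connection_strength(correct, weights) <= connection_strength(wrong, weights):
                    for t in path_types:
                        delta = correct.get(t, 0) - wrong.get(t, 0)
                        weights[t] = max(0.0, weights[t] + lr * delta)
    return weights


def disambiguate(options: Dict[str, PathCounts], weights: Dict[str, float]) -> str:
    """Pick the candidate with the highest connection strength."""
    return max(options, key=lambda name: connection_strength(options[name], weights))


# Toy training set with made-up path types and counts.
toy_training = [({"coauthor": 2}, [{"same_affiliation": 3}])]
toy_weights = calibrate(toy_training, ["coauthor", "same_affiliation"])
print(toy_weights)
```

With uniform initial weights the toy example is misclassified, since three affiliation paths outweigh two co-authorship paths; after calibration the co-authorship weight is larger and the correct candidate wins.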

This material is based upon work supported by the National Science Foundation under Award Numbers 0331707 and 0331690. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Editor information

Ramamohanarao Kotagiri, P. Radha Krishna, Mukesh Mohania, Ekawit Nantajeewarawat

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nuray-Turan, R., Kalashnikov, D.V., Mehrotra, S. (2007). Self-tuning in Graph-Based Reference Disambiguation. In: Kotagiri, R., Krishna, P.R., Mohania, M., Nantajeewarawat, E. (eds) Advances in Databases: Concepts, Systems and Applications. DASFAA 2007. Lecture Notes in Computer Science, vol 4443. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71703-4_29

  • DOI: https://doi.org/10.1007/978-3-540-71703-4_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-71702-7

  • Online ISBN: 978-3-540-71703-4

  • eBook Packages: Computer Science, Computer Science (R0)
