Object Identification with Attribute-Mediated Dependences

  • Parag Singla
  • Pedro Domingos
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3721)


Object identification is the problem of determining whether different observations correspond to the same object. It occurs in a wide variety of fields, including vision, natural language, citation matching, and information integration. Traditionally, the problem is solved separately for each pair of observations, followed by transitive closure. We propose solving it collectively, performing simultaneous inference for all candidate match pairs, and allowing information to propagate from one candidate match to another via the attributes they have in common. Our formulation is based on conditional random fields, and allows an optimal solution to be found in polynomial time using a graph cut algorithm. Parameters are learned using a voted perceptron algorithm. Experiments on real and synthetic datasets show that this approach outperforms the standard one.
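For orientation, the standard baseline mentioned above (independent pairwise decisions followed by transitive closure) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `transitive_closure`, the integer record ids, and the sample match list are all assumptions made for the example, and the pairwise classifier that would produce the matches is left out.

```python
def transitive_closure(num_records, matched_pairs):
    """Group records into clusters from independent pairwise match decisions.

    Hypothetical helper: records are assumed to be numbered 0..num_records-1,
    and matched_pairs holds the pairs some pairwise matcher labeled "same".
    """
    parent = list(range(num_records))  # union-find forest, one root per cluster

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for record in range(num_records):
        clusters.setdefault(find(record), []).append(record)
    return sorted(clusters.values())


# Assumed pairwise decisions: 0-1 and 1-2 match, so 0 and 2 end up merged too.
matched = [(0, 1), (1, 2), (3, 4)]
print(transitive_closure(6, matched))  # [[0, 1, 2], [3, 4], [5]]
```

Note that each pairwise decision here is made in isolation and only merged afterwards, so one wrong match can join two clusters. The collective approach described in the abstract instead infers all candidate pairs jointly.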


Keywords (machine-generated, not supplied by the authors): Transitive Closure; Conditional Random Field; Collective Model; Candidate Pair; Record Pair



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Parag Singla (1)
  • Pedro Domingos (1)

  1. Department of Computer Science and Engineering, University of Washington, Seattle, USA
