A Multi-layer Naïve Bayes Model for Approximate Identity Matching

  • G. Alan Wang
  • Hsinchun Chen
  • Homa Atabakhsh
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3975)


Identity management is critical to various governmental practices, ranging from providing citizen services to enforcing homeland security. Searching for a specific identity is difficult because multiple representations of the same identity may exist as a result of unintentional errors and intentional deception. We propose a Naïve Bayes identity matching model that improves on existing techniques in terms of effectiveness. Experiments show that our proposed model performs significantly better than the exact-match based technique and achieves higher precision than the record comparison technique. In addition, our model greatly reduces the effort of manually labeling training instances by employing a semi-supervised learning approach. This training method outperforms both fully supervised and unsupervised learning. With a training dataset in which only 30% of the instances are labeled, our model achieves performance comparable to that of fully supervised learning.
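As an illustration of the general idea of semi-supervised Naïve Bayes matching, the following is a minimal sketch and not the authors' multi-layer model. It assumes each pair of identity records has already been reduced to binary agreement features (hypothetically: name, date of birth, ID number, address), uses a single-layer Bernoulli Naïve Bayes classifier, and assumes expectation-maximization (EM) as the way to fold in unlabeled pairs. All function names, feature choices, and data values are invented for the example.

```python
import math


def m_step(rows, alpha=1.0):
    """Estimate parameters from (features, P(match)) rows with Laplace smoothing."""
    n_feat = len(rows[0][0])
    w1 = sum(w for _, w in rows)          # expected number of matching pairs
    w0 = len(rows) - w1                   # expected number of non-matching pairs
    prior = (w1 + alpha) / (len(rows) + 2 * alpha)
    theta1 = [(sum(w * x[j] for x, w in rows) + alpha) / (w1 + 2 * alpha)
              for j in range(n_feat)]
    theta0 = [(sum((1 - w) * x[j] for x, w in rows) + alpha) / (w0 + 2 * alpha)
              for j in range(n_feat)]
    return prior, theta0, theta1


def posterior(x, model):
    """Return P(match | x) under the Naive Bayes independence assumption."""
    prior, theta0, theta1 = model
    log1, log0 = math.log(prior), math.log(1 - prior)
    for xj, p0, p1 in zip(x, theta0, theta1):
        log1 += math.log(p1 if xj else 1 - p1)
        log0 += math.log(p0 if xj else 1 - p0)
    m = max(log0, log1)                   # subtract the max for numerical stability
    e1, e0 = math.exp(log1 - m), math.exp(log0 - m)
    return e1 / (e0 + e1)


def fit(labeled, unlabeled, n_iter=20):
    """Initialize from labeled pairs, then alternate E and M steps."""
    hard = [(x, float(y)) for x, y in labeled]   # labeled pairs keep fixed labels
    model = m_step(hard)
    for _ in range(n_iter):
        soft = [(x, posterior(x, model)) for x in unlabeled]  # E-step
        model = m_step(hard + soft)                           # M-step
    return model


# Hypothetical features: [name agrees, birth date agrees, ID agrees, address agrees]
labeled = [([1, 1, 1, 1], 1), ([1, 1, 0, 1], 1),
           ([0, 0, 0, 0], 0), ([1, 0, 0, 0], 0)]
unlabeled = [[1, 1, 1, 0], [0, 0, 0, 1], [1, 1, 1, 1],
             [0, 0, 0, 0], [0, 1, 0, 0], [1, 0, 1, 1]]

model = fit(labeled, unlabeled)
print(posterior([1, 1, 1, 1], model))  # probability that a fully agreeing pair matches
print(posterior([0, 0, 0, 0], model))  # probability that a fully disagreeing pair matches
```

In this sketch the labeled pairs anchor the two classes while the unlabeled pairs contribute fractional counts, which is one common way a small labeled fraction can be combined with a larger unlabeled pool.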


Keywords: Bayesian Network, Training Dataset, Unsupervised Learning, Identity Information, Approximate Identity
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • G. Alan Wang (1)
  • Hsinchun Chen (1)
  • Homa Atabakhsh (1)
  1. Department of Management Information Systems, University of Arizona, Tucson, U.S.A.
