New Generation Computing

Volume 22, Issue 3, pp 239–252

Selecting potentially relevant records using re-identification methods

  • Josep Domingo-Ferrer
  • Vicenç Torra
Regular Papers

Abstract

This work proposes using re-identification algorithms to select records that are interesting because they provide new information. Instead of focusing on re-identified records, we focus on non-re-identified (non-linked) records, as these are the ones that potentially supply new and relevant information. Moreover, these relevant characteristics can correspond to chances for improving the knowledge of a system.

To evaluate our approach, we have applied it to an example using publicly available data from the UCI repository. We have used the data of the ionosphere database to build a re-identification problem for 35 non-common variables.

We show that the use of a simple heuristic rule base can effectively select potentially interesting records.
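The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of keeping the records that fail to link. It assumes, purely for illustration, that both files share the same numeric variables (a simpler setting than the non-common variables studied in the paper) and that a record counts as linked when its nearest neighbour in the other file lies within a distance threshold. The function names, the toy data, and the threshold are hypothetical.

```python
import math


def nearest_distance(record, candidates):
    """Euclidean distance from `record` to its closest record in `candidates`."""
    return min(math.dist(record, c) for c in candidates)


def select_unlinked(file_a, file_b, threshold):
    """Indices of records in `file_b` that have no record of `file_a`
    within `threshold` (i.e. records that were not re-identified)."""
    return [
        i
        for i, rec in enumerate(file_b)
        if nearest_distance(rec, file_a) > threshold
    ]


# Toy data (hypothetical): two files described by the same two variables.
file_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
file_b = [(0.1, 0.0), (1.0, 0.9), (5.0, 5.0)]

# The third record of file_b is far from every record of file_a,
# so it is the only one flagged as potentially interesting.
unlinked = select_unlinked(file_a, file_b, threshold=0.5)
print(unlinked)
```

In this sketch the threshold plays the role of a very crude selection rule; the heuristic rule base mentioned in the abstract is presumably richer than a single distance cut-off.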

Keywords

Chance Discovery, Knowledge Discovery in Databases, Data Mining, Multi-database Mining, Re-identification Algorithms, Record Selection, Record Linkage

Copyright information

© Ohmsha, Ltd. and Springer 2004

Authors and Affiliations

  • Josep Domingo-Ferrer (1)
  • Vicenç Torra (2)
  1. Dept. Comput. Eng. and Maths-ETSE, Universitat Rovira i Virgili, Tarragona, Catalonia, Spain
  2. Institut d’Investigació en Intel·ligència Artificial-CSIC, Bellaterra, Catalonia, Spain
