PACE: A General-Purpose Tool for Authority Control

  • Paolo Manghi
  • Marko Mikulicic
Part of the Communications in Computer and Information Science book series (CCIS, volume 240)

Abstract

Curating the records of an authority file is an activity as important as it is demanding for many organizations, which have to rely on experts equipped with so-called authority control tools capable of automatically supporting complex disambiguation workflows through user-friendly interfaces. This paper presents PACE, an open source authority control tool which offers user interfaces for (i) customizing the structure (ontology) of authority files, (ii) tuning the probabilistic disambiguation of authority files through a set of similarity functions that detect records which are candidates for duplication or overload, (iii) curating such authority files by applying record merge and split actions, and (iv) exposing authority files to third-party consumers in several ways. PACE’s back-end is based on Cassandra’s “NoSQL” technology to offer (i) read-write performance that scales linearly with the number of records and (ii) parallel and efficient (MapReduce-based) record sorting and matching algorithms.
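
To make the idea of similarity-based candidate detection concrete, the following is a minimal single-machine sketch in Python, not PACE's actual implementation. It assumes a sorted-neighborhood strategy (sort the records by a key, then compare each record only with its nearest neighbors in that order) and a weighted average of per-field string similarities. The function names, field names, weights, window size, threshold, and sample records are all hypothetical and chosen only for illustration; the abstract does not specify which similarity functions or matching strategy PACE uses.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted average of per-field similarities (weights are an assumption)."""
    total = sum(weights.values())
    score = sum(
        w * field_similarity(r1.get(field, ""), r2.get(field, ""))
        for field, w in weights.items()
    )
    return score / total


def candidate_duplicates(records, sort_key, weights, window=3, threshold=0.85):
    """Sort records, then compare each one with the next (window - 1) records."""
    ordered = sorted(records, key=sort_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            score = record_similarity(rec, other, weights)
            if score >= threshold:
                pairs.append((rec["id"], other["id"], round(score, 3)))
    return pairs


# Hypothetical person records; not taken from any real authority file.
records = [
    {"id": "a1", "name": "Rossi, Maria", "affiliation": "University of Pisa"},
    {"id": "a2", "name": "Rossi, Maria A.", "affiliation": "University of Pisa"},
    {"id": "a3", "name": "Bianchi, Luca", "affiliation": "University of Bologna"},
]

pairs = candidate_duplicates(
    records,
    sort_key=lambda r: r["name"].lower(),
    weights={"name": 2.0, "affiliation": 1.0},
)
print(pairs)
```

In this sketch the sliding window bounds the number of comparisons to roughly the number of records times the window size, rather than comparing every pair. A curator would then review the reported pairs and decide whether to merge them.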

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Paolo Manghi (1)
  • Marko Mikulicic (1)
  1. Istituto di Scienza e Tecnologie dell’Informazione “Alessandro Faedo”, Consiglio Nazionale delle Ricerche, Pisa, Italy