PACE: A General-Purpose Tool for Authority Control
Curating the records of an authority file is an activity as important as it is demanding for many organizations, which must rely on experts equipped with so-called authority control tools capable of automatically supporting complex disambiguation workflows through user-friendly interfaces. This paper presents PACE, an open source authority control tool that offers user interfaces for (i) customizing the structure (ontology) of authority files, (ii) tuning the probabilistic disambiguation of authority files through a set of similarity functions that detect records that are candidates for duplication and overload, (iii) curating such authority files by applying record merge and split actions, and (iv) exposing authority files to third-party consumers in several ways. PACE’s back-end is based on Cassandra’s NoSQL technology to offer (i) read-write performance that scales linearly with the number of records and (ii) parallel and efficient (MapReduce-based) record sorting and matching algorithms.
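To make the idea of similarity-based candidate detection concrete, the following is a minimal single-machine sketch, not PACE's actual implementation. It assumes a sorted-neighborhood strategy: records are sorted by a key, and only records that fall within a small sliding window are compared using a weighted average of per-field string similarities. The field names, weights, window size, threshold, and sample records are all hypothetical, and Python's standard `difflib.SequenceMatcher` stands in for whatever similarity functions a real deployment would configure.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: Record, r2: Record, weights: Dict[str, float]) -> float:
    """Weighted average of per-field similarities (weights are assumptions)."""
    total = sum(weights.values())
    score = sum(
        w * string_similarity(r1.get(field, ""), r2.get(field, ""))
        for field, w in weights.items()
    )
    return score / total


def candidate_duplicates(
    records: List[Record],
    key: Callable[[Record], str],
    weights: Dict[str, float],
    window: int = 3,
    threshold: float = 0.85,
) -> List[Tuple[str, str, float]]:
    """Sort records by `key`, then compare each record only with the
    next `window - 1` records in sorted order."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + window]:
            score = record_similarity(left, right, weights)
            if score >= threshold:
                pairs.append((left["id"], right["id"], score))
    return pairs


# Made-up sample records, for illustration only.
records = [
    {"id": "a1", "name": "Smith, John", "affiliation": "Example University"},
    {"id": "a2", "name": "Smith, Jon", "affiliation": "Example University"},
    {"id": "a3", "name": "Zhang, Wei", "affiliation": "Other Institute"},
]

pairs = candidate_duplicates(
    records,
    key=lambda r: r["name"].lower(),
    weights={"name": 0.7, "affiliation": 0.3},
)
for left_id, right_id, score in pairs:
    print(left_id, right_id, round(score, 3))
```

The sliding window keeps the number of comparisons roughly linear in the number of records rather than quadratic, which is the property that makes the sort-then-match pattern attractive for parallel execution. Pairs flagged here would only be candidates for a curator to merge or reject.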