Linking temporal records
Research Article
First Online: 20 May 2012
Received: 01 January 2012
Accepted: 02 February 2012
DOI: 10.1007/s11704-012-2002-5
Cite this article as: Li, P., Dong, X.L., Maurino, A. et al. Front. Comput. Sci. (2012) 6: 293. doi:10.1007/s11704-012-2002-5
Abstract
Many data sets contain temporal records which span a long period of time; each record is associated with a time stamp and describes some aspects of a real-world entity at a particular time (e.g., author information in DBLP). In such cases, we often wish to identify records that describe the same entity over time and so be able to perform interesting longitudinal data analysis. However, existing record linkage techniques ignore temporal information and fall short for temporal data.
This article studies linking temporal records. First, we apply time decay to capture the effect of elapsed time on entity value evolution. Second, instead of comparing each pair of records locally, we propose clustering methods that consider the time order of the records and make global decisions. Experimental results show that our algorithms significantly outperform traditional linkage methods on various temporal data sets.
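The two ideas in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: the exponential decay form, the half-life values, the exact-match comparison, the threshold, and the greedy one-pass clustering are all assumptions chosen for brevity, and the names `Record`, `HALF_LIFE`, `similarity`, and `link_in_time_order` are hypothetical. The sketch only shows how an attribute's evidence might be discounted as the time gap between two records grows, and how records might be assigned to clusters in time order.

```python
from dataclasses import dataclass


@dataclass
class Record:
    year: int
    name: str
    affiliation: str


# Hypothetical half-lives in years: how quickly evidence from an attribute
# fades with elapsed time. Affiliations change often, names rarely.
HALF_LIFE = {"name": 50.0, "affiliation": 3.0}


def decayed_weight(attr: str, gap_years: int) -> float:
    """Weight of an attribute after gap_years (assumed exponential decay)."""
    return 0.5 ** (gap_years / HALF_LIFE[attr])


def similarity(a: Record, b: Record) -> float:
    """Decay-weighted fraction of attributes on which two records agree."""
    gap = abs(a.year - b.year)
    agree_weight = total_weight = 0.0
    for attr in HALF_LIFE:
        w = decayed_weight(attr, gap)
        if getattr(a, attr) == getattr(b, attr):
            agree_weight += w
        total_weight += w
    return agree_weight / total_weight


def link_in_time_order(records, threshold=0.6):
    """Greedy pass in time order: attach each record to the cluster whose
    latest record is most similar, or start a new cluster."""
    clusters = []
    for rec in sorted(records, key=lambda r: r.year):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = similarity(cluster[-1], rec)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([rec])
        else:
            best.append(rec)
    return clusters


# Made-up records for illustration only.
records = [
    Record(2000, "X. Dong", "Univ. A"),
    Record(2001, "X. Dong", "Univ. A"),
    Record(2001, "P. Li", "Univ. A"),
    Record(2010, "X. Dong", "Lab B"),
]
clusters = link_in_time_order(records)
```

With these assumed settings, a changed affiliation after nine years counts for little, so the 2010 record can still join the earlier "X. Dong" records, whereas the same disagreement between two records from the same year would keep them apart. A single greedy pass like this still makes local decisions; the global clustering the abstract refers to would revisit such choices.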
Pei Li is a PhD student at Università di Milano Bicocca. Currently she is nearing the completion of her doctoral thesis in Computer Science. Previously, she studied electronic engineering at Beijing University of Posts and Telecommunications, where she received her BS and MS degrees. Her research interests are data integration and record linkage, with special focus on entity resolution with value inconsistency.
Xin Luna Dong is a researcher at AT&T Labs-Research. She received her PhD from the University of Washington in 2007, her Master’s Degree from Peking University in China in 2001, and her Bachelor’s Degree from Nankai University in China in 1998. Her research interests include databases, information retrieval, and machine learning, with an emphasis on data integration, data cleaning, personal information management, and Web search. She is co-chairing Sigmod/PODS PhD Symposium 2012, Sigmod New Researcher Symposium 2012, and QDB (Quality of DataBases) 2012, has co-chaired WebDB’10, was a co-editor of the IEEE Data Engineering special issue on Towards quality data with fusion and cleaning, and has served on the program committees of VLDB’12, Sigmod’12, VLDB’11, Sigmod’11, VLDB’10, WWW’10, ICDE’10, and VLDB’09.
Andrea Maurino is an assistant professor at Università di Milano Bicocca. His research interests cover many areas in the field of database systems and service science. In the field of data quality, his research interests are record linkage, cooperative information systems, and assessment techniques of data-intensive web applications. In the field of service science, he focuses on the analysis of quality of services and non-functional properties. He is the author of more than 40 papers in international journals and conferences, and of 4 book chapters. He was program co-chair of the QDB’09 workshop and guest editor of IEEE Internet Computing in 2010. He is a reviewer for several journals including Information Systems, Knowledge Data and Engineering.
Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his PhD from the University of Wisconsin, Madison, and his BTech from the Indian Institute of Technology, Bombay. He is a Fellow of the ACM, on the board of trustees of the VLDB Endowment, and an associate editor of the ACM Transactions on Database Systems. He has served as the associate Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering, and the program committee co-chair of many conferences, including VLDB 2007. He has presented keynote talks at several conferences, including VLDB 2010. His research interests span a variety of topics in data management.