Abstract
Digital Libraries continue to evolve into research environments that support access to, and management of, multiform Information Objects spread across multiple data sources and organizational domains. This evolution requires dealing with Information Objects whose traits differ from those that characterized early Digital Libraries, and revising the services that manage them. Tabular data are a class of Information Objects that must be managed efficiently because of their core role in many eScience scenarios. This paper discusses the tabular data characterization problem, i.e., the problem of identifying the reference dataset for each column of a tabular dataset. In particular, it presents an approach based on lexical matching techniques that supports users during the data curation phase by providing a ranked list of reference datasets suitable for a given column.
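To make the idea of a ranked list of reference datasets concrete, the following is a minimal sketch and not the authors' implementation. It assumes a normalized Levenshtein similarity as the lexical matcher and scores each reference dataset by the average best match of the column values. The function names, the scoring rule, and the sample data are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def rank_reference_datasets(
    column: List[str], references: Dict[str, List[str]]
) -> List[Tuple[str, float]]:
    """Rank reference datasets by how well they lexically cover a column.

    Assumed scoring rule (hypothetical): for every column value take its best
    similarity against the entries of a reference dataset, then average.
    """
    scores = []
    for name, entries in references.items():
        if not column or not entries:
            scores.append((name, 0.0))
            continue
        best = [max(similarity(value, entry) for entry in entries)
                for value in column]
        scores.append((name, sum(best) / len(best)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical example: a column with one misspelled country name.
column = ["Itly", "France", "Spain"]
references = {
    "colours": ["red", "green", "blue"],
    "countries": ["Italy", "France", "Spain", "Germany"],
}
ranking = rank_reference_datasets(column, references)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this sketch a curator would inspect the top entries of `ranking` and confirm or reject the suggestion. Other lexical measures (for example Jaro-Winkler or Soundex) could replace `similarity` without changing the ranking logic.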
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Candela, L., Coro, G., Pagano, P. (2013). Supporting Tabular Data Characterization in a Large Scale Data Infrastructure by Lexical Matching Techniques. In: Agosti, M., Esposito, F., Ferilli, S., Ferro, N. (eds) Digital Libraries and Archives. IRCDL 2012. Communications in Computer and Information Science, vol 354. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35834-0_5
DOI: https://doi.org/10.1007/978-3-642-35834-0_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35833-3
Online ISBN: 978-3-642-35834-0
eBook Packages: Computer Science, Computer Science (R0)