IPAW 2006: Provenance and Annotation of Data, pp. 73-81
Mapping Physical Formats to Logical Models to Extract Data and Metadata: The Defuddle Parsing Engine
Abstract
Scientists, motivated by the desire for systems-level understanding of phenomena, increasingly need to share their results across multiple disciplines. Accomplishing this requires data to be annotated, contextualized, readily searchable, and translatable into other formats. While these requirements can be addressed by custom programming or obviated by community standardization, neither approach has ‘solved’ the problem. In this paper, we describe a complementary approach: a general capability for articulating the format of arbitrary textual and binary data using a logical data model, expressed in XML Schema, which can be used to provide annotation and context, extract metadata, and enable translation. This work is based on the draft specification for the Data Format Description Language and our open source “Defuddle” parser. We present an overview of the specification, detail the design of Defuddle, and discuss the benefits and challenges of this general approach to enabling discovery, sharing, and interpretation of diverse data sets.
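The general idea of describing a physical format declaratively and mapping it onto a logical model can be illustrated with a minimal Python sketch. This is not DFDL syntax and not Defuddle's implementation: the record layout, the field names (`stationId`, `temperature`, `pressure`), the byte order, and the `parse_record` helper are all hypothetical, chosen only to show a format description driving the extraction of named values from binary data into an XML tree.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical format description: (element name, struct format code) per field,
# in the order the fields appear in the binary record. This is NOT DFDL syntax.
RECORD_LAYOUT = [
    ("stationId", "H"),    # unsigned 16-bit integer
    ("temperature", "f"),  # 32-bit float
    ("pressure", "f"),     # 32-bit float
]
BYTE_ORDER = ">"  # big-endian, an assumption made for this example


def parse_record(data: bytes, root_name: str = "observation") -> ET.Element:
    """Map one binary record onto a logical XML element tree."""
    fmt = BYTE_ORDER + "".join(code for _, code in RECORD_LAYOUT)
    values = struct.unpack(fmt, data[:struct.calcsize(fmt)])
    root = ET.Element(root_name)
    # Each decoded value becomes a named child element of the logical model.
    for (name, _), value in zip(RECORD_LAYOUT, values):
        ET.SubElement(root, name).text = str(value)
    return root


# Build a sample record, parse it, and query the result by element name.
raw = struct.pack(">Hff", 42, 21.5, 1013.25)
doc = parse_record(raw)
station_id = doc.findtext("stationId")
```

Because the layout is data rather than code, supporting a different record format in this sketch means changing `RECORD_LAYOUT` instead of writing a new parser, and the resulting tree can be searched with path-style queries such as `doc.findtext("stationId")`.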
Keywords
Pacific Northwest National Laboratory; XPath Query; Persistent Identifier; Extract Metadata; Logical Data Model