Mapping Physical Formats to Logical Models to Extract Data and Metadata: The Defuddle Parsing Engine

  • Tara D. Talbott
  • Karen L. Schuchardt
  • Eric G. Stephan
  • James D. Myers
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4145)


Scientists, motivated by the desire for systems-level understanding of phenomena, increasingly need to share their results across multiple disciplines. Accomplishing this requires data to be annotated, contextualized, readily searchable, and translatable into other formats. While these requirements can be addressed by custom programming or obviated by community standardization, neither approach has ‘solved’ the problem. In this paper, we describe a complementary approach – a general capability for articulating the format of arbitrary textual and binary data using a logical data model, expressed in XML Schema, which can be used to provide annotation and context, extract metadata, and enable translation. This work is based on the draft specification for the Data Format Description Language and our open source “Defuddle” parser. We present an overview of the specification, detail the design of Defuddle, and discuss the benefits and challenges of this general approach to enabling discovery, sharing, and interpretation of diverse data sets.
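
The idea of describing a physical format declaratively and mapping it onto a logical model can be illustrated with a minimal sketch. It is not DFDL syntax and not the Defuddle API: the record layout, the element names (`stationId`, `temperature`, `pressure`), and the `parse_record` helper are all hypothetical, chosen only to show a binary record being turned into named XML elements that can then be queried.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical format description: (element name, struct type code) in file order.
# Illustrative only; this is neither DFDL syntax nor the Defuddle API.
RECORD_LAYOUT = [
    ("stationId", "H"),    # unsigned 16-bit integer
    ("temperature", "f"),  # 32-bit float
    ("pressure", "f"),     # 32-bit float
]


def parse_record(data: bytes, layout=RECORD_LAYOUT, byte_order=">") -> ET.Element:
    """Map one fixed-width binary record onto named XML elements."""
    # Build a struct format string from the declarative layout.
    fmt = byte_order + "".join(code for _, code in layout)
    values = struct.unpack(fmt, data[:struct.calcsize(fmt)])
    record = ET.Element("record")
    for (name, _), value in zip(layout, values):
        ET.SubElement(record, name).text = str(value)
    return record


# Build a sample big-endian record, then parse it back into the logical model.
sample = struct.pack(">Hff", 42, 21.5, 1013.25)
record = parse_record(sample)
xml_text = ET.tostring(record, encoding="unicode")

# Once the data is in XML form, a simple path query can extract metadata.
station_id = record.findtext("stationId")
```

The layout is kept separate from the parsing logic, so supporting a different record format under these assumptions means editing the description rather than the code, which is the general benefit the abstract attributes to a format description language.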


Keywords: Pacific Northwest National Laboratory; XPath Query; Persistent Identifier; Extract Metadata; Logical Data Model


Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Tara D. Talbott (1)
  • Karen L. Schuchardt (1)
  • Eric G. Stephan (1)
  • James D. Myers (2)
  1. Pacific Northwest National Laboratory, Richland, USA
  2. National Center for Supercomputing Applications, Urbana, USA
