Abstract
In the current scientific publishing landscape, there is a need for an authoring workflow that easily integrates data and code into manuscripts and that enables the data and code to be published in reusable form. Automated embedding of data and code into published output will improve both communication and data archiving. In this work, we demonstrate a proof of concept for a workflow based on org-mode that provides this authoring capability and workflow integration. We illustrate the concept with a series of examples of potential uses. First, we use data on citation counts to compute the h-index of an author, and show two code examples for calculating the h-index. The source for each example is automatically embedded in the PDF during the export of the document. Second, we demonstrate how data can be embedded in image files, which are themselves embedded in the document. Finally, we show that metadata about the embedded files can be automatically included in the exported PDF and accessed by computer programs. In our customized export, we embedded metadata about the attached files in an Info field of the PDF. A computer program could parse this field to get a list of embedded files and carry out analyses on them. Authoring tools such as Emacs + org-mode can greatly facilitate the integration of data and code into technical writing. These tools can also automate the embedding of data into document formats intended for consumption.
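As a concrete illustration of the h-index calculation mentioned above, the following is a minimal Python sketch based on Hirsch's definition (the largest h such that h papers each have at least h citations). It is not one of the two code examples from the paper, and the citation counts are hypothetical values chosen for illustration.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts, not data from the paper.
example_counts = [10, 8, 5, 4, 3, 0]
print(h_index(example_counts))  # prints 4
```

Four papers have at least four citations each, but there are not five papers with at least five, so the index is 4.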
References
Dominik, C.: The Org Mode 8 Reference Manual: Organize Your Life with GNU Emacs. Samurai Media Limited, Hong Kong (2014)
Elsevier Content Innovations: Content innovation. http://www.elsevier.com/books-and-journals/content-innovation. Accessed 12 June 2015
Hirsch, J.E.: An index to quantify an individual’s scientific research output. Proc. Natl. Acad. Sci. 102(46), 16569–16572 (2005). doi:10.1073/pnas.0507655102
Jupyter: Project Jupyter. The Jupyter Project provides a web-browser based computational notebook with a range of computational backends including Python, Julia, R and others. http://jupyter.org/. Accessed 26 June 2015
Kitchin, J.R.: Data sharing in surface science. Surf. Sci. (in press) (2015a). doi:10.1016/j.susc.2015.05.007, http://www.sciencedirect.com/science/article/pii/S0039602815001326
Kitchin, J.R.: Examples of effective data sharing in scientific publishing. ACS Catal. 5(6), 3894–3899 (2015b). doi:10.1021/acscatal.5b00538
Nature: Manuscript formatting guide. http://www.nature.com/nature/authors/gta/index.html#a5.11. Accessed 12 June 2015
Pakin, S.: The attachfile package, v1.5b. http://www.ctan.org/tex-archive/macros/latex/contrib/attachfile. Accessed 26 June 2015
PDF Labs: PDFtk the PDF toolkit. https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/. Accessed 26 June 2015
Pérez, F., Granger, B.E.: IPython: a system for interactive scientific computing. Comput. Sci. Eng. 9(3), 21–29 (2007). doi:10.1109/MCSE.2007.53, http://ipython.org
Schulte, E., Davison, D.: Active documents with org-mode. Comput. Sci. Eng. 13(3), 66–73 (2011). doi:10.1109/MCSE.2011.41
Schulte, E., Davison, D., Dye, T., Dominik, C.: A multi-language computing environment for literate programming and reproducible research. J. Stat. Softw. 46(3), 1–24 (2012). http://www.jstatsoft.org/v46/i03
Whitmire, A., Briney, K., Nurnberger, A., Henderson, M., Atwood, T., Janz, M., Kozlowski, W., Lake, S., Vandegrift, M., Zilinski, L.: A table summarizing the federal public access policies resulting from the US Office of Science and Technology Policy memorandum of February 2013. figshare (2015). doi:10.6084/m9.figshare.1372041
Zilinski, L., Scherer, D., Bullock, D., Horton, D., Matthews, C.: Evolution of data creation, management, publication, and curation in the research process. Transp. Res. Rec. J. Transp. Res. Board 2414, 9–19 (2014). doi:10.3141/2414-02
Appendix
1.1 Embedding data in images
We use the steganopy (https://pypi.python.org/pypi/steganopy/0.0.1) Python package to illustrate the use of steganography to put data in an image. The point is not that steganography is an ideal way to do this, but that our general approach is flexible. The embedded data could be XMP, or other types of metadata.
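The idea can be sketched without any image library. The example below is a minimal least-significant-bit scheme written in plain Python. It is not the steganopy API (whose function names are not shown in this section), and the `cover` buffer is a hypothetical stand-in for the raw 8-bit pixel values of an image.

```python
def embed_bytes(pixels, payload):
    """Hide payload in the least significant bit of successive pixel bytes.

    A 4-byte big-endian length header is written first so the payload can
    be recovered without knowing its size in advance.
    """
    message = len(payload).to_bytes(4, "big") + payload
    # Flatten the message into bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("image too small for payload")
    stego = bytearray(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | bit  # overwrite only the lowest bit
    return stego


def extract_bytes(pixels):
    """Recover a payload written by embed_bytes."""
    def read(n_bytes, offset_bits):
        out = bytearray()
        for b in range(n_bytes):
            value = 0
            for k in range(8):
                value = (value << 1) | (pixels[offset_bits + 8 * b + k] & 1)
            out.append(value)
        return bytes(out)

    length = int.from_bytes(read(4, 0), "big")
    return read(length, 32)


# Hypothetical stand-in for raw pixel data (1024 bytes) and a small CSV payload.
cover = bytearray(range(256)) * 4
data = b"x,y\n1,2\n3,4\n"
stego = embed_bytes(cover, data)
assert extract_bytes(stego) == data
```

Each pixel byte changes by at most one intensity level, which is why the modified image is visually indistinguishable from the original. The scheme only survives lossless image formats; lossy compression would destroy the low-order bits.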
1.2 The custom export code
Here we define a custom table exporter. We use the regular table export mechanism, but also save the contents of the table as a CSV file. We define exports for two backends: LaTeX and HTML. For LaTeX, we use the attachfile [8] package to embed the data file in the PDF. For HTML, we insert both a link to the data file and a data URI link into the HTML output. We store the filename of each generated table in a global variable named *embedded-files* so we can create a new Info metadata entry in the exported PDF.
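The behavior described above can be sketched in Python as follows. This is an illustration of the idea only, not the exporter used for the manuscript: the function names, the `embedded_files` list (standing in for *embedded-files*), and the example table are all hypothetical, and the rendering of the table itself is omitted so that only the attachment markup is shown. The one external command relied on is `\attachfile{...}` from the attachfile package [8].

```python
import base64
import csv
import io
import tempfile
from pathlib import Path

# Hypothetical stand-in for the *embedded-files* variable in the text.
embedded_files = []


def table_to_csv_text(rows):
    """Serialize a list of rows to CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def export_table(rows, name, backend, outdir="."):
    """Save rows as <name>.csv and return markup that exposes the file."""
    csv_text = table_to_csv_text(rows)
    path = Path(outdir) / f"{name}.csv"
    path.write_text(csv_text, encoding="utf-8")
    embedded_files.append(path.name)
    if backend == "latex":
        # Embeds the file in the PDF when the attachfile package is loaded.
        return f"\\attachfile{{{path.name}}}"
    if backend == "html":
        encoded = base64.b64encode(csv_text.encode("utf-8")).decode("ascii")
        return (f'<a href="{path.name}">{path.name}</a> '
                f'<a href="data:text/csv;base64,{encoded}">data uri</a>')
    raise ValueError(f"unsupported backend: {backend}")


# Usage with a hypothetical table, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    rows = [["year", "citations"], [2014, 12]]
    latex_markup = export_table(rows, "table-1", "latex", tmp)
    html_markup = export_table(rows, "table-1", "html", tmp)
```

The data URI variant makes the HTML file self-contained, since the CSV travels inside the page, while the plain link keeps the page small but requires the CSV file to be distributed alongside it.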
Next, we define an exporter for source blocks. We also write each source block to a file and put a link to it in the exported files. We store the filename of each generated source file in a global variable named *embedded-files* so we can create a new Info metadata entry in the exported PDF.
Here, we define derived back ends for HTML and LaTeX export. These are identical to the standard export back ends, except for the modified behavior of the table and src-block elements.
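Conceptually, a derived back end is the parent's table of element handlers with a few entries replaced. Org's export framework provides `org-export-define-derived-backend` for this purpose, although this section does not show whether the authors' code uses exactly that call. The Python sketch below only mimics the idea; every name in it is hypothetical.

```python
def render_paragraph(element):
    return element["text"]


def render_table(element):
    return "<table>...</table>"


def render_src_block(element):
    return f"<pre>{element['code']}</pre>"


# Hypothetical parent back end: one handler per element type.
BASE_HTML = {
    "paragraph": render_paragraph,
    "table": render_table,
    "src-block": render_src_block,
}


def derive_backend(parent, overrides):
    """Return a new handler table: the parent's handlers with some replaced."""
    derived = dict(parent)  # copy, so the parent stays untouched
    derived.update(overrides)
    return derived


def table_with_link(element):
    """Standard table output followed by a link to the saved data file."""
    return render_table(element) + f' <a href="{element["name"]}.csv">data</a>'


custom_html = derive_backend(BASE_HTML, {"table": table_with_link})
```

Only the overridden element types behave differently; `custom_html["paragraph"]` is still the parent's handler, which matches the statement that the derived back ends are otherwise identical to the standard ones.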
Finally, here we run the command to generate the exported HTML manuscript.
In addition, here we generate the LaTeX manuscript, and then convert it to PDF. After the PDF is created, we insert the new InfoField into the PDF.
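The Info entry can be written and read back with PDFtk, which exchanges document metadata as plain text records. The sketch below builds such a record and parses PDFtk-style output. It does not call PDFtk itself, and it rests on assumptions: the key name `EmbeddedFiles`, the comma-separated value, and the file names are all hypothetical rather than taken from the authors' code.

```python
def build_info_record(key, filenames):
    """Build a metadata record in the text format PDFtk reads and writes.

    The resulting text could be saved to info.txt and applied with
        pdftk in.pdf update_info info.txt output out.pdf
    """
    value = ", ".join(filenames)
    return f"InfoBegin\nInfoKey: {key}\nInfoValue: {value}\n"


def parse_info(dump_text):
    """Parse InfoKey/InfoValue pairs from `pdftk out.pdf dump_data` output."""
    info = {}
    key = None
    for line in dump_text.splitlines():
        if line.startswith("InfoKey: "):
            key = line[len("InfoKey: "):]
        elif line.startswith("InfoValue: ") and key is not None:
            info[key] = line[len("InfoValue: "):]
            key = None
    return info


def embedded_file_list(dump_text, key="EmbeddedFiles"):
    """Return the file names stored under the (hypothetical) Info key."""
    value = parse_info(dump_text).get(key, "")
    return [name.strip() for name in value.split(",") if name.strip()]


# Hypothetical file names and a hand-written sample of dump_data output.
record = build_info_record("EmbeddedFiles", ["table-1.csv", "h-index.py"])
sample_dump = (
    "InfoBegin\nInfoKey: Creator\nInfoValue: Emacs\n"
    + record
    + "NumberOfPages: 6\n"
)
print(embedded_file_list(sample_dump))
```

A comma-separated value is the simplest encoding but breaks for file names that contain commas; a more robust variant could store one Info key per file.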
Cite this article
Kitchin, J.R., Van Gulick, A.E. & Zilinski, L.D. Automating data sharing through authoring tools. Int J Digit Libr 18, 93–98 (2017). https://doi.org/10.1007/s00799-016-0173-7
DOI: https://doi.org/10.1007/s00799-016-0173-7