A Python library for FAIRer access and deposition to the Metabolomics Workbench Data Repository
- 1k Downloads
The Metabolomics Workbench Data Repository is a public repository of mass spectrometry and nuclear magnetic resonance data and metadata derived from a wide variety of metabolomics studies. The data and metadata for each study is deposited, stored, and accessed via files in the domain-specific ‘mwTab’ flat file format.
In order to improve the accessibility, reusability, and interoperability of the data and metadata stored in ‘mwTab’ formatted files, we implemented a Python library and package. This Python package, named ‘mwtab’, is a parser for the domain-specific ‘mwTab’ flat file format, which provides facilities for reading, accessing, and writing ‘mwTab’ formatted files. Furthermore, the package provides facilities to validate both the format and required metadata elements of a given ‘mwTab’ formatted file.
In order to develop the ‘mwtab’ package we used the official ‘mwTab’ format specification. We used Git version control along with Python unit-testing framework as well as continuous integration service to run those tests on multiple versions of Python. Package documentation was developed using sphinx documentation generator.
The ‘mwtab’ package provides both Python programmatic library interfaces and command-line interfaces for reading, writing, and validating ‘mwTab’ formatted files. Data and associated metadata are stored within Python dictionary- and list-based data structures, enabling straightforward, ‘pythonic’ access and manipulation of data and metadata. Also, the package provides facilities to convert ‘mwTab’ files into a JSON formatted equivalent, enabling easy reusability of the data by all modern programming languages that implement JSON parsers. The ‘mwtab’ package implements its metadata validation functionality based on a pre-defined JSON schema that can be easily specialized for specific types of metabolomics studies. The library also provides a command-line interface for interconversion between ‘mwTab’ and JSONized formats in raw text and a variety of compressed binary file formats.
The ‘mwtab’ package is an easy-to-use Python package that provides FAIRer utilization of the Metabolomics Workbench Data Repository. The source code is freely available on GitHub and via the Python Package Index. Documentation includes a ‘User Guide’, ‘Tutorial’, and ‘API Reference’. The GitHub repository also provides ‘mwtab’ package unit-tests via a continuous integration service.
KeywordsmwTab Metabolomics Workbench mwtab Python package Data validation FAIR
The Metabolomics Workbench Data Repository is a publicly available resource for metabolomics experimental data collected from mass spectrometry (MS) and nuclear magnetic resonance (NMR) analytical platforms and associated metadata describing sample and analytical details as well as experimental design (Sud et al. 2016). Study-specific experimental data and metadata can be accessed via metabolomics workbench in the form of ‘mwTab’ formatted files as well as through a representational state transfer (REST) interface. The repository currently makes available over 630 individual ‘mwTab’ files from MS- and NMR-based studies, each file having an associated study id (non-unique identifier) and analysis id (unique identifier). The metabolomics workbench provides an official data format specification (“mwTab format specification.”—Available at: http://www.metabolomicsworkbench.org/data/tutorials.php) for the ‘mwTab’ format, which consists of sequentially ordered blocks (sections) of text data. Some of the blocks consist of data represented by ‘single key to single value’ relationships that store single pieces of information. Other blocks consist of multiple ‘tab’-separated values via ‘single key to multiple values’ or ‘multiple keys to multiple values’ relationships that store multiple pieces of information in an organized manner analogous to a relational table.
2.1 Overview of the mwTab format
The ‘mwTab’ formatted files consists of multiple blocks of text data. Each new text block of the ‘mwTab’ file starts with the ‘#’. There are several types of formatting possible within text blocks: “single key to single value”-like pairs to represent single piece of information, e.g. ‘VERSION’ is the key and ‘1’ is the value (see Fig. 1a). In cases where value is long, it gets formatted as a multiline string with repeated use of the same key, e.g. ‘PR:PROJECT_SUMMARY’ is the key and associated multiline project summary is the value (see Fig. 1a). There is also a ‘SUBJECT_SAMPLE_FACTORS’ block that contains header specifying column names and corresponding ‘tab’-separated rows of data (see Fig. 1b). Results from MS- and NMR-based experiments are deposited as large matrices of values with corresponding units for each of the assignable metabolites (see Fig. 1c, d respectively).
The full ‘mwTab’ format specification is available on official Metabolomics Workbench Data Repository (mwTab format specification. [Online]. Available: http://www.metabolomicsworkbench.org/data/tutorials.php).
2.2 Package implementation
The ‘mwtab’ Python package consists of several modules: ‘mwtab.py’, ‘tokenizer.py’, ‘fileio.py’, ‘converter.py’, ‘mwschema.py’, ‘validator.py’ and ‘cli.py’ (see Fig. 2). The ‘mwtab.py’ module (Fig. 2b) implements the ‘MWTabFile’ class which can construct itself into a Python nested dictionary- and list-based data structures representation from a provided file in ‘mwTab’ format. The ‘MWTabFile’ class is the main class that provides the interfaces for data and metadata access and manipulation. The dictionary-based data structures provide key-based bracket accessors (i.e., ‘’) and the list-based data structures provide index-based bracket accessors (i.e., 0, 1, 2, etc.). This makes the ‘mwtab’ package a useful general-purpose library with intuitive (‘pythonic’) data access and manipulation functionality that can be integrated into higher level Python software used for downstream data analysis. The ‘tokenizer.py’ module is responsible for tokenization (lexical analysis) of the text in ‘mwTab’ format, i.e. it splits the raw text into tokens and passes them to the ‘mwtab.py’ module. Next, the ‘mwtab.py’ analyzes the tokens (syntactic analysis) and reformats them into a ‘MWTabFile’ instance with Python dictionary- and list-based instances (objects). The ‘fileio.py’ module (see Fig. 2c) is responsible for input/output operations with files from different sources. Specifically, it provides the ‘GenericFilePath’ class and memory-efficient generator (function) that can return (yield) ‘MWTabFile’ instances from different sources, e.g. single file, directory of files, archive of files on a local machine, URL address of the ‘mwTab’ formatted file, etc. Function (method) call diagram (see Fig. S1) shows how three modules ‘mwtab.py’, ‘tokenizer.py’, and ‘fileio.py’ work together during the ‘MWTabFile’ instance construction: the ‘fileio.read_files()’ method uses ‘fileio.GenericFilePath’ in order to determine what sources the ‘mwTab’ formatted file is coming from and then calls appropriate methods on the ‘mwtab.MWTabFile’ class in order to construct itself, i.e. top-level ‘mwtab.MWTabFile._build_mwtabfile’ and then ‘mwtab.MWTabFile._build_block’ in order to build each individual text block of the ‘mwTab’ formatted file into a usable ‘MWTabFile’ instance.
Since the ‘MWTabFile’ class is constructed using Python’s standard dictionary- and list-based data structures, the entire ‘MWTabFile’ instance can easily be serialized into an equivalent JSON representation. The ‘converter.py’ module (see Fig. 2d) is responsible for conversion between the JSONized representation of the ‘mwTab’ format and the regular ‘mwTab’ format. The ‘mwschema.py’ and ‘validator.py’ are two package modules designed to perform the validation of the ‘mwTab’ formatted files. The ‘mwschema.py’ provides the current schema definitions for the ‘mwTab’ format and the ‘validator.py’ module provide functions to validate individual text blocks as well as the entire ‘mwTab’ formatted file using those schema definitions. The schema definitions are implemented using the ‘schema’ Python library (“schema—validation just got Pythonic.”—Available at: https://github.com/keleshev/schema/). The ‘cli.py’ module provides a simple command-line interface that can be used to convert ‘mwTab’ formatted files to their JSON representation and back as well as validate files on the command-line. The command-line interface is implemented with the help of the ‘docopt’ Python library (“docopt—creates beautiful command-line interfaces.”—Available at: https://github.com/docopt/docopt).
The ‘__init__.py’ and ‘__main__.py’ (see Fig. 2a) are special Python specific modules (‘__init__.py’ marks ‘mwtab’ top-level directory as a Python package and ‘__main__.py’ specifies the top-level environment for the ‘mwtab’ package).
2.3 Evaluation data
To evaluate the ‘mwtab’ package functionality and performance, we used all ‘mwTab’ formatted files available from Metabolomics Workbench Data Repository on August 30, 2017. Due to the fact that there was no easy way to download all ‘mwTab’ formatted files from the repository, we created a specialized Python script that downloads every single ‘mwTab’ formatted file using ‘STUDY_ID’.
2.4 Evaluation of mwtab package
The ‘mwtab’ Python package is available within a version-controlled GitHub repository under a Berkeley Software Distribution 3-clause clear open source license (BSD 3-Clause Clear License). As a part of our development process, we implemented several unit tests for each module of the ‘mwtab’ package (see Fig. 2a) in order to validate functionality using the popular Python unit testing framework ‘pytest’ (“pytest unit testing framework.”—Available at: https://docs.pytest.org). We used the continuous integration service ‘Travis CI’ to build and test our ‘mwtab’ package against Python versions 2.7, and 3.4+, build information is available under the ‘mwtab’ package GitHub repo. In addition, we generated code test coverage reports that are also available under the GitHub repo (currently, tests cover 90% of the code base).
2.5 The mwtab package documentation
Each function, class, and class method was documented using sphinx python documentation style, which allowed us to generate package API documentation directly from the source code. In addition, we wrote ‘User Guide’, ‘Tutorial’, and ‘API Reference’ documentation which is available under http://mwtab.readthedocs.io.
3.1 JSON representation of the mwTab format
Figure S2 compares different text blocks in ‘mwTab’ format with their corresponding JSONized representation: text blocks containing “single key-single value” and multiline summary text blocks (Fig. S2a, b), specially formatted subject sample factors text blocks (Fig. S2c, d), text blocks containing MS experimental data on metabolites (Fig. S2e, f), and text block containing NMR experimental data on metabolites (Fig. S2g, h).
In comparison to the standard ‘mwTab’ format, the main advantages of the JSON representation are: (i) it enables easy access to data from other programming languages without implementing specific ‘mwTab’ parser for that language; and (ii) it enables faster reading/processing of the data stored in ‘mwTab’ formatted files due to highly optimized and efficient JSON parsers. Figures S5 and S6 show code examples for data access from JSONized ‘mwTab’ files using R with ‘jsonlite’ R library (Ooms 2014) and C++ with ‘JSON for Modern C++’ library (“JSON for Modern C++.”—Available at: https://github.com/nlohmann/json), respectively.
3.2 The mwtab package interface
The ‘mwtab’ package can be used in several ways: (i) as a library within Python scripts for accessing and manipulating data and metadata stored in ‘mwTab’ formatted files; and (ii) as a command-line tool to convert between the ‘mwTab’ format and its equivalent JSONized representation as well as for data validation using predefined schema definitions for each of the text blocks and consistency checking.
To use ‘mwtab’ package as a library within Python scripts, first it is necessary to import it within a Python program or an interactive interpreter interface. Next, the ‘MWTabFile’ instance(s) can be created using the generator function ‘read_files’. This generator function instantiates ‘MWTabFile’ object(s) from many different file sources: a local file, a URL address of a file, ‘ANALYSIS ID’ of a file, directory and/or archive of multiple files. The generator function can be processed in several ways: for example, to process files one at a time by calling the Python ‘next()’ built-in function, to process every file in a for-loop, or to convert the generator into list of ‘MWTabFile’ instances. Once the ‘MWTabFile’ object is created, it can be utilized like any Python built-in dictionary- and list-based data structures, the data can be accessed and/or manipulated using keys (in case of dictionary) or indexes (in case of list). Table 1 summarizes common patterns for using ‘mwtab’ as a library, but more detailed examples are available under the ‘mwtab’ package tutorial.
Common patterns for using the ‘mwtab’ as a library
mwt_generator = mwtab.read_files(‘path_to_file’)
mwtfile = next(mw_generator)
mwtfile[‘PROJECT’][‘PROJECT_SUMMARY’] = ‘new project summary’
The ‘mwtab’ package also provides a simple command-line interface that can be used to validate and convert files from ‘mwTab’ format to its JSON representation and back. Figure S3 shows the current command-line interface.
Table 2 summarizes common patterns for using ‘mwtab’ as a command- line tool, but the ‘mwtab’ package tutorial documentation provides more detailed examples.
Common patterns for using the ‘mwtab’ as a command-line tool
Convert between ‘mwTab’ and its JSON representation
$ python3 -m mwtab convert AN000001.txt AN000001.json \
$ python3 -m mwtab convert AN000001.json AN000001.txt \
$ python3 -m mwtab validate AN000001.txt
$ python3 -m mwtab validate AN000001.json
3.3 Data validation functionality of the mwtab package
The ‘mwtab’ Python package provides two modules designed to perform data validation: ‘validator.py’ and ‘mwschema.py’. Once the ‘mwTab’ formatted file is parsed into a ‘MWTabFile’ instance (object), the data can be validated against a predefined schema. The ‘mwschema.py’ module provides schema definitions based on the official ‘mwTab’ format specification for each text block of the ‘mwTab’ file. For example, Figure S4a shows an example for ‘#PROJECT’ text block from the ‘mwTab’ specification: it specifies that the ‘PROJECT_TITLE’, ‘PROJECT_SUMMARY’, ‘INSTITUTE’, ‘LAST_NAME’, ‘FIRST_NAME’, ‘ADDRESS’, ‘EMAIL’, and ‘PHONE’ fields are required, and ‘PROJECT_TYPE’, ‘DEPARTMENT’, ‘LABORATORY’, ‘FUNDING_SOURCE’, ‘PROJECT_COMMENTS’, ‘PUBLICATIONS’, ‘CONTRIBUTIONS’, and ‘DOI’ fields are optional. If a text block is missing a required filed, a descriptive error message will be raised during the validation process. Figure S4b shows an example of an error message that is generated due to missing the ‘PROJECT_TITLE’ required field. The type of value that is expected by the schema definition can also be provided. In addition to Python standard built-in types (e.g., ‘str’, ‘int’, ‘float’, etc.), regular expressions can be provided where appropriate to validate data, e.g. a regular expression can be passed as a value for ‘EMAIL’ and ‘PHONE’ fields to verify that they correspond to valid e-mail and phone formats. All schema definitions are provided within the ‘mwschema.py’ module in the GitHub repository and can be easily modified to strengthen the data validation functionality.
3.4 The mwtab package performance
In order to test the performance of our ‘mwtab’ Python package, we downloaded every ‘mwTab’ formatted file and created simple Python script that imports the library, instantiates the ‘MWTabFile’ objects from directory of files one file at a time and reports how much time it took. Table 3 shows that that it took under 30 s to process files in both ‘mwTab’ and its JSON representation. Table 3 also shows that the JSON representation is more verbose and therefore occupies more disk space; however, its main benefit is in providing easy access to the data and metadata for other programming languages with JSON parsers but no ‘mwTab’ parser.
The ‘mwtab’ package performance against ‘mwTab’ and its JSON representation formats
JSON representation of ‘mwTab’
Number of files
Total size of files (MB)
In order to reduce disk space usage, the entire directory can be converted and compressed into a single archive file. The ‘mwtab’ package provides facilities to read directly from zip, tar.gz, and tar.bz2 archives without requirement to decompress those files manually.
3.5 Evaluating format and metadata in mwTab formatted files
During the development of ‘mwtab’ Python package, we were able to identify several inconsistencies between the official specification and the actual ‘mwTab’ formatted files provided by the Metabolomics Workbench at that time. For example, the ‘mwTab’ format specification says that each file has to start with ‘#METABOLOMICS WORKBENCH’ header string, but multiple files had single or multiple empty lines at the beginning of a file. Also, we were able to identify optional fields that were present within actual ‘mwTab’ formatted files but missing in the official specification. To demonstrate the type of exploratory analysis that can be performed using ‘mwtab’ package, we analyzed every ‘mwTab’ file in order to verify that mandatory data units fields are present and provide an actual units value. Figure 3 shows that 243 entries had missing value (the first bar on Fig. 3) for the required units field and some of the bars represent the same type units but split due to slight difference in their names (e.g. ‘Peak height’ vs. ‘peak height’, ‘Peak Intensity’ vs. ‘peak intensity’).
Additionally, we found that 302 were missing ‘#END’ statement that signals the end of the file, seven files had issues in their ‘#SUBJECT_SAMPLE_FACTORS’ text block, 20 files had issues in their ‘single key-single value’ pairs, and four files had some other minor formatting issues.
All of the ‘mwTab’ file format issues discovered with the help of ‘mwtab’ Python package were reported to Metabolomics Workbench and were promptly fixed within 1 week. The original files can be downloaded from a figshare repository along with the validation reports generated for each file. Current cleaned up files are available on the Metabolomics Workbench Data Repository.
The ‘mwtab’ package is a useful Python library designed to provide facilities for parsing, accessing, and manipulating data stored in ‘mwTab’ and its JSONized equivalent representation. The JSONized representation provides several advantages to standard ‘mwTab’ format including improved reading speeds and enabling easy data access for other programming languages implementing JSON parser. Using internal Python and JSON data structures, ‘mwTab’ files can be validated with respect to consistency and completeness using specified schema definitions based on the official ‘mwTab’ format specification. The library has already proven useful in improving the quality of all ‘mwTab’ formatted files provided by the Metabolomics Workbench Data Repository, with respect to the official ‘mwTab’ format specification. The ‘mwtab’ package also provides an easy-to-use command-line interface designed to perform file conversion and file validation tasks. The ‘mwtab’ package provides extensive documentation, which includes a ‘User Guide’, ‘Tutorial’, and ‘API reference’ generated automatically from the source code and available on http://mwtab.readthedocs.io. The ‘mwtab’ package also includes automated unit-tests that perform testing of every module of the package as well as generates test coverage reports. We believe that the ‘mwtab’ package will help to improve metadata quality and data reusability of metabolomics data from Metabolomics Workbench Data Repository by downstream investigators through providing Python interfaces for data access and manipulation and through providing a JSONized representation of the ‘mwTab’ format for use in other programming languages.
The authors wish to thank Eoin Fahy, Dawn Cotter, and other Metabolomics Workbench staff for providing the official ‘mwTab’ format files specification as well as for the opportunity to provide feedback on ‘mwTab’ files via the MW usability meeting and helpful discussions.
This work was supported in part by the National Science Foundation grant NSF 1252893 (Hunter N.B. Moseley) and the National Institutes of Health grant NIH 1U24DK097215-01A1 (Richard M. Higashi, Teresa W.-M. Fan, Andrew N. Lane, and Hunter N.B. Moseley).
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.
Research involving human participants and/or animals
This article does not contain any studies with human participants or animals performed by any of the authors.
This article does not contain any studies with human participants performed by any of the authors.
- Ooms, J. (2014). The jsonlite package: A practical and consistent mapping between JSON data and R objects. arXiv:1403.2805.
- Python Software Foundation. (2013). Python language reference, version 2.7. Wilmington: Python Software Foundation.Google Scholar
- Sud, M., Fahy, E., Cotter, D., Azam, K., Vadivelu, I., Burant, C., Edison, A., Fiehn, O., Higashi, R., Nair, K. S., Sumner, S., & Subramaniam, S. (2016). Metabolomics workbench: An international repository for metabolomics data and metadata, metabolite standards, protocols, tutorials and training, and analysis tools. Nucleic Acids Research, 44(D1), D463–D470.CrossRefPubMedGoogle Scholar
- Van Rossum, G., & Drake, F. L. (2010). The Python library reference (pp. 1–1144). Wilmington: Python Software Foundation.Google Scholar
- Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., Gonzalez-Beltran, A., Gray, A. J. G., Groth, P., Goble, C., Grethe, J. S., Heringa, J.,’t Hoen, P. a., Hooft, R., Kuhn, T., Kok, R., Kok, J., Lusher, S. J., Martone, M. E., Mons, A., Packer, A. L., Persson, B., Rocca-Serra, P., Roos, M., van Schaik, R., Sansone, S.-A., Schultes, E., Sengstag, T., Slater, T., Strawn, G., Swertz, M. A., Thompson, M., van der Lei J., van Mulligen, E., Velterop, J., Waagmeester, A., Wittenburg, P., Wolstencroft, K., Zhao, J., Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018.CrossRefPubMedPubMedCentralGoogle Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.