The remainder of this review provides readers with an experiential learning opportunity (Kolb 1984) using an example interactive metabolomics data analysis workflow deployed using a combination of Python, Jupyter Notebooks, and Binder. We assume that the initial stage of data-processing for the computational workflow (converting raw instrument files into an annotated data table) has already been completed, and that a deconvolved, but not necessarily annotated, data table has been created and checked for errors. These assumptions are made to make the learning objectives presented manageable, not as a directive for obfuscating the complete metabolomics workflow. It is possible, and encouraged, to include all data processing steps in interactive notebooks. The tutorial takes the reader through the process of using interactive notebooks to produce a shareable, reproducible data analysis workflow that connects the study design to reported biological conclusions in an interactive document, using data from two previously published metabolomics studies. This workflow includes a discrete set of interactive and interlinked procedures: data cleaning, univariate statistics, multivariate machine learning, feature selection, and data visualisation (Fig. 2).
The following five tutorials have been pedagogically designed to lead the reader through increasing levels of cognitive complexity, according to Bloom’s revised taxonomy (Anderson et al. 2001):
1. Launch and walk through a published Jupyter notebook using Binder in the cloud to duplicate a set of results.
2. Interact with and edit the content of a published Jupyter notebook using Binder in the cloud to understand workflow methods.
3. Install Python and use published Jupyter Notebooks on the researcher’s computer to apply and experiment with workflow methods locally.
4. Create a metabolomics Jupyter notebook on a local computer.
5. Deploy the Jupyter notebook from Tutorial 4 on Binder in the cloud via GitHub.
Overview of Jupyter/GitHub/Binder
Before beginning the tutorial, we review some fundamental concepts behind Jupyter Notebooks, GitHub, and Binder, as understanding these can aid successful independent execution of this open-science approach (Fig. 3). All code embedded in each of the example notebooks is written in the Python programming language and is based upon extensions of popular open source packages with high levels of community uptake and support. These include: Numpy for matrix-based calculations (van der Walt et al. 2011); Pandas for high level data table manipulation (McKinney 2017); Scikit-learn for machine learning (Pedregosa et al. 2011); and Matplotlib (Hunter 2007), Bokeh (Bokeh Development Team 2018), Seaborn (Waskom et al. 2018), and BeakerX (Beaker X Development Team 2018) for data visualisation. Additionally, we deploy a simple package called ‘cimcb-lite’, developed by the authors for this publication, that integrates the functionality of the above packages into a set of basic methods specific to metabolomics. A tutorial on the Python programming language itself is beyond the scope of this publication, but we hope that the code presented is sufficiently well-documented in each notebook to be understood. Many excellent publications can be consulted for an in-depth introduction to using Python for data science (Jones 2013; Ramalho 2015; The Carpentries 2019; VanderPlas 2016).
Digital object identifiers (DOIs) are widely used to identify academic and government information in the form of journal articles, research reports and data sets. It is also possible to assign a DOI to open access software. Specifically, researchers can make work shared on GitHub citable by archiving it with a data archiving tool such as Zenodo (www.zenodo.org) (Sicilia et al. 2017). A detailed tutorial is available (Open Science MOOC 2018). This archiving tool will ‘fix’ in time a given repository (e.g. Jupyter notebook and metadata), so that it can be associated with a particular static publication, while allowing the programmer to further develop the notebook on GitHub. The tutorials in this paper are archived under the DOI https://doi.org/10.5281/zenodo.3362624.
Jupyter Notebook (jupyter.org) is a powerful, open-source, browser-based tool for interactive development and presentation of data science projects. Each notebook consists of a collection of executable cells, and each cell contains either text formatted using the Markdown language (Gruber 2004) or executable code (usually Python or R). When a ‘code cell’ is executed any graphical or text output (numerical results, figures or tables) is presented within the document immediately below the cell. Figure 4 shows an example of a notebook after execution. A popular way to get started with Jupyter Notebooks is to install the Anaconda distribution (anaconda.com), for which graphical installers are available on Windows, macOS and Linux operating systems (anaconda.com/distribution/). After installation a local Jupyter server can be launched using the Anaconda-Navigator application. To run a specific local Jupyter notebook with Anaconda-Navigator the user can navigate to the appropriate local folder using the browser-based interface, and click on the desired notebook file (which can be identified by the .ipynb suffix).
GitHub (github.com) is a cloud-based web service that helps programmers store, manage, and share their code (and associated data files), as well as track and control changes to their code (version control). It is free to sign up and host a public code repository, which makes GitHub especially popular with open-source projects and a good choice for distributing Jupyter Notebooks, project-specific code and documentation. Jupyter Notebooks stored publicly on GitHub can be downloaded and run on a local machine using Anaconda or linked to a cloud-based platform. To complete all the steps of this tutorial a (free) GitHub account is required. An account at GitHub may be created by clicking “sign up” on the GitHub home page (github.com) and following the instructions.
Binder (mybinder.org) is an open source web service that allows users to deploy a GitHub repository comprising a collection of Jupyter Notebooks (with configuration files that describe the required computing environment) as a temporary cloud-based virtual machine. The Binder deployment is accessible by web browser and includes the programming language and all necessary packages and data. As with all publicly-accessible cloud storage care must be taken if data are sensitive or private. Researchers can launch the virtual machine in their browser but, because the user environment is temporary, once the session is closed all new results are lost. If changes are made, the user must download any changed files or output they wish to keep.
Tutorial 1: launching and using a Jupyter Notebook on Binder
This tutorial demonstrates the use of computational notebooks for transparent dissemination of data analysis workflows and results. The tutorial steps through a metabolomics computational workflow implemented as a Jupyter Notebook and deployed on Binder. The workflow is designed to analyse a deconvolved and annotated metabolomics data set (provided in an Excel workbook) and is an example of the standard data science axiom: Import, Tidy, Model, and Visualise.
The Jupyter notebook for this tutorial is named Tutorial1.ipynb and is available at GitHub in the repository https://github.com/cimcb/MetabWorkflowTutorial. This repository can be downloaded (cloned) to the researcher’s own computer, or run on the Binder service. In the text we assume that the tutorial is being run using the Binder service. To open the notebook on Binder, go to the tutorial homepage: https://cimcb.github.io/MetabWorkflowTutorial and click on the topmost “Launch Binder” icon to “launch the tutorial environment in the cloud”. It will take a short while for Binder to build and deploy a new temporary virtual machine. Once this is ready the Jupyter notebook landing page will show the files present in this copy of the GitHub repository (Supplementary Fig. 1).
The tutorial workflow analysis interrogates a published dataset used to discriminate between samples from gastric cancer and healthy patients (Chan et al. 2016). The dataset is available in the Metabolomics Workbench database (http://www.metabolomicsworkbench.org, Project ID PR000699). For this tutorial, the data are stored in the Excel workbook GastricCancer_NMR.xlsx using the Tidy Data framework (Wickham 2014): each variable is a column, each observation is a row, and each type of observational unit is a table. The data are split into two linked tables. The first, named ‘Data’, contains data values related to each observation, i.e. metabolite concentrations M1 … Mn, together with metadata such as ‘sample type’, ‘sample identifier’ and ‘outcome class’. The second, named ‘Peak’, contains data that link each metabolite identifier (Mi) to a specific annotation and optional metadata (e.g. mass, retention time, MSI identification level, number of missing values, quality control measures, etc.). The Excel file can also be downloaded from the Binder virtual machine for inspection on your own machine by selecting the checkbox next to the filename and clicking on the Download button in the top menu (Supplementary Fig. 1).
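To make the two-table layout concrete, the following minimal sketch builds toy versions of the ‘Data’ and ‘Peak’ tables in memory with Pandas. The column names (SampleID, SampleType, Class, Name, Label) and all values are illustrative assumptions and are not taken from GastricCancer_NMR.xlsx.

```python
import pandas as pd

# Toy 'Data' table: one row per observation (sample), with metadata columns
# followed by one column per metabolite (M1 ... Mn). All values are invented.
data = pd.DataFrame({
    "SampleID": ["s1", "s2", "s3", "s4"],
    "SampleType": ["Sample", "Sample", "QC", "Sample"],
    "Class": ["GC", "HE", "QC", "GC"],
    "M1": [10.2, 8.1, 9.0, 11.5],
    "M2": [0.51, 0.47, None, 0.62],
})

# Toy 'Peak' table: one row per metabolite identifier, linking it to an
# annotation (hypothetical labels).
peak = pd.DataFrame({
    "Name": ["M1", "M2"],
    "Label": ["metabolite A", "metabolite B"],
})

# The tables are linked through the metabolite identifier: every 'Name'
# in the Peak table should be a column of the Data table.
missing = [name for name in peak["Name"] if name not in data.columns]
assert not missing, f"Peak rows without a Data column: {missing}"

# Per-metabolite metadata can be derived from Data and stored alongside Peak.
peak["n_missing"] = [int(data[name].isna().sum()) for name in peak["Name"]]
print(peak)
```

Keeping the metabolite identifiers as the only link between the tables means annotations can be revised in ‘Peak’ without touching the measured values in ‘Data’.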
To begin the tutorial, click on the Tutorial1.ipynb filename (Supplementary Fig. 1). This will open a new tab in your browser presenting the Jupyter notebook (Supplementary Fig. 2). At the top of the page there is a menu bar and ribbon of action buttons similar to those found in other GUI-based software, such as Microsoft Word. The interface is powerful, and it is worth taking time to become familiar with it, but for this tutorial only the “Run” button and the “Cell” and “Kernel” drop down menus are required.
The rest of the page is divided into “code cells” and “text cells”. The “text cells” briefly outline the context and computation of the “code cells” beneath them. Code and text cells can be distinguished by their background colour (code cells are slightly grey, text cells are slightly red), by the text formatting (code cells have a fixed-width font, text cells have word processor-like formatting), and by the “In [ ]:” marker text that is present next to each code cell.
To run a single code cell, first select it by clicking anywhere within the cell, which will then be outlined by a green box (if you select a text cell, this box is blue—Supplementary Fig. 3). Once a cell is selected, the code in the cell can be executed by clicking on the “Run” button in the top menu. Multiple cells can also be run in sequence by choosing options from the dropdown list in the “Cell” menu item. The options include “Run All” (runs all the cells in the notebook, from top to bottom), and “Run all below” (run all cells below the current selection). These can be used after changing the code or values in one cell to recalculate the contents of subsequent cells in the notebook.
The “computational engine” that executes the code contained in a notebook document is called the kernel, and it runs continually in the background while that notebook is active. When you run a code cell, that code is executed by the kernel and any output is returned back to the notebook to be displayed beneath the cell. The kernel stores the contents of variables, updating them as each cell is run. It is always possible to return to a “clean” state by choosing one of the “Restart Kernel” options from the “Kernel” menu item’s dropdown list. Selecting “Restart & Run All” from the “Kernel” dropdown menu will restart the kernel and run all cells in order from the start to the end of the notebook.
Beginning from a freshly-loaded Tutorial1.ipynb notebook in Binder, clicking on “Cell->Run All” or “Kernel->Restart & Run All” will produce a fully executed notebook that matches the output in the static supplementary html file Tutorial1.html (cimcb.github.io/MetabWorkflowTutorial/Tutorial1.html). Choosing “Restart and Clear Outputs” from the “Kernel” dropdown menu will reset the notebook, clearing all data from memory and removing any outputs, and so restore its original state.
The tutorial can be completed by reading the text cells in the notebook and inspecting, then running, the code in the corresponding code cells. This is an example of “Literate Programming” that weaves traditional computing source code together with a human-readable, natural language description of the program logic (Knuth 1984). The notebook interface makes notable advances on the original proposition for literate programming that are used in this tutorial, the most significant of which is that the output of running the code is also incorporated into the document. The browser interface allows for further enhancements, such as hyperlinks to external webpages for explanations and further reading about technical terms, embedded interactive spreadsheet-like representation of large datasets (e.g. section 2. Load Data and Peak Sheet), and embedded interactive graphical output (e.g. section 4. PCA Quality Assessment).
Tutorial 2: interacting with and editing a Jupyter Notebook on Binder
The second tutorial is interactive and showcases the utility of computational notebooks for both open collaboration and experiential education in metabolomics data science. Tutorial 2 is accessed on GitHub through the same process as described for Tutorial 1. To open the notebook on Binder, go to the tutorial homepage: https://cimcb.github.io/MetabWorkflowTutorial and click on the topmost “Launch Binder” icon to “launch the tutorial environment in the cloud”, then click the Tutorial2.ipynb link on the Jupyter landing page. This will present a new tab in your browser containing the second tutorial notebook. The functionality of this notebook is identical to Tutorial 1, but now the text cells have been expanded into a comprehensive interactive tutorial. Text cells, with a yellow background, provide the metabolomics context and describe the purpose of the code in the following code cell. Additional coloured text boxes are placed throughout the workflow to help novice users navigate and understand the interactive principles of a Jupyter Notebook:
Action (red background labelled with ‘gears’ icon)
Red boxes provide suggestions for changing the behaviour of the subsequent code cell by editing (or substituting) a line of code. For example, the first red cell describes how to change the input dataset by changing the path to the source Excel file.
Interaction (green background with ‘mouse’ icon)
Green boxes provide suggestions for interacting with the visual results generated by a code cell. For example, the first green box in the notebook describes how to sort and colour data in the embedded data tables.
Notes (blue background with ‘lightbulb’ icon)
Blue boxes provide further information about the theoretical reasoning behind the block of code or a given visualisation. This information is not essential to understand Jupyter Notebooks but may be of general educational utility and interest to new metabolomics data scientists.
To complete the tutorial, first execute the notebook by selecting the “Restart & Run All” option in the “Kernel” dropdown menu. Move through the notebook one cell at a time, reading the text and executing the code cells. When prompted, complete one (or multiple) modifications suggested in each ‘action’ box, and then click “Run all below” from the “Cell” dropdown menu, observing the changes in cell output for all the subsequent cells. Further guidance is included in the notebook itself.
It is possible to save the edited notebook to the Binder environment, but any changes made to the notebook during the tutorial are lost when the Binder session ends. To keep changes made to the tutorial notebook or its output, modified files must be downloaded to your local computer before you end the session. Modified files can also be downloaded from the Jupyter landing page. To download files, click the checkbox next to each file you wish to download, and then click the ‘Download’ button from the top menu.
Tutorial 3: downloading and installing a Jupyter Notebook on a local machine
Jupyter Notebooks can be run on a standard laptop or desktop computer in a number of different ways, depending on the operating system. The Anaconda distribution provides a unified, platform-independent framework for running notebooks and managing Conda virtual environments that is consistent across multiple operating systems, so for convenience we will use the Anaconda interface in these tutorials.
To install the Anaconda distribution, first download the Python 3.x Graphical Installer package from the Anaconda webpage (https://www.anaconda.com/distribution/), then open the installer and follow the instructions to complete the installation (https://docs.anaconda.com/anaconda/install/). Be sure to download the installer package specific to your computer’s operating system (e.g. macOS, Microsoft Windows or Linux). When the process is completed, the “Anaconda Navigator” application will be installed in your applications folder.
To start Jupyter on your machine first launch the Anaconda Navigator application. This will display a home screen with a sidebar menu on the left-hand side and the main area showing a panel of application icons, with short descriptions. Locate the Jupyter Notebook application and icon in this panel and click the “launch” button under the icon. This will start a Jupyter web server and open the Jupyter landing page in your default web browser. To run an existing Jupyter notebook, navigate to the appropriate folder on your computer’s filesystem in the Jupyter landing page, and click on the notebook (.ipynb) file you wish to open. To end a Jupyter session, click on the “quit” button in the top right-hand corner of the Jupyter landing page. Quit now if you have been working along.
To run the Tutorial notebooks, we need to download the tutorial repository containing those notebooks from GitHub and set up a local “virtual environment” that contains the programming libraries and software tools necessary to run the code cells in the notebooks.
To download the notebook and associated files from the GitHub repository page (https://github.com/cimcb/MetabWorkflowTutorial), click on the green button labelled “clone or download” and choose the option to “Download ZIP”. Save the zip file (MetabWorkflowTutorial-master.zip) in a convenient location. Extract the zip file to create a new folder, called “MetabWorkflowTutorial-master”, in the same location as the .zip file. The contents of this folder are the files visible in the repository at the GitHub site. We will refer to this folder as the “repository root”, or just “root”.
The Jupyter Notebooks in the repository require several Python packages to be installed in order to be run successfully. It would be possible to install these on the local computer so that they are visible to, and accessible by, all notebooks on the computer. However, it is often the case that different repositories and projects require alternative, incompatible versions of these packages. So, in practice, it is not usually possible to install a single set of packages that meets the needs of all the projects that a user would want to run. A technical solution to this is to create a new “virtual environment” that contains only the packages necessary for a project to run, and keeps them separate (“sandboxes” them) from any other projects. Environments can be created when required, and deleted when no longer necessary, without affecting other projects or the operation of the computer. It is good practice to create a new virtual environment for each project, and typical that multiple such environments are set up, and exist simultaneously on the same computer. The Anaconda Navigator application provides an interface for creating and managing these virtual environments.
To create a new virtual environment for the tutorial, first open the Anaconda Navigator application and click on “Environments” in the left-hand sidebar. The main panel will change to list any virtual environments that have been created using Anaconda. If no environments have been created only “base (root)” will be listed. To the right of each virtual environment Anaconda Navigator lists the packages that have been installed in that environment.
It is common to create a new environment “from scratch” by specifying individual packages in the Anaconda Navigator, but for this tutorial we will use a configuration file called “environment.yml” that is part of the GitHub repository. This file describes all the packages that are necessary to reproduce an environment for running the tutorial notebooks. To create a new environment from this configuration file, click on “Import” (at the bottom of the main panel of Anaconda Navigator) and navigate to the repository root folder. By default Anaconda Navigator expects configuration files with “.yaml” or “.yml” file extensions, so only the file named “environment.yml” should be highlighted in the file dialog box. Select this file and click “Open”. The “Import new environment” dialogue box will have autocompleted the “Name:” field for the new environment (“MetabWorkflowTutorial”). To complete creation of the new environment, click on the “Import” button. Anaconda Navigator will show a progress bar in the main panel as it creates the new environment.
Once the environment has been created, click on the “Home” icon in the left-hand sidebar. In the main panel, the dropdown should now read “Applications on [MetabWorkflowTutorial]”, which indicates that the MetabWorkflowTutorial environment which was just created is now active. If “MetabWorkflowTutorial” is not visible, click on the dropdown menu and select that environment. Click on the “Launch” button under Jupyter Notebook in the main panel, to launch Jupyter in your web browser.
The Jupyter landing page will start in your home folder. To use the tutorial notebooks, navigate to the repository root. The notebooks for Tutorials 1 and 2 can now be run on your own computer, just as on Binder, by selecting the appropriate notebook file. However, any output or changes to the contents of a notebook file will now be saved persistently on the local computer and can be reused at any time.
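As an alternative, you may wish to create a virtual environment and launch Jupyter in your web browser through a terminal window (command window). To do this, open the terminal window (type ‘terminal’ in your computer’s search box), then type the following five lines of code:
    git clone https://github.com/cimcb/MetabWorkflowTutorial
    cd MetabWorkflowTutorial
    conda env create -f environment.yml
    conda activate MetabWorkflowTutorial
    jupyter notebook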
Line one creates an exact copy of the GitHub file directory on your local machine in the folder ‘MetabWorkflowTutorial’. Line two moves you into that folder. Line three creates the virtual environment called “MetabWorkflowTutorial” using the contents of the environment.yml file. Line four activates the virtual environment. Line five launches a local Jupyter notebook server and opens the Jupyter landing page in your web browser, from which you can run the tutorials.
To close the local Jupyter notebook server press “<control>c” twice in the terminal window and it will ask you to confirm the action. You may then close the virtual environment by typing:
    conda deactivate
When you no longer need the virtual environment, the following will delete it from your computer:
    conda env remove --name MetabWorkflowTutorial
If you created a virtual environment using Anaconda Navigator you will have to delete the environment before creating a fresh version.
Tutorial 4: creating a new Jupyter Notebook on a local computer
Tutorial 4 builds on Tutorial 3. Please ensure that the Anaconda Python distribution is installed on your computer.
In this tutorial we will create a new Jupyter notebook that demonstrates the use of visualisation methods available in Anaconda Python without the need to install additional third-party packages. We will upload a generic metabolomics data set and write code to produce four graphical outputs:
A histogram of the distribution of QCRSD across the data set.
A kernel density plot of QCRSD vs. D-ratio across the data set.
A PCA scores plot of the data set labelled by sample type.
A bubble scatter plot of molecular mass vs. retention time, with bubble size proportional to QCRSD.
The data set included in this tutorial is previously unpublished, and of arbitrary biological value. It describes serum data acquired using a C18+ LC–MS platform consisting of 3084 unidentified peaks and 91 samples. Of the 91 samples, 23 are pooled QCs injected every 5th sample across the experimental run. The Peak table contains information on the molecular mass, retention time of each detected metabolite, and the associated QCRSD and D-ratio values calculated following recommended quality control procedures (Broadhurst et al. 2018). The data are presented in an Excel file using the previously-described “tidy data” format.
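For readers unfamiliar with the two quality control measures, the sketch below shows one common way to compute them per peak. It assumes QCRSD is the relative standard deviation of the pooled QC injections and the D-ratio is the QC standard deviation expressed as a percentage of the study-sample standard deviation. The function name, toy values, and this exact formulation are assumptions for illustration; the values in the tutorial’s Peak table were calculated by the authors following Broadhurst et al. (2018).

```python
import pandas as pd

def qc_metrics(values: pd.DataFrame, is_qc: pd.Series) -> pd.DataFrame:
    """Per-peak QCRSD and D-ratio, both in percent (hypothetical helper)."""
    qc = values[is_qc]          # rows that are pooled QC injections
    samples = values[~is_qc]    # rows that are study samples
    qc_sd = qc.std(ddof=1)
    qcrsd = 100 * qc_sd / qc.mean()                # technical precision
    d_ratio = 100 * qc_sd / samples.std(ddof=1)    # technical vs. total spread
    return pd.DataFrame({"QCRSD": qcrsd, "D-ratio": d_ratio})

# Toy intensities: the first three rows are QCs, the last three are samples.
values = pd.DataFrame({
    "M1": [100.0, 102.0, 98.0, 60.0, 140.0, 100.0],
    "M2": [50.0, 70.0, 30.0, 45.0, 55.0, 50.0],
})
is_qc = pd.Series([True, True, True, False, False, False])

metrics = qc_metrics(values, is_qc)
print(metrics.round(1))
```

In this toy example M1 is measured precisely relative to its spread across samples, whereas M2 varies more between QC injections than between samples, the kind of peak that quality control filtering is intended to flag.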
Tutorial 4 is available in a GitHub repository at https://github.com/cimcb/MetabSimpleQcViz. Download and unzip the repository to a folder on your own computer, using the method described in Tutorial 3 (the location of this folder will now be the “repository root”). This copy (clone) of the repository is for reference only as we will be recreating the contents of this directory under a different name as we move through this tutorial and Tutorial 5.
First create a new Jupyter notebook. To do this, start the Anaconda Navigator application if it is not already open. Ensure that “[base (root)]” is selected in the “Applications on” dropdown list of the main panel, then launch Jupyter Notebook. This will start a new Jupyter notebook server in your browser and show files from the home directory on the landing page. Navigate to the repository root (the “MetabSimpleQcViz” folder). To create a new notebook, click on the “New” button in the top right corner of the page. This will list supported Jupyter Notebook languages in the dropdown. Select “Python 3” from this list. A new tab will open in your browser, showing a blank notebook called “Untitled” (at the top of the page). Rename the notebook by clicking on the text “Untitled” and replacing it with “myExample”. This will create a new file in the repository called “myExample.ipynb”.
When the “myExample.ipynb” notebook is launched, it contains a single empty code cell. We will use this cell to add a title to the notebook. To do this we need to convert the cell to a Markdown cell, then type a header in the cell, and execute it. First, select the empty cell by clicking anywhere within the cell. To convert the cell type, click on the dropdown field marked “Code” in the top menu bar and select “Markdown”. The “In [ ]:” prompt should disappear from the left-hand side of the cell. Now click inside the cell to see the flashing cursor that indicates the cell is ready to accept input. Type “# Tutorial 4” and click on the “Run” button in the top menu. The formatting of the first cell should change, and a new code cell should appear beneath it.
In the new code cell, we will place Python code that:
Imports the Pandas package (necessary to load the Excel spreadsheet).
Loads the dataset into variables called “data” and “peak”.
Reports the number of rows and columns in the tables.
Displays the first few lines of the resulting table.
The required code is provided in the static supplementary html file Tutorial4.html (https://cimcb.github.io/MetabSimpleQcViz/Tutorial4.html) and “Tutorial4.ipynb” notebook and can be copy-and-pasted or typed in manually, as preferred. When the code is complete, click on the “Run” button again to execute the cell. On completion, two tables should be visible below the code cell (one for “data”, one for “peak”), and a new empty code cell should be placed beneath this.
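The four steps listed above could look something like the following sketch. It is not a copy of the code in Tutorial4.html: the helper names are hypothetical, the file name ‘data.xlsx’ and the sheet names ‘Data’ and ‘Peak’ are assumed from the descriptions elsewhere in this paper, and reading .xlsx files with Pandas additionally assumes an Excel reader such as openpyxl is installed in the environment.

```python
from pathlib import Path

import pandas as pd

def load_tables(path, data_sheet="Data", peak_sheet="Peak"):
    """Read the two sheets of a tidy-format workbook into DataFrames."""
    data = pd.read_excel(path, sheet_name=data_sheet)
    peak = pd.read_excel(path, sheet_name=peak_sheet)
    return data, peak

def describe_table(name, table):
    """Return a one-line summary of a table's dimensions."""
    n_rows, n_cols = table.shape
    return f"{name}: {n_rows} rows x {n_cols} columns"

workbook = Path("data.xlsx")  # assumed file name in the repository root
if workbook.exists():         # skip quietly when run outside the repository
    data, peak = load_tables(workbook)
    print(describe_table("data", data))
    print(describe_table("peak", peak))
    print(data.head())        # first few rows of each table
    print(peak.head())
```

Inside a notebook, ending a cell with `data.head()` renders a formatted table, which is usually easier to read than printed output.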
Next we add the code required to draw a histogram of the RSD values across all the detected peaks in this data set. Using the Tutorial4.html file as a guide, add in the required explanatory text and Python code and click on the “Run” button after each step.
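As a rough illustration of this step, the sketch below draws a histogram with Matplotlib from simulated RSD values. The column name ‘QC_RSD’ and the simulated values are assumptions; in the tutorial the values come from the Peak table, and the code in Tutorial4.html may differ.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated stand-in for the Peak table (values are not real data).
rng = np.random.default_rng(0)
peak = pd.DataFrame({"QC_RSD": rng.lognormal(mean=2.5, sigma=0.6, size=500)})

# Histogram of RSD values across all peaks.
fig, ax = plt.subplots(figsize=(6, 4))
counts, edges, _ = ax.hist(peak["QC_RSD"], bins=30,
                           color="steelblue", edgecolor="white")
ax.set_xlabel("QC-RSD (%)")
ax.set_ylabel("Number of peaks")
ax.set_title("Distribution of QC-RSD across peaks")
fig.tight_layout()
```

In a Jupyter notebook the figure appears directly beneath the code cell once the cell is run.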
Continue adding in the remaining explanatory text and Python code using the Tutorial4.html file. After completion you will have a Jupyter notebook that takes a metabolomics dataset through the process of generating diagnostic plots for quality control. Once you are satisfied with the state of the notebook, it can be saved by clicking on the floppy disk icon (far left on the menu). The notebook can then be closed by clicking “File” and then “Close and Halt” from the top Jupyter menu. The notebook tab will be closed, showing the Jupyter landing page. The Jupyter session can be closed by clicking on “Quit” on the Jupyter landing page tab of your web browser (this tab may not close automatically).
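Among the remaining plots, the PCA scores plot involves the most computation. The sketch below shows one way to obtain PCA scores using only NumPy, by singular value decomposition of the autoscaled data matrix. It is an illustration under stated assumptions (complete data with no missing values, no log transformation, simulated input) and not necessarily the method used in Tutorial4.ipynb.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Scores for the first principal components via SVD (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Autoscale: mean-centre each column and divide by its standard deviation.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)  # fraction of variance per component
    return scores, explained[:n_components]

# Simulated matrix: 20 samples x 5 peaks, with two correlated columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=20)

scores, explained = pca_scores(X)
print(scores[:3])
print(explained)
```

Plotting the first column of `scores` against the second, coloured by sample type, gives the scores plot; tightly clustered QC samples in such a plot are generally taken as a sign of good analytical precision.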
Tutorial 5: deploying a Jupyter Notebook on Binder via GitHub
Tutorial 5 builds on Tutorials 3 and 4. To complete this tutorial, we will create a new GitHub repository. A GitHub account is required for this. If you do not already have a GitHub account, please follow the instructions on GitHub at https://help.github.com/en/articles/signing-up-for-a-new-github-account.
To create a new repository, log into the GitHub site (if you are not already logged in) and navigate to your profile page (https://github.com/<yourusername>), then click on the “Repositories” link at the top of the page. To start a new repository, click on the “New” button at the top right of the page. This will open a new page titled “Create a new repository.” Each repository requires a name, and this should be entered into the “Repository name” field; use the name “JupyterExample”. Beneath the Repository Name field there is an optional Description box, and then below this a choice of public or private repository. Ensure that the ‘Public’ option is chosen. Select the checkbox to “Initialize this repository with a README” (this is a file in which you will write useful information about the repository, later). Below this is the option to “Add a license” file. There are many alternative licences to choose from (https://choosealicense.com/), and the choice for your own projects may be constrained by funder, home organisation, or other legal considerations. We strongly recommend that all projects carry a suitable licence, and that you add the MIT License to this tutorial repository. Now, to create the repository, click the “Create repository” button.
On successful creation of the repository, GitHub will present the new repository’s home page (this will be at https://github.com/<yourusername>/JupyterExample), with some options for “Quick setup”. Under the “Quick setup” notice, the LICENSE and README.md files will be shown, and clicking on either will open it. The README.md file for a repository is automatically displayed on the homepage, but in this case it is empty (we can add text later).
Now we need to add the new Jupyter notebook and the Excel data file from Tutorial 4 to the repository. We will do this using the GitHub “Upload files” interface, though there are several other ways to perform this action. To use the GitHub interface, click on the ‘Upload files’ button and either drag files from your computer, or click on “choose your files” to select files with a file dialogue box. Add the ‘myExample.ipynb’ and ‘data.xlsx’ files from your repository root. These files will be placed in the “staging area”, visible on the webpage but not yet committed to the repository.
GitHub imposes version control as a form of best practice on the repositories it hosts. One of the features of version control best practice is that a description of the changes made to a repository should accompany every “commit” to that repository. To do this, enter the text “Add data and notebook via upload” to the top field under “Commit changes.” Then, to commit the files to the repository, click on the “Commit changes” button.
Now that there is a publicly hosted GitHub repository containing a notebook and dataset, we are nearly ready to make the notebook available interactively through Binder. The final necessary component required is a configuration file. This file is vital, as it defines the environment Binder will build, with a specified programming language and all the necessary packages for the notebook to successfully operate. This configuration file is an Anaconda YAML file called ‘environment.yml’ and it contains a list of dependencies (the programming language version and a list of packages used in the notebook) and channels (the location of these resources in the Anaconda cloud library). Detailed consideration of how to create these files is beyond the scope of the tutorial. Upload the environment.yml file from Tutorial 4 (it is also included in the Supplementary File, to cut and paste if required) to the repository in the same way that the notebook and data files were uploaded.
We are now ready to build and launch a Binder virtual machine for this repository. To do this, open https://mybinder.org in a modern web browser. The landing page presents a set of fields to be completed for Binder to build a virtual machine. The minimal requirement is to specify a GitHub repository URL in the “GitHub repository name or URL” field. Enter the path to the home page of your repository (https://github.com/<yourusername>/JupyterExample) in this field, and click on the ‘Launch’ button. Binder will use the configuration file in the root directory to build and store a Docker image for your repository. This process often takes several minutes.
Once the Binder repository is built, the URL shown in the field “Copy the URL below and share your Binder with others” (here: https://mybinder.org/v2/gh/<yourusername>/JupyterExample/master) can be shared with colleagues. A button to launch the Binder can also be added to the README file on GitHub (we strongly recommend this). Anyone using this URL in their browser will be provided with an individual interactive session (1 CPU, 2 GB RAM running on Google Cloud) that makes the notebooks of your repository available in an interactive and editable form.
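The shareable URL follows a predictable pattern, so it can be assembled programmatically, along with the Markdown for a launch button in the README. The sketch below is a convenience under stated assumptions: the two function names are hypothetical, the badge image address is the one offered on the mybinder.org landing page, and the final URL segment must match a branch that actually exists in your repository (‘master’ here, as in the text; repositories created more recently may use ‘main’).

```python
def binder_url(user: str, repo: str, ref: str = "master") -> str:
    """Build the mybinder.org launch URL for a public GitHub repository."""
    return f"https://mybinder.org/v2/gh/{user}/{repo}/{ref}"

def binder_badge_markdown(user: str, repo: str, ref: str = "master") -> str:
    """Markdown for a 'launch binder' button that can be pasted into README.md."""
    badge = "https://mybinder.org/badge_logo.svg"
    return f"[![Binder]({badge})]({binder_url(user, repo, ref)})"

# Replace 'yourusername' with your own GitHub username.
print(binder_url("yourusername", "JupyterExample"))
print(binder_badge_markdown("yourusername", "JupyterExample"))
```

Pasting the second printed line into README.md displays a clickable badge on the repository home page that opens the Binder session.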
Congratulations, you have created your first Binder notebook! Now share it with your colleagues!
It is important to remind users that data uploaded to a public GitHub repository are indeed public. If the user wants to share Jupyter Notebooks but not any associated metabolomics data (or other sensitive data), then clear instructions on how to securely access and download the data need to be included in the notebook text, and the location of the downloaded data needs to be included in the requisite notebook code block (this could be a local hard drive, or the data could be uploaded to Binder while in session). If institutional security concerns preclude using a collaborative workspace such as Binder, then alternative cloud solutions such as Microsoft Azure can be investigated. Before doing so, it is probably best to consult your institute’s IT representative.