What You Will Learn in This Chapter

When performing an image analysis pipeline, a programming language like Python is mainly used for two distinct applications: (1) the analysis of the acquired images, such as background removal, noise reduction, object segmentation, and measurements of biological structures and events, and (2) the analysis of the data obtained as a result of the image analysis, such as calculating a histogram from the noise-removed image or statistics on the shape of the segmented objects. The aim of this chapter is to show how Python can be used as a tool to analyze the data obtained in the final step of a bioimage analysis workflow. We will learn how to arrange the data into a tidy form, a way of structuring the data that simplifies the later analysis. The Python libraries pandas, for data handling, and Bokeh and HoloViews, for data plotting, are discussed throughout this chapter. Jupyter notebooks are available to follow the examples; only minimal Python knowledge is required (the concepts of Python lists, dictionaries and arrays should be known).Footnote 1

3.1 Tools to Follow the Chapter

This chapter uses AnacondaFootnote 2 as the Python distribution and Jupyter NotebooksFootnote 3 to run the Python code. As mentioned, Jupyter Notebooks (specified at the beginning of each section) are available for the reader to follow the examples and to try out the Python code:

  1. NB-0-Installation_Guide.ipynb: Installation of the Python distribution and all the packages needed to follow the chapter.

  2. NB-1-Python_Introduction.ipynb: Brief introduction to basic operations in Python, which will be useful if you are new to Python.

  3. NB-2-Pandas_Data_Handling.ipynb: This notebook covers Sect. 3.3, how to handle data using the package pandas.

  4. NB-3-Bokeh_Plotting.ipynb: This notebook covers Sect. 3.4, specifically Sect. 3.4.1.1—using Bokeh to create interactive figures.

  5. NB-4-Holoviews_Plotting.ipynb: This notebook covers Sect. 3.4, specifically Sect. 3.4.1.2—using HoloViews to create interactive figures.

These notebooks are available on GitHub.Footnote 4

3.2 Why Python?

Python is a high-level programming language designed in the early 1990s by Guido van Rossum. It executes instructions without the need for a compiler (i.e., it is an interpreted language) and its operations are resolved at run-time (i.e., it has dynamic semantics), which makes it a fast language to prototype in. It has been widely used for at least two decades, which makes it beginner-friendly thanks to the large number of tutorials and documentation available on the web. In fact, in recent years Python has shown a huge growth in demand, reflected in the increase in:

  • Publications: Books, conferences, journals;

  • Users: Number of downloads and number of uses. A trend calculated by Stack Overflow,Footnote 5 which counts the tags and posts on the platform, shows a very high and still increasing popularity of the Python language.

  • Applications: Web and internet development, scientific and numeric computing, education (teaching programming), desktop GUIs (graphical user interfaces), and software development, among other applications.Footnote 6

The purpose of this chapter is to show one way to use Python as a tool to analyze data and obtain browser-interactive figures which are easy to share. We will use Anaconda as our Python distribution to simplify the package installations. As our web-based application for writing and running Python code we will use Jupyter Notebooks, which combine code with narrative text, equations, and visualizations. There are other great notebook alternatives, such as Google ColaboratoryFootnote 7 (Colab for short), which allows the execution of Python in the browser without any prior installation and with free access to GPUs.

3.2.1 Python Versions

Since the first release in 1994, there have been several Python versions. Newer versions add features either to the language itself, to its built-in functions, or to standard library support modules (Mertz, 2015). The two most recent major versions are Python 2 and Python 3. Python 2 has not received updates or bug-fixes since January 2020.Footnote 8 In this chapter we will be using Python 3.6 or higher.

3.2.2 Python Packages and Environments

Python, like many other programming languages, allows modular programming. This means that the code can be broken down into smaller, more manageable scripts called modules. Grouping such modules can then result in a Python package. For example, NumPy (https://numpy.org) is a package for scientific computing (Harris et al., 2020) which we will be using later in this chapter. Depending on what we want to achieve, we will need different existing packages.

There are several ways to install packages which will be explained later. Once the packages are installed, in order to use them we have to make them available in our code. For example, to use all the functions in the NumPy package, we first need to import NumPy using the import statement:

figure a
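In plain text, this import statement is simply:

    import numpy as np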

Here we have imported the package NumPy, which is now bound to the name we have chosen, np (the standard convention in this case). This means that whenever we want to call a NumPy function, e.g. to calculate the square root of 4, we will use:

figure b
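In plain text, this call is:

    np.sqrt(4)   # returns 2.0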

Just as Python has different versions, so do the packages. Depending on the project we work on, we might need different package versions; this becomes problematic when two different projects need different versions of the same package. This is where environmentsFootnote 9 are very useful. They allow the creation of an isolated environment for each project, in which the package versions are independent of those in other environments. There are several ways to set up a virtual environment, depending on the tools used to run Python. Later in this chapter we will learn one of the many ways to do so.

3.2.3 Anaconda

Jupyter Notebook:  NB-0-Installation_Guide.ipynb

Installing Python and its packages can sometimes be complicated, which is why we describe here an easy way to do so, with a minimal amount of potential problems. There are many ways to set up a Python environment for scientific computing or for any other purpose. Two common ones are:

  1. Installing packages on demand from the Python Package Index (PyPI), a repository of software for the Python programming language. As of today, there are more than 250,000 packages which can be downloaded from PyPI using the package installer pip.

  2. Downloading a Python distribution that already contains many of the most popular packages needed. One of the major distributions, and the one we are using in this chapter, is Anaconda,Footnote 10 which contains conda to manage and install packages. You could also install Miniconda,Footnote 11 which is a free minimal installer for conda.

pip is the recommended tool for installing packages from PyPI. pip installs Python software, but may require that the system has compatible compilers, and possibly libraries, installed before invoking it. Another installer tool is conda, which can handle both Python and non-Python installation tasks. conda is an open-source, cross-platform package and environment manager that can install and manage packages from the Anaconda repository, Anaconda Cloud and other channels such as conda-forge.Footnote 12 There is never a need to have compilers available to install conda packages. Additionally, conda packages may also contain C or C++ libraries, R packages, or any other software.

We will use conda to install the packages for this chapter. However, it is good to understand the main differences between these two package managers—pip and conda—to know when to use which of them. They are summarized in Table 3.1.

Table 3.1 Main differences between pip and conda. For a more detailed explanation, visit https://www.anaconda.com/blog/understanding-conda-and-pip
Fig. 3.1
figure 1

Upper panel: Anaconda Navigator GUI included with the Anaconda distribution. It contains, among others, JupyterLab (which is a more interactive version of the Jupyter Notebook), Jupyter Notebook (which we will be using in this chapter) and Spyder, which is more similar to Matlab (it contains a variable explorer which resembles the Matlab workspace). In the Anaconda Navigator we can manage the environments and the packages. We can also do this by using the Anaconda Prompt (command-line shell). Lower panel: Example of a Jupyter notebook with the two main types of cell: Markdown, for text and equations, and Code, for writing Python code (you could also set it up to write R, Julia, Groovy, Java...). The Command Palette shows keyboard shortcuts

As mentioned earlier, we will use Anaconda (https://docs.anaconda.com/anaconda) as our Python distribution to simplify Python and package installations. Moreover, Anaconda is a package manager, an environment manager, a Python/R data science distribution, and a collection of over 7500 open-source packages. It was created with the aim of simplifying package management and deployment. Package versions in Anaconda are managed by the package management system conda. It also includes a graphical user interface (GUI), Anaconda Navigator (Fig. 3.1), which is an alternative to the command-line interface.

3.2.4 Jupyter Notebook

Once Python is installed, there are many ways to run Python code, for example using the command line or terminal by typing in python (or python3, depending on the installation) and hitting enter. However, in this chapter we will run Python code in a web browser in a way which allows us to mix code, text, and equations, so that it resembles a notebook.

When Anaconda is installed, we get Python and, conveniently, several commonly used packages for scientific computing and data science, as well as some applications, including Jupyter Notebook (which can also be installed without Anaconda, using pip).

Project Jupyter is a non-profit, open-source project, born out of the IPython Project (https://ipython.org) in 2014, as it evolved to support interactive data science and scientific computing across many programming languages (https://jupyter.org). Jupyter Notebooks allow us to write code, Markdown text and equations, and to save the notebooks as Hypertext Markup Language (HTML) or even as Portable Document Format (PDF). Figure 3.1 shows an example of a Jupyter Notebook and some of the basic commands to start using it. However, if this is the first time you are using Jupyter Notebook, you might want to check the Project Jupyter recommended documentation: https://jupyter.readthedocs.io.

Once we have installed the Anaconda distribution, downloaded all the packages, and run a Jupyter Notebook, we are ready to start handling and plotting data in the following sections. If you have not done this yet, NB-0-Installation_Guide.ipynb will guide you through the installation steps.

3.3 pandas: Python Data Analysis Library

Jupyter Notebook:  NB-2-Pandas_Data_Handling.ipynb

As part of an image analysis pipeline, we will likely be handling and analyzing measurements of experimental image data. One of the most time-consuming parts is often arranging the data so that it is in a suitable format to perform the analysis and visualization of the results. pandas is a powerful tool for working with tabular data in the Python ecosystem. This section describes the use of pandas and how to arrange the data in a tidy format to make the analysis and visualization easier.

pandas is an open-source library which allows efficient manipulation, reading and writing of (tabular) data. It was initiated by McKinney et al. (2011) and, since then, it has been widely used in the Python community with the aim of being a fundamental high-level building block for doing practical, real-world data analysis in Python (https://pandas.pydata.org/). pandas makes it easy to work with labeled data: we can handle and arrange the data, but we can also attach labeled information to the data points, making it a powerful tool for handling metadata.Footnote 13

The standard way to import the pandas package is:

figure c
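In plain text, these imports presumably read:

    import pandas as pd
    import numpy as np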

Moreover, we will also use the NumPy package, which is why we import it as well at the beginning of our code. In the following sections we will explore the power of pandas' primary data structure, the DataFrame. We will also learn how to import/export data with pandas and how to arrange the data so that it is easier to perform statistical analysis and plotting.

3.3.1 Syntax: Creating a DataFrame

The pandas library is built on top of the NumPy package, which means that most NumPy functions are available for pandas objects. However, what makes pandas so useful compared to plain NumPy objects is the way the data is structured. pandas data structures have rows and columns with a similar appearance to tables in Excel or CSV files (among others), which makes statistical analysis easier. But before we get into more complicated data wrangling methods, we first define the most fundamental units of the pandas data structures: the Series and the DataFrame.

Fig. 3.2
figure 2

Basic structure of a DataFrame and a Series. The name of each component is important—we will be using them throughout the chapter

3.3.1.1 Series

pandas has two main data structures: Series, for 1-dimensional labeled data, and DataFrame, for 2-dimensional labeled data. They have a similar structure: an index, column(s) and rows (Fig. 3.2). Each column has a name associated with it, also known as its label.

A Series is the simpler concept, so we will start by understanding how to create one. The following line of code shows how to initialize a Series.

figure d
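A minimal sketch of the constructor call (the values and labels here are made up):

    s = pd.Series(data=[4.0, 9.0, 16.0], index=["a", "b", "c"], dtype=float)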

Here, data can be a Python dictionary, a NumPy array or a scalar value. The next parameter, index, is a list of axis labels (which is not the same as the column label). If no index is passed, one will be created with the values [0, ..., len(data) - 1]. Also, like a NumPy array, a pandas Series supports a dtype, which can be float, int, bool, etc.

Here are three examples of how to initialize a Series:

figure e
figure f
figure g
figure h
figure i
figure j
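The original listings are not reproduced here, but three possible initializations, one for each kind of data argument mentioned above (the values are made up), could look like this:

    s1 = pd.Series({"a": 1.0, "b": 2.0, "c": 3.0})                         # from a dictionary
    s2 = pd.Series(np.array([0.25, 0.5, 0.75]), index=["q1", "q2", "q3"])  # from a NumPy array
    s3 = pd.Series(5, index=[0, 1, 2], dtype=int)                          # from a scalar value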

A Series is NumPy-array-like, which means that it can be passed to most NumPy functions expecting an array. However, a key difference between a pandas Series and a NumPy ndarray is that operations between Series automatically align the data based on the index. Therefore, if we need an actual ndarray, we can use Series.to_numpy().

3.3.1.2 DataFrame

The most commonly used pandas concept is a DataFrame, a 2-dimensional labeled data structure with columns of potentially different types. Similar to a Series, a DataFrame object can be created using the following line of code.

figure k
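A minimal sketch of the constructor call (made-up data): a dictionary of lists plus explicit row labels.

    df = pd.DataFrame({"Concentration": [1, 2.5, 5], "Result": [2.3, 3.9, 9.5]},
                      index=["r0", "r1", "r2"])
    df.index     # row labels
    df.columns   # column labels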

The data can be a Python dictionary of 1D arrays, lists, dicts or Series, as well as a 2-D NumPy array or another DataFrame. The DataFrame has labeled axes: rows (axis=0) and columns (axis=1). The rows and columns can be accessed through the index and columns attributes, respectively: DataFrame.index and DataFrame.columns.

Once the DataFrame has been defined, we can select, add, and delete columns in similar ways as a Python dictionary.

figure l
figure m
figure n
figure o
figure p
figure q
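Using the df created above, column selection, addition, and deletion could look like this (a hedged sketch; the column names are the made-up ones from before):

    df["Result"]                                           # select a column (returns a Series)
    df["Normalized"] = df["Result"] / df["Result"].max()   # add a new column at the end
    del df["Normalized"]                                   # delete a column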

By default, a column is inserted at the end of the DataFrame. However, using the insert function, we can specify the location (loc) of the new column and the values we want to insert.

figure r
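A minimal sketch of such an insertion (position and values are made up):

    # insert a new column named "Replicate" as the second column (position 1)
    df.insert(loc=1, column="Replicate", value=[1, 2, 3])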

DataFrames are indexed by columns, df[column_name], but we can also select both rows and columns by using:

figure s
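Two common selection idioms, shown here as a sketch on the df defined above:

    df.loc["r0":"r1", ["Concentration", "Result"]]   # select by row and column labels
    df.iloc[0:2, 0:2]                                # select by integer positions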

There are several ways to index a DataFrame; some of them are summarised in Table 3.2.

Table 3.2 Indexing a DataFrame is intuitive to help getting and setting subsets of the data-set. For more information on indexing DataFrames, visit https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

3.3.2 Basic Numeric Operations

pandas has methods and functions to carry out binary operationsFootnote 14 with support for matching and broadcasting behaviour. In the following example we initialize two DataFrames, df1 and df2, using two dictionaries, d1 and d2:

figure t
figure u
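The dictionaries below are hypothetical stand-ins for those used in the notebook; the column name "123" matches the addition example that follows, and a few NaN values are included to illustrate the fill_value behaviour discussed below:

    d1 = {"123": [1.0, 2.0, np.nan], "456": [4.0, 5.0, 6.0]}
    d2 = {"123": [10.0, 20.0, 30.0], "456": [40.0, np.nan, 60.0]}
    df1 = pd.DataFrame(d1)
    df2 = pd.DataFrame(d2)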

Some basic binary operations are addition add(), subtraction sub(), multiplication mul(), and division div(). The following example shows the addition of a column of df1 and a column of df2.

figure v
figure w
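With these (hypothetical) DataFrames, the addition of the two "123" columns could read:

    # element-wise addition; fill_value=0 treats a value missing on one side as 0
    df1["123"].add(df2["123"], fill_value=0)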

The same result can be achieved by computing df1["123"]+df2["123"], but using the add() method allows us to choose the dimensions and labels we want to use. The axis parameter allows using the index (axis=0) or the columns (axis=1) for the addition operation. Moreover, in the case of missing data, there are operations that include the parameter fill_value. If fill_value=0, the missing values, which in a DataFrame are NaN by default, are treated as zeros. If the same value is missing in both DataFrames, it will remain NaN. When computing df1["123"]+df2["123"], if there is a missing value, a NaN will appear in that position.

With a Series or a DataFrame it is very simple to compute descriptive statistics, e.g., the mean value, as computed in the following line of code:

figure x
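A minimal sketch of such a call on the df1 defined above:

    # column-wise mean (axis=0); NaN values are skipped by default
    df1.mean(axis=0, skipna=True, numeric_only=True)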

In this example we apply the mean operation along the axis we choose (axis=0 for the index, axis=1 for the columns). In case there are missing values, which have been replaced with NaN, the skipna parameter, which is True by default, will exclude all NaN values from the computation. Finally, we can choose whether we want to include only float, int or boolean columns in the calculation, by specifying the parameter numeric_only. There are many other descriptive statistics; some examples are shown in Table 3.3. For more examples, visit the website.Footnote 15

Table 3.3 Examples of descriptive statistics for DataFrame and Series

3.3.3 Import Data Using pandas

In the previous section we learned how data is structured in the pandas Series and DataFrame. When performing image analysis tasks we will most likely be using some other software, such as Fiji (Schindelin et al., 2012), to perform segmentation, cell tracking, protein co-localization analysis, etc. The outcome of such an analysis usually comes in the form of a table. A usual next step is to export this table-like data into some software, such as R, Python, MATLAB, etc., to extract useful information.

With pandas we are able to read and write different data types: Microsoft Excel files (pd.read_excel(), df.to_excel()), comma-separated values (CSV) files (pd.read_csv(), df.to_csv()), JSON files (pd.read_json(), df.to_json()), HTML (pd.read_html(), df.to_html()), HDF5 (pd.read_hdf(), df.to_hdf()), and more. (The read functions are top-level pandas functions, whereas the write functions are methods of the DataFrame.)

In this chapter, we focus on CSV files, since they are easy to read into data structures in many programming languages. As a general rule, we should always try to save the data in file formats that are open and readable in many contexts regardless of the specific software of choice.

To read a CSV file into a DataFrame, we use the following line of code:

figure y

Here we show only some of the many parameters available in the CSV reader. They help create a DataFrame that best describes the data. To check all of the available parameters, visit the website.Footnote 16

  • filepath_or_buffer: Any valid string path.

  • sep: Delimiter to use. By default, it is assumed that the data is separated by commas (sep=",").

  • header: Row number(s) to use as column names for the DataFrame. For example, (header=[0,1,3]) will use the rows 0, 1 and 3 as headers, and will skip row 2. The default is to use the first row as column header (header=0).

  • usecols: Returns a subset of the columns. For example, using integer indices of the data columns usecols=[0,1,2] or strings that correspond to the names of the columns in the data ["A", "B", "C"].

  • mangle_dupe_cols: If there are two or more columns with the same name, by default they will be written as "Col", "Col.1", "Col.2". If mangle_dupe_cols=False, columns with the same name will be overwritten.

  • na_values: Additional strings to be recognized as NaN. By default, blank entries are recognized as NaN, as are some other strings such as "NaN" and "nan"; this allows statistics to be applied in a missing-value-friendly manner. With this option, further strings can be specified to be read into the DataFrame as NaN.
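As a hedged sketch, a call combining some of these parameters might look like this (the file name and column names are hypothetical):

    df = pd.read_csv("results.csv",
                     sep=",",                  # values are comma-separated
                     header=0,                 # first row holds the column names
                     usecols=["A", "B", "C"],  # keep only these columns
                     na_values=["missing"])    # additionally treat this string as NaN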

Once the data is imported and we are satisfied with the DataFrame we created, the next step, which helps to get the most out of the data, is to "tidy" the data-set. In the following section we will learn how to accomplish this with our already created, or imported, DataFrame.

3.3.4 Reshape the Data: How to Create Tidy Data

A great part of the time and effort invested in analyzing a data-set goes into organizing the data and handling missing values, among many other preparation steps performed every time new data is collected. The way we typically build, e.g., an Excel file is the most intuitive for human perception; however, we should always try to convert the data into its tidy form. Tidy data-sets are data-sets arranged such that each variable is a column and each observation is a row (Wickham et al., 2014). This section gives a general explanation of what tidy data is, how it can be obtained, and some of the benefits of analyzing a tidy data-set. Wickham defines data tidying as a standard way to clean the data. This allows us to map the meaning of a data-set to its structure (its physical layout). This structuring of the data facilitates the analysis, especially if one is using vectorized programming languages, such as R or Python with NumPy. Specifically, tidy data complements pandas' vectorized operations. We will see some examples in the following sections.

A messy data-set would be any other arrangement of the data. In Table 3.4, we have two examples of typical representations of messy data-sets. The data-table on the left represents the results of a titration experiment in which the goal is to check how a measurable, e.g., the mean fluorescence intensity of a gene expression marker, changes with different pulse durations (columns) and drug concentrations (rows). In this table, both the columns and the rows are labeled. The data-table on the right represents a similar experiment in which the same measurable should be checked, but in this case using two concentrations of DMSO (dimethyl sulfoxide) as control and two concentrations of a drug being tested. In this case we observe what is called a multi-index, with two levels of columns.

Table 3.4 Examples of two messy data sets. Table to the left includes labeled rows and columns. Table to the right contains multi-index: two concentrations for DMSO treatment and two more for the Drug treatment

To convert the examples of messy data shown in Table 3.4 into their tidy forms, we need to identify the variables which should form the columns in our tidy data-set. In the first case, the pulse duration and the concentration of the treatment will be the two variables (the measures of two attributes). In the second case we will have two different treatments, Drug and DMSO, and the concentrations of these treatments: 0.1\(\%\), 0.5\(\%\), 10 \(\mu \)M, 50 \(\mu \)M. Following this rearrangement, we can obtain a corresponding tidy data-set (Table 3.5):

Table 3.5 Example of a tidy form of the data-sets. In both cases (left and right), the columns in the tables are variables, whereas rows are observations: the result of one pulse duration with a specific drug concentration (left), and the result of a concentration from a given treatment (right)

Now we know what a tidy data-set is. The next step is to learn how to implement pandas functions to transform the structure of the data into a cleaned and ready-to-analyze tidy form.

3.3.4.1 Changing the Layout of the Data-Set to Get Tidy Data with pandas

One of the most useful functions to tidy our data-sets is pd.melt(df). This function allows us to gather columns into rows of a DataFrame, which means going from the wide format (as in Table 3.4) to the long format (as in Table 3.5). One thing to consider before melting the DataFrame is to specify which columns hold the values and which hold the variables:

figure z
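A minimal sketch of the melt call and its main parameters (the wide DataFrame here is made up, mimicking the left-hand table in Table 3.4):

    wide = pd.DataFrame({"Concentration": [1, 2.5],
                         "5 min": [2.3, 3.9],
                         "10 min": [9.2, 9.5]})
    tidy = pd.melt(wide,
                   id_vars="Concentration",         # column(s) kept as identifier variables
                   value_vars=["5 min", "10 min"],  # columns to unpivot into rows
                   var_name="Pulse Duration",       # name for the former column headers
                   value_name="Result")             # name for the measured values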

Figure 3.3 illustrates the meaning of each of these parameters and how they will help to reshape our data into a tidy form.

The data in its tidy form is convenient for analysis. However, once we have finished all the analysis, we might want to have the data back in a form which is prettier to visualize as a table. To go in the opposite direction, i.e., from long to wide format, we can pivot our DataFrame:

figure aa
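Continuing the sketch above, pivoting the tidy frame restores the wide layout:

    # rows come from "Concentration", columns from "Pulse Duration", cell values from "Result"
    wide_again = pd.pivot(tidy, index="Concentration", columns="Pulse Duration", values="Result")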

Figure 3.3 also shows the parameters from the pd.pivot(df) function, used to reshape the data into a wide form.

Going back to the messy data examples from the previous section (Table 3.4), we can use pandas function pd.melt() to convert them into their tidy form (Table 3.5).

figure ab
Fig. 3.3
figure 3

Graphical examples of how to melt and pivot DataFrames. Here we show what each parameter represents for the methods melt and pivot, to better understand how the data can be re-arranged and re-shaped. Inspired by the Python Data Wrangling Cheat Sheet by Irv Lustig (https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

After reading and saving the CSV file data into a DataFrame, we use the pd.melt() function. "Concentration" is already a column, so we assign it as our identifier variable. Next, we create a variable column called "Pulse Duration" from what is currently the first row of headers (5 min, 10 min, 20 min and 30 min). Finally, we rename the last column, which contains the measured intensity values, as "Result".

figure ac

In this second example we also read and save the CSV file data into a DataFrame, but in this case we specify that the first two rows contain the variables which will become the headers. The next step is to melt the DataFrame. In this case there are two header rows which we have to convert into variable columns: "Treatment" (DMSO and Drug) and "Concentration" (0.1\(\%\), 0.5\(\%\), 10 \(\mu \)M, 50 \(\mu \)M). Finally, we rename the last column, which contains the measured intensity for a treatment at a given concentration, as "Result".
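One possible way to write this (a hedged sketch; the file name is hypothetical and the sketch relies on melt using the named header levels as variable columns):

    df2 = pd.read_csv("treatments.csv", header=[0, 1])   # first two rows become a column MultiIndex
    df2.columns.names = ["Treatment", "Concentration"]   # name the two header levels
    df2_melt = df2.melt(value_name="Result")             # level names become the variable columns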

Now we have created tidy data, and we can manipulate, model, and visualize it easily and effectively. In the following section we will learn how to manipulate a tidy data-set.

3.3.5 Split-Apply-Combine

Usually, we analyze the data based on some attributes that we want to compare, or we extract meaningful information by performing statistical and/or numerical analysis. Intuitively, to do so, we (1) split the data into groups according to some criterion; (2) apply some functions to analyze the split-up data; and, finally, (3) combine the results and save them in a new data-set. The good news is that there is a conceptual framework for these steps—it is called the Split-Apply-Combine strategy, and was first formalized by Wickham et al. (2011). In this article Wickham describes the strategy as: "break up a big problem into manageable pieces, operate on each piece independently, and then put all the pieces back together". An R package was created around this strategy, and pandas now has its own way to implement the same idea.

This strategy only makes sense if the data is in a tidy format, because the data will be split up according to the selected columns. We can then apply functions to the newly grouped data and combine the results into a new data-set.

For an extensive tutorial on how to apply the split-apply-combine strategy using pandas, please visit the website.Footnote 17

3.3.5.1 Split

The df.groupby() operation performs the splitting step along any data axis. It allows the grouping of a DataFrame, usually by one or more of its columns. The result is a DataFrameGroupBy object. Let us take the two DataFrames from Table 3.5 (in tidy format) as an example; we can apply simple grouping operations:

figure ad
figure ae
figure af
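A minimal sketch of this step (df1_melt and df2_melt are the tidy DataFrames created above):

    grouped1 = df1_melt.groupby("Pulse Duration")                 # group by one column
    grouped2 = df2_melt.groupby(["Treatment", "Concentration"])   # group by several columns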

Once the data is grouped, we can extract a single group by using the get_group() method. For example, we grouped the melted DataFrame df1_melt according to pulse duration, which gave rise to the groups 5 min, 10 min, 20 min and 30 min. Now, we can get one of these groups.

figure ag
figure ah
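For instance, following the grouping sketched above:

    grouped1.get_group("5 min")   # only the rows belonging to the 5-minute pulses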

3.3.5.2 Apply

Once the data has been grouped and split up, we can apply different functions to the newly created DataFrames. For this, we may use one of the following operations:

1. Aggregation: Computes one or more summary statistics for the grouped object. The following example computes the mean and the sum using NumPy functions.

figure ai
figure aj
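A minimal sketch of such an aggregation on the grouped data from above:

    # mean and sum of the Result column for each pulse-duration group
    df1_melt.groupby("Pulse Duration")["Result"].agg([np.mean, np.sum])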

By default, the grouped columns of the aggregation will become the index of the returned object. In order to restore them as regular columns, we can use reset_index(). The aggregation functions reduce the dimension of the returned object. Moreover, we can apply different functions to different columns:

figure ak
figure al
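For example (a hedged sketch with a made-up choice of functions):

    df1_melt.groupby("Pulse Duration").agg({"Result": np.mean,          # mean intensity per group
                                            "Concentration": "count"})  # number of observations per group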

If we want to show a summary of all statistics, one way to do that is by using the describe method:

figure am
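A sketch of such a call on the grouped data from above:

    df1_melt.groupby("Pulse Duration")["Result"].describe()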

This leads to the following:

Pulse Duration   Count   Mean     Std        Min    25%      50%    75%      Max
10 min           4.0     9.525    0.377492   9.2    9.200    9.5    9.825    9.9
20 min           4.0     13.375   0.699405   12.5   13.100   13.4   13.675   14.2
30 min           4.0     17.250   0.369685   16.9   16.975   17.2   17.475   17.7
5 min            4.0     3.925    1.426826   2.3    2.975    4.0    4.950    5.4

2. Transformation: Performs some computation on a specific group. This method returns an object with the same size and index as the grouped object. In the following example we take the grouped object, select one of the groups, and apply two functions to the corresponding values; in this case we compute the square root and the exponential:

figure an
figure ao
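A hedged sketch of this step, reusing the grouping from above:

    group_5min = grouped1.get_group("5 min")
    group_5min["Result"].transform([np.sqrt, np.exp])   # one output column per function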

We can also create our own functions and apply them with the transformation method (see example in Fig. 3.4). Some other built-in useful transformations are (1) rolling(), which applies rolling window calculations (there are several window types: Gaussian, Hamming, etc.), and (2) expanding(), which accumulates a given operation for all the members of each particular group.

3. Filtration: Discards some groups according to a group-wise criterion. When we apply a function as the filter argument, its output must be Boolean (True or False). The example below groups the DataFrame df2_melt by the category Concentration. The filter method then checks, for each concentration group (50 \(\mu M\), 0.1\(\%\)...), whether the values in the Result column have a higher mean than the overall mean of that column. This is done by iterating over the group mean values of the DataFrame using a lambdaFootnote 18 function, which returns True or False for each group. The printed results are the rows of the groups which were evaluated as True.

figure ap
figure aq
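A sketch of the filter described above:

    # keep only the concentration groups whose mean Result exceeds the overall mean
    df2_melt.groupby("Concentration").filter(
        lambda x: x["Result"].mean() > df2_melt["Result"].mean()
    )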

Here, x stands for each concentration group, from which we compute the mean value of the Result column and compare it with the overall mean. As a result, we find that the concentrations of 0.5\(\%\) of DMSO and 50 \(\mu M\) of Drug have higher mean values than the overall mean.

Fig. 3.4
figure 4

Split-Apply-Combine strategy in Python starting from a tidy data-set df. (1) Split the data based on some criteria, using the df.groupby(). We can then access each of these groups by using the get_group() function. (2) Apply either an aggregation, a transformation, or a filter operation. Aggregations apply an operation to a group giving one value as a result, such as the mean. Transformations apply a function to all the values of a given group. These functions can be built-in, like np.exp() from NumPy, or custom-defined. Filtration applies an operation which returns Boolean indices and, as a result, only the values with true index are shown. Usually the dimensions get reduced from the original size. (3) Combine the results using operations like pd.concat() to concatenate DataFrame, to later export them into CSV or any other table file format

There are some functions which, when applied to a DataFrame, can act as a filter, returning a reduced shape but an unchanged index. For example, when a Series or a DataFrame is extremely long, but we still want to visualize how the data has been organized into columns and rows, the functions head() and tail() come in handy. To view a small sample of a Series or DataFrame object, we can use the DataFrame.head() method to display the first (by default, five) rows and the DataFrame.tail() method to display the last (by default, five) rows.

3.3.5.3 Combine

The function pd.concat([df1, df2], axis) allows us to concatenate DataFrames along a particular axis; an example is shown in Fig. 3.4. Once the data is analyzed, we can combine the results into new DataFrames and export them to any of the available file formats (using the DataFrame to_<fileformat>() methods). In the following example, we combine the results from two transformation methods into one new DataFrame by using the pd.concat() function. Depending on the axis, we can combine the two DataFrames horizontally or vertically.

figure ar

The next step is to combine the results using the concatenation function.

figure as
figure at

We now combine the two results from the Drug treatment and the DMSO treatment, again using concatenation, but in this case along the other dimension: pd.concat(axis=1).

figure au
figure av

As a result, the two DataFrames are concatenated horizontally, but aligned according to their indices. Therefore, in order to reset the index, we use df.reset_index(drop=True) to make the index start from 0 in both DataFrames. To avoid the creation of a new column holding the old index values, we use drop=True.

figure aw
figure ax
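A hedged sketch of this combine step (res_dmso and res_drug stand for the two hypothetical result DataFrames from the transformations above):

    combined_rows = pd.concat([res_dmso, res_drug], axis=0)               # stack vertically
    combined_cols = pd.concat([res_dmso.reset_index(drop=True),
                               res_drug.reset_index(drop=True)], axis=1)  # side by side, indices restarted from 0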

3.4 Python Visualization Landscape

One of the main advantages of using pandas data structures, besides the easy handling of the data, is the creation of plots. The structure and metadata inside a DataFrame can be easily used to create plots. There is a wide range of visualization tools available in Python, which should be selected depending on the particular visualization purpose. In this chapter we will focus on Bokeh and HoloViews, JavaScript-based packages which produce interactive figures in the browser and in the Jupyter Notebook.

3.4.1 JavaScript

JavaScript is a high-level programming language which enables the creation of interactive web pages and is frequently used in web applications. Python has many visualization libraries based on JavaScript in order to take advantage of browser interactivity. Nowadays, having tools which allow easy sharing of data visualizations can be very powerful. To learn more about how to turn raw data into interactive web visualizations using a combination of Python and JavaScript, Dale (2016) is a recommended read.

3.4.1.1 Bokeh

Jupyter Notebook:  NB-3-Bokeh_Plotting.ipynb

Bokeh is a popular interactive data visualization library for Python which makes it easy to share figures. Moreover, Bokeh can handle large and streaming data-sets. To create a figure with Bokeh, the basic steps are the following:

1. Before creating any plot, the first step is to import all the packages and subpackages that will be used to create the figures:

figure ay

2. Prepare the data we want to plot, which can be a NumPy array, Python lists, or a pandas DataFrame, as in this example:

figure az

3. Define where to generate the output, using either output_file() (to save the output to a file) or output_notebook() (to show the output in notebook cells):

figure ba

4. Create a figure() object. This will generate a plot with the default options. We can later customize axis labels, title and tools. In this case, we choose some of the most frequently used plot tools, which are explained in more detail in Fig. 3.5. These tools can be used to zoom in and out of the plot, change range extents, or add, edit and delete graph elements, etc.

figure bb

5. Add a graph, which can be line(), scatter(), vbar(), hbar(), among many others we can choose from. Some more examples are shown in Fig. 3.6. Moreover, we can choose the color and size of the graphs, label sizes, etc.

figure bc

6. Choose whether to show the figure, show(p), or save it, save(p). We cannot generate vector output like PDF (Portable Document Format) or EPS (Encapsulated PostScript), but Bokeh allows us to save in SVG (Scalable Vector Graphics) format.

figure bd
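A minimal end-to-end sketch of these steps (the data, labels and tool selection here are made up and not taken from the notebook):

    import pandas as pd
    from bokeh.plotting import figure, show, output_notebook
    from bokeh.models import ColumnDataSource

    output_notebook()  # render inside the notebook; use output_file("plot.html") to write an HTML file

    df = pd.DataFrame({"Concentration": [1, 2.5, 5, 10],
                       "Result": [2.3, 3.9, 9.5, 13.4]})
    source = ColumnDataSource(df)

    p = figure(title="Titration results",
               x_axis_label="Concentration", y_axis_label="Result",
               tools="pan,box_zoom,wheel_zoom,reset,save")
    p.scatter(x="Concentration", y="Result", source=source, size=10, color="navy")
    show(p)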
Fig. 3.5
figure 5

Example of a Bokeh scatter plot and its tools. Two DataFrames were used to generate this plot. To the left, there are some of the tools we can activate in the figure() section to allow more interactivity. Also, we can have interactive legends which allow you to observe one data-set at a time

Fig. 3.6
figure 6

Examples of different types of Bokeh plots. On the left panel: The DataFrame used to generate the plots. This data was already melted and split up in Sect. 3.3.5. On the right panel: Some other examples; the code to generate them is in NB-3-Bokeh_Plotting.ipynb

The code to generate all of these figures can be found in the Jupyter Notebook: NB-3-Bokeh_Plotting.ipynb.

This generates a scatter plot like the one shown in Fig. 3.5, with the assigned tools. We can add more tools and customize them. Moreover, the plotting parameters can also be adapted, as suitable for a particular figure (visualisation task), but the process always includes all the described basic steps. Some other examples of how to create plots like histograms, box-plots, bar plots and line plots are shown in Fig. 3.6.

Bokeh has great interactive features. It is a high-level library, but it requires all the described steps to generate a figure. HoloViews will make the process of generating a figure even easier. Their philosophy is: "Stop plotting your data—annotate your data and let it visualize itself" (http://holoviews.org).

3.4.1.2 HoloViews

Jupyter Notebook:  NB-4-Holoviews_Plotting.ipynb

HoloViews is an open-source Python library for simple and easy data analysis and visualization. The approach is that each data-set should have an intrinsic way to be used for its visualization. The intention with this library is to produce intelligent visualizations based on how the data is structured. However, one important point to take into account is that the data must be tidy!

HoloViews plots can be rendered using either Matplotlib, Bokeh, or Plotly. To do so, we need to specify an extension, e.g. hv.extension("bokeh") (we will be using Bokeh), after which the plots will be rendered with Bokeh. Next, we generate the figure following the steps below:

1. As before, start by importing the packages needed for creating a figure using HoloViews:

figure be

2. Create the data, in this case two DataFrames. Specify which columns of the DataFrames should be plotted in the figure.

figure bf

3. Choose a type of plot (e.g., scatter, box-plot, histogram, heat-map, etc.).

figure bg

With these steps we created two objects which will then be rendered with Bokeh (because we chose this extension). The next step is to choose the styling options for better visualization, using hv.opts().

figure bh

Finally, we choose how we want to visualize the two plots. There are two types of containers: a layout (HoloViews objects displayed side by side, achieved using "+") or an overlay (HoloViews objects displayed overlaid, with the same axes, achieved using "\(*\)").

figure bi
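A minimal sketch of these steps (the data are made up; Bokeh is used as the rendering backend):

    import pandas as pd
    import holoviews as hv
    hv.extension("bokeh")

    df = pd.DataFrame({"Concentration": [1, 2.5, 5, 10],
                       "Result": [2.3, 3.9, 9.5, 13.4]})

    scatter = hv.Scatter(df, kdims="Concentration", vdims="Result").opts(size=10, color="navy")
    curve = hv.Curve(df, kdims="Concentration", vdims="Result")

    scatter + curve   # layout: side by side; use scatter * curve to overlay them on shared axes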

The output from this scatter plot using HoloViews is shown in Fig. 3.7. As with Bokeh, the plots have, by default, a set of tools which allow more interaction with the data. In this figure there are also some other examples which are explained in more detail in the Jupyter Notebook:

NB-4-Holoviews_Plotting.ipynb.

Fig. 3.7
figure 7

Examples of plots created using HoloViews. The scatter plot corresponds to the step-by-step figure we generated before (Fig. 3.5). The histogram, bar plot, box plot, scatter plot, and scatter-curve-errorbar plot are examples of figures that can be generated using HoloViews (with the Bokeh extension). The plots were generated using a data-set of Covid-19 cases in Ireland in 2020 (https://zenodo.org/record/3901250#.XykEDi17FZI)

Take-Home Message

This chapter provides a guide to using Python as a tool for analyzing and plotting data as the last step of an image analysis pipeline. With the great increase in the number of tools and software to acquire and analyze images, we are able to extract large amounts of data (often in the form of tables). pandas is a powerful tool for importing and handling tabular data in Python. However, we need to invest some time to tidy the data in order to get the most out of it when we perform the analysis and the visualization of the results. If we achieve this, we can create high-level, interactive plots using Bokeh and HoloViews. Tools like Jupyter Notebooks are very powerful for visualization and data sharing. By combining the interactivity of JavaScript-based visualization libraries (like Bokeh and HoloViews) with efficient handling and analysis tools (like pandas), we can build useful data-analysis pipelines which can be easily shared with others.