8.2.1 Terminology
Python is a programming language that has become popular for data science and machine learning (Guttag 2013). A Jupyter notebook, identified by the file extension .ipynb, is a document in which you can write and run Python code. It consists of cells, which can contain either Markdown (text) or code. Each cell can be executed independently, and the results of any code executed are “saved” in the kernel’s memory until the kernel is shut down or restarted. Raw data files are often comma-separated values (CSV) files, which store tabular data in plain text. Each record consists of values (numeric or text) separated by commas. To see an example, open the accompanying dataset births.csv in a text editor such as Notepad and examine its contents. You can also open it in Excel for a tabular view.
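To make the structure of a CSV file concrete, here is a minimal sketch that parses a tiny piece of CSV text with Python's built-in csv module. The column names and values are invented for illustration and are not taken from births.csv.

```python
import csv
import io

# Hypothetical CSV text: column names and values are invented placeholders.
csv_text = "name,age\nAlice,30\nBob,25\n"

# csv.reader splits each line (record) into a list of values at the commas.
rows = list(csv.reader(io.StringIO(csv_text)))

header = rows[0]    # the first record holds the column names
records = rows[1:]  # the remaining records hold the data

print(header)   # ['name', 'age']
print(records)  # [['Alice', '30'], ['Bob', '25']]
```

Note that the csv module returns every value as a string, including the numbers. Libraries such as Pandas convert numeric columns for you when loading a file.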
There are many useful libraries or modules in Python which can be imported and called to make our lives easier and more convenient. SciPy is an ecosystem of Python libraries for math and science. The core libraries include NumPy, Pandas and Matplotlib. NumPy (typically imported as np) allows you to work efficiently with data in arrays. Pandas (typically imported as pd) can load CSV data into dataframes, which optimize storage and manipulation of data. Dataframes have useful methods and attributes such as head, shape and merge. The pyplot module (typically imported as plt) in Matplotlib contains useful functions for generating simple plots, e.g. plot, scatter and hist. You will encounter these libraries and their functions in the demo and hands-on exercise later.
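The import conventions described above might look as follows in practice. This is a small sketch only: the array contents, column names and numbers are invented placeholders.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # imported here only to show the usual alias

# NumPy: arithmetic applies to every element of an array without a loop.
values = np.array([1, 2, 3])
doubled = values * 2
print(doubled)      # [2 4 6]

# Pandas: a dataframe holds tabular data with named columns.
# The column names "x" and "y" are hypothetical.
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
print(df.head(2))   # first 2 rows
print(df.shape)     # (3, 2)
```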
8.2.2 Basic Built-in Data Types
The basic built-in data types you should be familiar with in Python are integer, float, Boolean, string and list. Examples of each type are as follows:
Type | Examples
Integer | 7
Float | 7.0
Boolean | True, False
String | 'Hi', "7.0"
List | [], ['Hello', 70, 2.1, True]
Strings can be enclosed by either single or double quotation marks. Lists are collections of items, which can be of different types. They are indicated by square brackets, with items separated by commas. Unlike older programming languages like C, you do not need to declare the types of your variables in Python. The type is inferred from the value assigned to the variable.
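As a quick illustration of type inference, the built-in type() function reports the type Python has inferred for each value. The variable names below are chosen for this sketch only.

```python
count = 7                          # integer
ratio = 7.0                        # float
flag = True                        # Boolean
text = "7.0"                       # string, despite looking like a number
items = ['Hello', 70, 2.1, True]   # list with items of different types

# No type declarations were needed; Python inferred each type from the value.
for value in (count, ratio, flag, text, items):
    print(type(value).__name__)    # prints int, float, bool, str, list in turn

# The same name can later be assigned a value of a different type.
count = "seven"
print(type(count).__name__)        # str
```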
8.2.3 Python Demo
You do not need to be a Python expert in order to use it for machine learning. The best way to learn Python is simply to practice using it on several datasets. In line with this philosophy, let us review the basics of Python by seeing it in action.
Open Anaconda Navigator and launch Jupyter Notebook. In the browser window that opens, navigate to the folder where you have saved the files accompanying this chapter. Click on demo.ipynb. This notebook contains a series of cells, each holding a small snippet of Python code. Clicking the “play” button (or pressing Shift + Enter) will execute the currently selected (highlighted) cell. Run through each cell in this demo one by one and see if you understand what the code means and whether the output matches what you expect. Can you identify the data type of each variable?
In cell 1, the * operator represents multiplication and in cell 2, the == operator represents equality. In cell 3, we create a list of 3 items and assign it to lst with the = operator. Note that when cell 3 is executed, there is no output, but the value of lst is saved in the kernel’s memory. That is why when we index into the first item of lst in cell 4, the kernel already knows about lst and does not throw an error.
Indexing into a list or string is done using square brackets. Unlike some other programming languages, Python is zero-indexed, i.e. counting starts from zero, not one! Therefore, in cells 4 and 5, we use [0] and [1:] to indicate that we want the first item, and the second item onwards, respectively.
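The indexing rules above can be tried out with a short sketch like the following. The contents of lst and word are hypothetical and need not match the notebook.

```python
lst = ['a', 'b', 'c']   # hypothetical list of 3 items

first = lst[0]          # first item (index 0)
rest = lst[1:]          # second item onwards
print(first)            # a
print(rest)             # ['b', 'c']

# Strings are indexed in exactly the same way.
word = "Python"
print(word[0])          # P
print(word[1:3])        # yt  (index 1 up to, but not including, index 3)
print(word[:2])         # Py  (everything before index 2)
```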
In cell 6, we ask for the length of lst with the built-in function len(). In cell 7, we create a loop with the for…in… construct, printing a line for each iteration of the loop with print(). Note that the number ‘5’ is not printed even though we stated range(5), demonstrating again that Python starts counting from zero, not one.
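A minimal sketch of len() and a for loop over range(5) is shown below. The list contents and the printed message are assumptions, not the notebook's exact code.

```python
lst = ['a', 'b', 'c']    # hypothetical list
print(len(lst))          # 3

# range(5) produces the integers 0, 1, 2, 3, 4; the value 5 is excluded.
seen = []
for i in range(5):
    print("Iteration number", i)
    seen.append(i)       # keep a record of the values the loop visited

print(seen)              # [0, 1, 2, 3, 4]
```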
In cell 8, we define our own function add() with the def and return keywords. There is again no output here but the definition of add() is saved once we execute this cell. We then call our function add() in cell 9, giving it two inputs (arguments) 1 and 2, and obtaining an output of 3 as expected.
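A function like the add() described above could be written as follows. The parameter names x and y are assumptions for this sketch.

```python
def add(x, y):
    """Return the sum of the two arguments."""
    return x + y

# Defining the function prints nothing; calling it produces a value.
result = add(1, 2)
print(result)   # 3
```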
In cell 10, we define a more complicated function rate() which, when given a letter grade (as a string), outputs a customized string. We create branches within this function with the if…elif…else construct. One important thing to note here is the use of indentation to indicate nesting of code. Proper indentation is non-negotiable in Python. Code blocks are not indicated by delimiters such as {}, only by indentation. If the indentation is incorrect (for example, if this block of code were written flush left), the kernel would throw an error. In cells 11 and 12, we call our function rate() and check that we obtain the outputs we expect.
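One possible shape for such a branching function is sketched below. The grades checked and the messages returned are hypothetical; the notebook's own rate() may differ.

```python
def rate(grade):
    """Return a message for a letter grade given as a string."""
    # Each branch body is indented one level further than its if/elif/else line.
    if grade == "A":
        return "Excellent!"
    elif grade == "B":
        return "Good job."
    else:
        return "Keep practicing."

print(rate("A"))   # Excellent!
print(rate("C"))   # Keep practicing.
```

Removing the indentation from any of the return lines would cause Python to raise an IndentationError before the code runs.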
Taking a step back, notice how Python syntax is close to plain English. Code readability is important for us to maintain code (imagine coming back 6 months later and realizing you cannot make sense of your own code!) as well as for others to understand our work.
It is not possible (nor necessary) to cover everything about Python in this crash course. Below I have compiled a list of common operators and keywords into a “cheat sheet” for beginners.
Arithmetic | +, -, *, /, %, **, //
Comparison | ==, !=, >, <, >=, <=
Boolean logic | and, or, not
Indexing lists/strings | [n], [n:m], [n:], [:n]
Selection | if, elif, else
Iteration/loop | for, in, range
Create function | def, return
Call function | function(arg1, arg2, …)
Call object’s method or library’s function | object.method(arg1, arg2, …), library.function(arg1, arg2, …)
Get length of list/string | len(…)
Import library | import … as …
Print | print()
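A few of the less familiar operators from the cheat sheet are illustrated below, with numbers chosen arbitrarily for this sketch.

```python
print(7 / 2)     # 3.5   division always gives a float
print(7 // 2)    # 3     floor division discards the fractional part
print(7 % 2)     # 1     remainder after division
print(2 ** 3)    # 8     exponentiation

print(3 != 4)    # True  "not equal to"
print(3 >= 4)    # False

# Boolean operators combine the results of comparisons.
print(3 > 4 or 3 < 4)      # True
print(True and not False)  # True
```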
8.2.4 Python Exercise
You are now ready to practice your Python skills. Open the notebook python.ipynb and give the exercise a shot. In this exercise, we will practice some simple data exploration, which is an important aspect of the data science process before model-building. Try to give your variables descriptive names (e.g. “age”, “gender” are preferable to “a”, “b”). If you are stuck, refer to python_solutions.ipynb for suggested solutions. Read on for more explanations.
In the very first cell, we import the libraries we need (e.g. pandas) and give them short names (e.g. pd) so that we can refer to them easily later. In Q1, we read the dataset into a pandas dataframe births by calling the read_csv() function from pd. Note that the data file births.csv should be in the same folder as the notebook; otherwise you have to specify its location path. births is a dataframe object, so we can call its head method and read its shape attribute (both accessed with the dot notation, as in object.method) to print its first 5 rows and its dimensions. Note that the shape of dataframes is always given as (number of rows, number of columns). In this case, we have 400 rows and 3 columns.
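The loading step can be sketched without the real file by reading CSV text from memory. Everything in the stand-in data is an assumption: the column name maternal_age is a guess, and all six rows of values are invented placeholders rather than rows from births.csv.

```python
import io
import pandas as pd

# Invented stand-in for births.csv (6 placeholder rows, 3 columns).
csv_text = (
    "maternal_age,birth_weight,femur_length\n"
    "19,2900,68\n"
    "24,3100,71\n"
    "31,3300,73\n"
    "36,3000,70\n"
    "28,3200,72\n"
    "40,2800,67\n"
)

# With the real file in the notebook's folder this would be
# pd.read_csv("births.csv") instead.
births_demo = pd.read_csv(io.StringIO(csv_text))

print(births_demo.head())   # head() is a method: first 5 rows by default
print(births_demo.shape)    # shape is an attribute (no brackets): (6, 3)
```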
It is worth spending some time at this juncture to clarify how we index into 2D arrays such as dataframes, since it is something we commonly need to do. The element at the n-th row and the m-th column is indexed as [n, m] (for a pandas dataframe, through its iloc indexer, as in births.iloc[n, m]). Just like lists, you can get multiple array values at a time. Look at the figures below and convince yourself that we can index into the blue elements of each 2D array by the following commands. Remember, Python is zero-indexed.
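As a rough illustration of 2D indexing, the sketch below uses a small NumPy array whose values are arbitrary and chosen so that each number reveals its row and column. The last lines show the equivalent position-based lookup on a dataframe through iloc.

```python
import numpy as np
import pandas as pd

grid = np.array([[ 0,  1,  2,  3],
                 [10, 11, 12, 13],
                 [20, 21, 22, 23]])

print(grid[1, 2])      # 12: row index 1, column index 2
print(grid[0, :])      # [0 1 2 3]: the whole first row
print(grid[:, 1])      # [ 1 11 21]: the whole second column
print(grid[1:, :2])    # rows 1 onwards, first two columns: 10, 11 / 20, 21

# A dataframe is indexed by position through its iloc indexer.
df = pd.DataFrame(grid)
print(df.iloc[1, 2])   # 12
```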
In Q2, we call the mean method to quickly obtain the mean value for each column in births. In Q3, we create 3 copies of the births dataframe: group1, group2 and group3. For each group, we select (filter) the rows we want from births based on maternal age. Note the use of operators to specify the logic. We then use the shape attribute and the mean method again to obtain the number of births and mean birth weight for each group and print() them out.
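The filtering step might look like the sketch below. The column name maternal_age, the age cut-offs of 20 and 35, and all data values are assumptions made for illustration; the exercise may use different ones.

```python
import pandas as pd

# Invented placeholder data, not rows from births.csv.
births_demo = pd.DataFrame({
    "maternal_age": [19, 24, 31, 36, 28, 40],
    "birth_weight": [2900, 3100, 3300, 3000, 3200, 2800],
})

age = births_demo["maternal_age"]

# A comparison on a column gives one True/False per row, which selects rows.
# Row-wise conditions are combined with & (and) or | (or), each in brackets.
group1 = births_demo[age < 20]
group2 = births_demo[(age >= 20) & (age < 35)]
group3 = births_demo[age >= 35]

for name, group in [("group1", group1), ("group2", group2), ("group3", group3)]:
    n_births = group.shape[0]                  # number of rows in the group
    mean_weight = group["birth_weight"].mean()
    print(name, n_births, mean_weight)
```

For data in a dataframe, the keywords and/or from the cheat sheet do not work row by row, which is why the sketch uses & instead.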
In Q4, we call scatter() from the pyplot module (which we have earlier imported as plt) to draw a scatterplot of data from births, specifying birth_weight as the x-axis and femur_length as the y-axis. Note the use of figure() to start an empty figure, xlabel() and ylabel() to specify the axis labels, and show() to display the figure.
The code in Q5 is similar, except that we call scatter() 3 times, using data from group1, group2 and group3 instead of births, and specifying the different colors we want for each group. We use legend() to also include a key explaining the colors and their labels in the figure. If we wanted to add a figure title, we could have done that with title().
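A grouped scatterplot along these lines could be sketched as follows. The data values, the maternal_age column name, the age cut-offs, the colors, and the label and title text are all assumptions for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented placeholder data, not rows from births.csv.
births_demo = pd.DataFrame({
    "maternal_age": [19, 24, 31, 36, 28, 40],
    "birth_weight": [2900, 3100, 3300, 3000, 3200, 2800],
    "femur_length": [68, 71, 73, 70, 72, 67],
})
age = births_demo["maternal_age"]
group1 = births_demo[age < 20]
group2 = births_demo[(age >= 20) & (age < 35)]
group3 = births_demo[age >= 35]

plt.figure()   # start an empty figure
# One scatter() call per group, each with its own color and legend label.
plt.scatter(group1["birth_weight"], group1["femur_length"], color="red", label="group1")
plt.scatter(group2["birth_weight"], group2["femur_length"], color="green", label="group2")
plt.scatter(group3["birth_weight"], group3["femur_length"], color="blue", label="group3")
plt.xlabel("birth_weight")
plt.ylabel("femur_length")
plt.legend()   # key explaining which color belongs to which group
plt.title("Femur length against birth weight")

ax = plt.gca() # handle to the current axes, useful for inspecting the plot
plt.show()     # display the figure
```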