Introduction to NumPy, Matplotlib, and Pandas

Data Analytics of Stock Price Movements with Python

This segment covers the overview of the main third party libraries employed for data analysis.

Keywords

numpy
pandas
matplotlib
data
analysis
data science
visualization

About this video

Author(s): Matthew Macarty
First online: 07 March 2020
DOI: https://doi.org/10.1007/978-1-4842-5647-3_1
Online ISBN: 978-1-4842-5647-3
Publisher: Apress
Copyright information: © Matthew Macarty 2020

Video Transcript

Continuing on with our story, in this video, we’re going to be taking a look at NumPy, Matplotlib, and Pandas, the three main workhorses that we’ll be using throughout this course. It’s not going to be a comprehensive coverage of each. It’s just going to be a short introduction, and then we’ll cover concepts within each of these libraries as needed as we go along and get a little bit more in-depth.

So I’m going to go ahead and jump into a Jupyter Notebook. OK, and then the first thing we’re going to do is set up our environment by importing the libraries that we need. So I’m going to import NumPy, import Pandas, import Matplotlib, and then, finally, I’m going to set the notebook so that it displays the graphs internally.

With that done, we’ll run this cell. And there’s no reason why we have to do this at the beginning, but it is kind of the convention that you put all your imports at the beginning of your notebook.
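The setup cell described here might look like the sketch below. The aliases np, pd, and plt are the usual community conventions and are an assumption, since the transcript does not show the exact code. The notebook magic is left as a comment so the block also runs as plain Python.

```python
# Conventional aliases (assumed; the transcript does not spell them out).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# In a Jupyter notebook, this magic displays graphs inside the notebook:
# %matplotlib inline
```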

OK, so I’ll do a quick overview of NumPy. And much of the power of NumPy comes from its n-dimensional array. So we can create one with np.array, and then whatever I put inside here will become an array. It does require all the data to be of the same type. I’ll just use the range object and make an array with 10 items in it, and then I’ll just call that.

OK, so at first glance, it looks essentially the same as a Python list, but it’s actually a lot more powerful. So one of the main things we like about this is this idea of broadcast calculations. So if I want to, say, square all the values in there, rather than writing a for loop, all I have to do is tell the operation, and it gets applied to every item in the array.
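As a minimal sketch of the array and the calculation just described (the variable names arr and squared are illustrative, not taken from the video):

```python
import numpy as np

# Build an array from a range object; every item shares one type (integers here).
arr = np.array(range(10))
print(arr)       # [0 1 2 3 4 5 6 7 8 9]

# The operation is applied to every item, with no explicit for loop.
squared = arr ** 2
print(squared)   # [ 0  1  4  9 16 25 36 49 64 81]
```

The same expression on a plain Python list would raise a TypeError, which is one practical difference between the two.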

There’s also a number of convenience functions, such as the mean. We can get the min. We can get the standard deviation. And then a lot of times what we want to do is be progressively adding things up. So cumulative sum is a nice function.
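The convenience functions mentioned here are available as array methods. A short sketch, reusing an illustrative ten-item array:

```python
import numpy as np

arr = np.array(range(10))

print(arr.mean())    # 4.5
print(arr.min())     # 0
print(arr.std())     # population standard deviation (ddof=0 is NumPy's default)
print(arr.cumsum())  # running total: [ 0  1  3  6 10 15 21 28 36 45]
```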

All right, and we’re also going to be interested in a module within NumPy, its random module. So I’m just going to create a shortcut to the random module so it makes it easier for me to generate random numbers as needed. And then we’ll demonstrate it.

So here is just a standard random between zero and one. We use this for probability a lot of times. And then a lot of the financial calculations are going to rely on the standard normal distribution. So there’s a value from the standard normal. There’s another one. And if I need to create an array full of standard normal random variates, I just specify how many I need.
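One way to set up that shortcut and draw the values described is sketched below. The alias npr and the specific function choices (rand for the uniform draw, standard_normal for the normal draws) are assumptions; the transcript does not name them.

```python
import numpy.random as npr  # shortcut to NumPy's random module (alias assumed)

u = npr.rand()               # one uniform draw in [0, 1)
z = npr.standard_normal()    # one draw from the standard normal
zs = npr.standard_normal(5)  # an array of five standard normal draws

print(u, z)
print(zs)
```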

So we’ll move on and take a quick look at Matplotlib using some random numbers we generate from NumPy. So I’ll store in this variable x variates from a normal distribution where I specify the mean and the standard deviation, and then I specify how many I want. And then as I do that, I’m going to generate a cumulative sum.

So this array is not going to contain just normal random variates. It’s going to contain a running total of them. And then, to make it sort of look like a stock, I’m going to take x and set it equal to some stock that’s at 100, and I’m going to add to that 100 times whatever’s in x.
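A sketch of that simulated price path follows. The mean of 0, the standard deviation of 0.01, and the 100 periods are assumed values chosen for illustration; the transcript does not state the numbers used in the video.

```python
import numpy.random as npr

# Assumed parameters: mean 0, standard deviation 0.01, 100 periods.
x = npr.normal(0, 0.01, 100).cumsum()  # running total of the random moves

# Shift and scale so the series looks like a stock priced near 100.
x = 100 + 100 * x
print(x[:5])
```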

With that done, we’re ready to go ahead and plot it. So I’ll call the Matplotlib library with my shortcut. And then it’s just plot. So the three main plots we’re going to see in this video are the line plot, scatter plot, and a histogram, and these are the typical graphs you see in primary data analysis. So it’s as simple as that.

So we can see that this stock, which we thought started at 100, actually started around 97.50, and it traveled along randomly and then finished up around 82.50. If I generate another sample here and plot it again, we’re going to get something completely different. So it went up, in this case, and then later it went down.
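The line plot can be sketched as follows, assuming the plt alias and the same illustrative parameters for the random walk (mean 0, standard deviation 0.01, 100 periods). Because no seed is set, each run produces a different path, which matches the behavior described.

```python
import numpy.random as npr
import matplotlib.pyplot as plt

# Simulated price path; the parameters are assumptions for illustration.
x = 100 + 100 * npr.normal(0, 0.01, 100).cumsum()

# With only y-values supplied, plot uses 0, 1, 2, ... for the x-axis.
lines = plt.plot(x)
```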

To generate the scatter plot, I’m going to do something similar. I’m going to generate some x variables, and I’m going to, again, use that NumPy random module, and I’m going to use the standard normal. And I’m going to get 30 of them. We’ll generate some for y. Just get these from the random distribution, which is a uniform distribution from zero to one. I’ll get 30 of those.

And we could graph the data right in the cell, but my preference is to use multiple cells so that I can regraph or rerun data separately. So to get the scatter plot, it’s going to be plt.scatter, and at a minimum we have to pass in an x and a y, and we’ll see what that looks like.

So as you might have expected, if you know a little bit about a scatter plot, it shows a relationship between two variables. If I graph two random variates against each other, we shouldn’t see any kind of relationship there. And that’s what the scatter plot is telling us. If I run this again and if I run it several times, I can probably make it look like there’s a relationship there. But really there shouldn’t be one.
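A minimal sketch of that scatter plot, assuming the npr and plt aliases and using rand for the uniform draws:

```python
import numpy.random as npr
import matplotlib.pyplot as plt

x = npr.standard_normal(30)  # 30 draws from the standard normal
y = npr.rand(30)             # 30 uniform draws in [0, 1)

# x and y are independent, so no pattern should appear beyond chance.
points = plt.scatter(x, y)
```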

There are also parameters we can set with Matplotlib to make the graphs look a little bit better or different. So there’s this alpha parameter that I set somewhere between zero and one, and it sets the transparency of the points being plotted. So if I set it at 0.5, we’ll see the points look a bit more transparent, and really they just show up as sort of a lighter blue. So the white background is starting to show through. So this is used a lot so we can see density along the scatter plot where there are a lot of data points.
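The transparency setting can be sketched like this, with the same assumed aliases and sample sizes as the scatter plot:

```python
import numpy.random as npr
import matplotlib.pyplot as plt

x = npr.standard_normal(30)
y = npr.rand(30)

# alpha=0.5 makes each point half transparent, so overlapping points look darker.
points = plt.scatter(x, y, alpha=0.5)
```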

And then the last graph we’re going to show is the histogram. And I’ll set a new variable here. We’ll call it z. And again, we’ll get in that random module. We’ll sample the standard normal. And we’ll get a million of those. So this is going to demonstrate a couple of things. First of all, how normal is the normal when we sample it? Second of all, how quickly can we generate a million of them? And you saw it just there very quickly.

So for our histogram, it’s plt.hist and then z, and we’ll see what that looks like. So by default, the histogram is going to give us about 10 bins here, so not very descriptive. I’m going to change a couple of parameters here. So I’m going to give it a bins argument and set it to 50.

I’m going to change the color. And there’s lots of options for colors. This is going to be a green. And then I am going to set the edge color so that we can see sort of delineations of each bin of data. And that’s white. So let’s take a look at that.

OK, so you can see, when I do that, now it starts to look a lot more like a normal distribution, a standard bell curve. And we also see a few of the options for making our graphs a little bit more appealing.
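A sketch of the styled histogram follows. The color name "green" is an assumption based on the spoken description; the exact color string used in the video is not shown.

```python
import numpy.random as npr
import matplotlib.pyplot as plt

z = npr.standard_normal(1_000_000)  # a million standard normal draws

# 50 bins instead of the default 10, green bars, white edges between bins.
plt.hist(z, bins=50, color="green", edgecolor="white")
```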

You notice that when I plot this, I get this big data dump here. So this is showing us actually where the bins start and end, all 50 of them. We can suppress that by storing the graph in a variable. So a lot of times, we don’t need all that data being visible in our notebook, and you’ll see us do this a lot. We’ll store a graph in a variable. And this is one of the reasons you do it.
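The data dump is the return value of plt.hist, which a notebook echoes when the call is the last expression in a cell. Assigning it to a variable, as sketched below with illustrative names, keeps the cell output clean while leaving the values available.

```python
import numpy.random as npr
import matplotlib.pyplot as plt

z = npr.standard_normal(1_000_000)

# Assigning the result stops the notebook from echoing it under the graph.
hist_data = plt.hist(z, bins=50, color="green", edgecolor="white")

# The return value holds the bin counts, the bin edges, and the drawn bars.
counts, edges, patches = hist_data
print(len(counts), len(edges))  # 50 bins have 51 edges
```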

And then, finally, we’re going to cover briefly some of Pandas, and this is going to be the main library we’re going to be working with for our data analysis techniques. So to get this started, I’m going to just fill up a NumPy array, again, with data from the standard normal.

And rather than just a single-dimensional array, I’m going to make a two-dimensional array. So I need to use this tuple syntax inside the arguments for the normal. So I’m going to get 100 rows, five columns of data. And once again, I am going to get a cumulative sum of this, and then I’m going to specify along what axis. So axis zero accumulates down the rows, which gives each column its own running total. If I wanted the running total to go across the columns within each row, then I’d use axis one.
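The two-dimensional case can be sketched as below. The mean of 0 and standard deviation of 0.01 are assumptions, chosen so that the later scaling by 100 stays near a price of 100; the names raw and a are illustrative.

```python
import numpy as np
import numpy.random as npr

# The size is passed as a tuple: 100 rows, 5 columns. Parameters are assumed.
raw = npr.normal(0, 0.01, (100, 5))

# axis=0 runs the total down the rows, so each column is accumulated separately.
a = raw.cumsum(axis=0)

print(a.shape)  # (100, 5)
```

With axis=1 the total would instead run across the five columns within each row.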

So our data’s now sitting out there in a NumPy array, and then I’m just going to set a variable for the Pandas data frame. And I’m going to do something similar to what I did before. I’m going to take 100, and I’m going to add to that 100 times whatever’s in a.

OK, so this is demonstrating a couple things. We did it simply with a NumPy array above. We did this broadcast calculation. Here I have 500 values spread across five columns. And with a simple line of code here or a simple expression, I can broadcast calculate all through the array.

I’m going to separate the parameters out on different lines so they’re easier to see. So after the comma, I can go on to a new line. I’ll set an index. And my index, since we’re simulating stock prices, we’re going to use a simulated date. So that’s date range.

And this is a method inside of Pandas. I give it a starting date, and I give it a frequency. So that’s business days, Monday through Friday. And then I give it a number of periods. So the periods are going to have to match the periods or the rows inside our NumPy array.

And then I’m going to give some column names. So when you’re naming columns, it has to be a list-like structure. I’ll just make a list out of the first five letters of the alphabet here. And with that done, we should be ready to make our data frame.
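Putting the pieces together, the data frame construction might look like the sketch below. The start date of 2020-01-01 and the random-walk parameters are assumptions; the transcript gives neither.

```python
import numpy.random as npr
import pandas as pd

# Simulated data: 100 rows, 5 columns (parameters assumed for illustration).
a = npr.normal(0, 0.01, (100, 5)).cumsum(axis=0)

df = pd.DataFrame(
    100 + 100 * a,  # applied to all 500 values at once
    index=pd.date_range("2020-01-01", freq="B", periods=100),  # business days
    columns=list("ABCDE"),  # any list-like of five names works
)
print(df.shape)  # (100, 5)
```

The number of periods has to equal the number of rows in the array; otherwise pandas raises a ValueError.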

And then the nice thing about the data frame is it has methods for easily accessing the first few rows or the last few rows of your data.

So if I want the first few rows, I’m going to use head. If I want to specify a number in here, I can. Otherwise, I’m going to get the first five rows in this situation. So there is our data, and then here’s one with three rows. And then if you want the last few rows of data, we can use tail.
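A short sketch of head and tail, using an illustrative data frame built with an assumed start date and assumed random-walk parameters:

```python
import numpy.random as npr
import pandas as pd

# Illustrative data frame (start date and parameters are assumptions).
df = pd.DataFrame(
    100 + 100 * npr.normal(0, 0.01, (100, 5)).cumsum(axis=0),
    index=pd.date_range("2020-01-01", freq="B", periods=100),
    columns=list("ABCDE"),
)

first_five = df.head()    # first five rows by default
first_three = df.head(3)  # or pass the number of rows wanted
last_five = df.tail()     # last five rows by default
print(first_three)
```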

And then the nice thing about Pandas, along with that broadcast calculation that we have from NumPy, we can subset the data easily. So if I just want the A column, I can use that syntax. You can see the output’s a little different. So when you just get a single column, it turns the object into a different Pandas object, a series. We can also use this syntax to subset a column.

So sometimes they work the same. Sometimes they work a little differently. As long as your column name doesn’t have spaces in it, the first syntax with a dot should work. But it can’t be guaranteed. Pandas is an evolving, immature library. Some things remain constant as they update it. Some things change.

So you shouldn’t be surprised when one day you’re doing something in Pandas and it works, and the next day it doesn’t. There are a few inconsistencies using the different syntaxes. This syntax will work all the time. And then if we want more than one column, I have to use the double list notation.

And then I separate the columns with a comma. And then if I just want the first few rows, I tack head on the end. So you can see, as soon as I get more than one column or more than one row, I’m going to end up with a data frame again.
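The column selections described above can be sketched as follows, again on an illustrative data frame with an assumed start date and parameters:

```python
import numpy.random as npr
import pandas as pd

# Illustrative data frame (start date and parameters are assumptions).
df = pd.DataFrame(
    100 + 100 * npr.normal(0, 0.01, (100, 5)).cumsum(axis=0),
    index=pd.date_range("2020-01-01", freq="B", periods=100),
    columns=list("ABCDE"),
)

col_dot = df.A                  # dot syntax: returns a Series
col_bracket = df["A"]           # bracket syntax: also a Series, works for any name
two_cols = df[["A", "B"]].head()  # a list of names returns a DataFrame

print(type(col_bracket).__name__, type(two_cols).__name__)  # Series DataFrame
```

The bracket form is the safer default because the dot form fails for names with spaces and for names that collide with existing DataFrame attributes.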

So that’s subsetting by columns. We can also subset by rows a couple of different ways. We can use the iloc, which is an index location. And so the index location is going to be integer based, and it’s going to be zero-based indexing. So the first row is going to be zero.

So when we just get a single row, you can see, yep, we got a series again. If we want to get more than one row, I use the double list notation, and then I just start selecting the rows I’m interested in. And as soon as I get more than one row, the output is going to be a data frame.

The other method to subset by row is the loc method, and the loc is going to take a date label itself. So I have to put it in quotation marks. So essentially that is the zero-th row for the index location, and then the location is the actual index label, which is a date in this case.
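Row selection with iloc and loc can be sketched as below. The start date 2020-01-01 is an assumption, so the label passed to loc is specific to this illustrative data frame.

```python
import numpy as np
import numpy.random as npr
import pandas as pd

# Illustrative data frame (start date and parameters are assumptions).
df = pd.DataFrame(
    100 + 100 * npr.normal(0, 0.01, (100, 5)).cumsum(axis=0),
    index=pd.date_range("2020-01-01", freq="B", periods=100),
    columns=list("ABCDE"),
)

first_row = df.iloc[0]          # integer position, zero-based: a Series
some_rows = df.iloc[[0, 1, 2]]  # list of positions: a DataFrame
by_label = df.loc["2020-01-01"]  # index label (a date string): a Series

# Position 0 and the first date label refer to the same row here.
print(np.allclose(first_row, by_label))  # True
```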

So that should be enough to get us going with what we’re going to be doing in the follow-on videos. So next up, we’ll dive a little deeper into probably all these libraries.