In this section we will briefly discuss three of the most popular Python libraries used in data processing and data science: NumPy, Matplotlib, and Pandas. All three of these libraries come as pre-installed packages in Anaconda, so we won’t need to go over any installation process.
As we move through this section I’ll give a quick explanation of what each library does, followed by examples to help give you a better understanding of the functionality for each library. So, if you’re confused by an initial description or some of the terminology, just know I’ll be working back through each library in more detail. At the end of this section I’ve also attached links to the User Guide for all three of these libraries. So, if you ever have syntax questions, or simply want to explore the capabilities of these libraries further, these will be a great resource.
NumPy:
Numerical Python, or NumPy, is the core library for scientific computing in Python. It offers a wide range of operations on arrays and matrices, along with a large collection of mathematical functions for working with those arrays. NumPy is also fantastic at letting you process large collections of data in a very efficient manner. Things that would normally take you many lines of code to write are built directly into the library, so you can simply call them and use them in your own programs. You're going to find this incredibly helpful when you start implementing machine learning algorithms, or when you start building out complex APIs that deal with large collections of data.
So, let’s start by launching Spyder, and then go ahead and import the NumPy library and assign it to the alias np.
NumPy uses the array as its primary data type, which is very similar to a list and can be treated in nearly the same way. To start, let's pretend we were given a 16-element list to work with; the list contains the numbers 0 through 15 and is assigned to the variable num_list. We can use NumPy's array function to convert the list to an array, and then store that in the variable num_array. Now, if we weren't given the number list to start with, we can also leverage the arange function, and I'll show you that as well. We'll say num_range equals, and again we'll use the alias np we created to call the arange function; since the stop value is exclusive, passing in 16 gives us the numbers 0 through 15.
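Put together, those two steps might look like the following sketch, using the variable names from the walkthrough:

```python
import numpy as np

# A 16-element list containing the numbers 0 through 15
num_list = list(range(16))

# Convert the list to a NumPy array with the array function
num_array = np.array(num_list)

# Build the same array directly with arange; the stop value (16) is exclusive
num_range = np.arange(16)

print(num_array)
print(num_range)
```

Running this prints two identical arrays, each enclosed in square brackets just like a list.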
Now, we’ll go ahead and run these…
And as you can see, we have two identical arrays that are both enclosed by square brackets, just like a list.
Now, let's say we were asked by a client to take this data and, for whatever reason, they want it broken down into subgroups. NumPy gives us the ability to nest arrays inside a master array. So, if I take num_range and call the reshape function, I can pass in two numbers: four and four. What this gives me, as you can see right here, is a set of four nested arrays, each containing four elements, inside the master array.
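That reshape call might look like this:

```python
import numpy as np

num_range = np.arange(16)

# Reshape the flat 16-element array into four nested arrays
# of four elements each (a 4x4 matrix)
matrix = num_range.reshape(4, 4)
print(matrix)
```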
By nesting arrays into a master array we have essentially created a collection of arrays, which is called a matrix. This is going to be incredibly necessary when it comes to implementing some of the popular machine learning algorithms. They will require you to perform steps like this where you are going to have to create an entire collection of data and then from that point you're going to have to slice that data up into usable components such as small nested array elements exactly like how we have right here.
Matplotlib:
Our second library is Matplotlib and this is a plotting library that helps you create visual representations of your data. Some of the more common visualization forms are histograms, power spectra, bar charts, error charts, and scatter plots.
Now, let's get right into it with an example that highlights the practicality of Matplotlib and how it can be utilized in a classroom to help improve student performance. Let's say you've been hired by a university that is considering incorporating machine learning into its curriculum. But, before they commit millions of dollars, they want you to build a system that gives real-time data to professors, allowing them to dynamically adjust course material. The university wants the class mean to be at least 80% comprehension before the professor moves on to the next topic.
So, what I'd like to do is build out a little system using conditional statements that displays a histogram notifying the professor whether or not more time is needed on the subject matter. Let's start by importing pyplot from the Matplotlib library with the alias plt, and NumPy as np.
Normally, our data would be pulled from an API, but for now we'll create our own 20-element array to use as our input. We'll name the variable api_data, and the elements in the array are 50, 54, 57, 58, 58, 60, 61, 61, 62, 63, 65, 66, 67, 68, 68, 71, 72, 72, 76, and 82. Alright, next we'll set up our conditional statement with an if expression stating that the mean must be greater than or equal to 80; we won't need an expression for the else conditional. Both statements are going to produce histograms, so let's start putting those together by using our plt alias to call the hist function. The first parameter the hist function takes is an array, so we can go ahead and add our api_data variable. The next parameter I'd like to add is range, which will be from 0 to 100. The third parameter we're going to pass in is the number of bins, and that will be 20. I'm going to add a sizing parameter too, which isn't necessary; I just think it makes it easier to distinguish the bins from one another. Finally, I want to add a color parameter to act as a signal to the professor. We can do that by setting color equal to green, for go. Now, I'm just going to copy this into our else conditional and change the color parameter from green to red, to act as the stop signal. That pretty much takes care of everything we need for our histogram, so let's take a look at what we've done so far by using the show function.
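As a sketch, the conditional and the two hist calls might look like this; the exact sizing parameter isn't named in the walkthrough, so rwidth is assumed here as one way to visually separate the bins:

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for data that would normally come from an API:
# 20 student comprehension scores
api_data = np.array([50, 54, 57, 58, 58, 60, 61, 61, 62, 63,
                     65, 66, 67, 68, 68, 71, 72, 72, 76, 82])

mean = np.mean(api_data)

if mean >= 80:
    # Class mean meets the 80% threshold: green signals "go"
    plt.hist(api_data, range=(0, 100), bins=20, rwidth=0.8, color='green')
else:
    # Class mean is below 80%: red signals "stop"
    plt.hist(api_data, range=(0, 100), bins=20, rwidth=0.8, color='red')

plt.show()
```

With this data set the mean works out to 64.55, so the else branch runs and the histogram comes out red.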
So, it looks like everything worked properly, and now we can finish up by adding our title and labels and then doing some fine tuning to the tick marks. These are all pretty straightforward function calls, so we'll start with the title function, which requires the string data type; I also want to add the mean, so I'll have to convert that to a string as well. I'd also prefer the text to be a little bigger and in bold font, so we can set those parameters too by passing in fontsize equal to 16 and fontweight equal to bold. Now, on to labeling the x-axis, where we can just call the xlabel function and add our string, which will be 'Comprehension (%)', plus another fontweight parameter to make our font bold. We're going to do the exact same with the y-axis, but use the string 'Number of Students' instead. The xticks function calls for an array, so we'll call xticks and use NumPy's arange function to set our tick mark range to match our axis range, which is 0 to 100. I also want the tick marks every 10 integers, not 20, so I'll add a step parameter for that as well. Now, let's run our program to see how it looks.
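Those labeling calls might look like the sketch below. The exact title text isn't specified in the walkthrough, so the 'Class Mean: ' prefix is an assumption, as is the rwidth sizing parameter:

```python
import matplotlib.pyplot as plt
import numpy as np

api_data = np.array([50, 54, 57, 58, 58, 60, 61, 61, 62, 63,
                     65, 66, 67, 68, 68, 71, 72, 72, 76, 82])
mean = np.mean(api_data)

plt.hist(api_data, range=(0, 100), bins=20, rwidth=0.8,
         color='green' if mean >= 80 else 'red')

# Title text is assumed; the mean must be converted to a string
plt.title('Class Mean: ' + str(mean), fontsize=16, fontweight='bold')

# Axis labels in bold
plt.xlabel('Comprehension (%)', fontweight='bold')
plt.ylabel('Number of Students', fontweight='bold')

# Tick marks every 10 units from 0 to 100
# (arange's stop is exclusive, so pass 101 to include 100)
plt.xticks(np.arange(0, 101, step=10))

plt.show()
```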
To finish up, let's go ahead and comment this variable out, and I'm just going to create another api_data variable right below it with a new list of elements. In this one we'll use 75, 80, 82, 78, 69, 94, 87, 90, 72, 81, 75, 77, 100, 86, 88, 82, 81, 76, 83, and 74. And when we run it, everything comes out perfectly.
Pandas:
The last library we'll take a look at is Pandas, which was built on top of NumPy and gives us an easy way to manipulate, examine, clean up, and merge tabular data, along with a bunch of other functions. You'll find yourself using Pandas quite often in any data-driven field, because in most situations you won't need an entire data frame. While it's nice to have a large and diverse collection of data, not all of it will be relevant to what you're working on, and as we move into algorithms you'll get a better understanding of what I mean.
Let’s start by importing our Pandas library and create an alias named pd.
Now, a lot of times you'll see the variable df when working with data, and this is short for data frame. So, to follow common practice, we'll create our df variable and assign it to the data frame we'll be importing.
Oftentimes, when you're unfamiliar with the data you're importing, it helps to take a quick look at the shape, size, and organizational structure so you can get a feel for what you're about to work with. Using our data frame, we can call size to return an integer representing the number of elements, and shape to return a tuple representing the dimensionality, with the first element of the tuple representing the number of rows and the second the number of columns. Finally, if we say df dot columns, we'll get an array with the column labels. Also, you don't have to assign these to their own variables for them to work; you can simply enter them in the console.
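The course's actual data frame isn't included here, so the sketch below builds a small hypothetical stand-in with the same seven columns to show how size, shape, and columns behave. Note that these are attributes, not functions, so no parentheses are needed:

```python
import pandas as pd

# Hypothetical stand-in for the imported data frame
# (the real one has 48 rows; the values here are invented)
df = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020],
    'semester': ['Fall', 'Spring', 'Fall', 'Spring'],
    'professor': ['Smith', 'Jones', 'Smith', 'Lee'],
    'course': ['CS101', 'CS102', 'CS101', 'CS201'],
    'course_title': ['Intro to CS', 'Data Structures', 'Intro to CS', 'Algorithms'],
    'average_grade': [81.2, 77.5, 84.0, 79.3],
    'dynamic_learning': ['No', 'Yes', 'Yes', 'No'],
})

print(df.size)     # total number of elements: rows * columns
print(df.shape)    # (rows, columns) tuple
print(df.columns)  # column labels
```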
As you can see our data frame has 48 rows, 7 columns, and totals 336 elements with the columns entitled year, semester, professor, course, course_title, average_grade, and dynamic_learning.
Now, if we want to take a look at the actual data but don't want to return the entire data frame, there are a few functions that can handle that. If we want the first five rows of data returned, we can leverage the head function; the default is 5 rows, so we don't need to pass anything in. And let's say we also want the last ten rows. We can do this by calling the tail function and passing in 10.
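On a hypothetical 48-row stand-in (the real course data isn't included here), head and tail would be used like this:

```python
import pandas as pd

# Hypothetical 48-row stand-in for the course data frame
df = pd.DataFrame({
    'average_grade': range(60, 108),
    'dynamic_learning': ['No', 'Yes'] * 24,
})

print(df.head())     # first 5 rows (5 is the default)
print(df.tail(10))   # last 10 rows
```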
When we run this, what's returned are the 7 column titles, along with the first 5 and last 10 corresponding rows, and if you didn't notice already, on the far left you'll see the index.
As I said earlier, more often than not you won't be concerned with every element in your data frame, so many times you'll do some sort of data manipulation to extract only what you're interested in. If we only want to look at the first four rows from the average_grade and dynamic_learning columns, we can use 'loc'. This will allow us to filter rows and select desired columns by label. Okay, since loc is an indexer on the data frame, we'll say df dot loc and pass in our square brackets. The format calls for the rows first, so we'll pass in our index range from 0 to 3 (with loc, both endpoints are included), then add our comma, followed by the list of column titles including quotation marks.
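On a small hypothetical stand-in data frame, that loc call would look like this:

```python
import pandas as pd

# Hypothetical stand-in data (the real data frame has 48 rows)
df = pd.DataFrame({
    'course': ['CS101', 'CS102', 'CS201', 'CS202', 'CS301'],
    'average_grade': [81.2, 77.5, 84.0, 79.3, 88.1],
    'dynamic_learning': ['No', 'Yes', 'Yes', 'No', 'Yes'],
})

# loc selects by label: rows first, then a list of column labels.
# Unlike ordinary Python slicing, the end label (3) is included,
# so this returns four rows.
subset = df.loc[0:3, ['average_grade', 'dynamic_learning']]
print(subset)
```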
When we run this we get back exactly what we wanted.
The last thing I'd like to do is break our original data frame into two smaller data frames. The first will include all the rows and columns where dynamic learning wasn't used, and the second will include all the rows and columns where it was. We'll assign our first data frame to the variable no_dynamic_learning and the second to yes_dynamic_learning. We're going to use 'loc' again, but this time we're going to pass in a boolean condition. We'll start the same way as before, with df.loc followed by square brackets. Next, we need to pass in the rows we want to include, but instead of an index range we're going to leverage a boolean expression saying we only want the rows that have a 'No' in the dynamic_learning column. We still want all the columns, so we'll finish by passing in a colon. We'll do the exact same thing for the second data frame, but replace the 'No' with 'Yes'.
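Using a small hypothetical stand-in data frame, the boolean filtering would look like this:

```python
import pandas as pd

# Hypothetical stand-in data (the real data frame has 48 rows)
df = pd.DataFrame({
    'course': ['CS101', 'CS102', 'CS201', 'CS202'],
    'average_grade': [81.2, 77.5, 84.0, 79.3],
    'dynamic_learning': ['No', 'Yes', 'Yes', 'No'],
})

# The boolean expression selects the rows; the colon keeps every column
no_dynamic_learning = df.loc[df['dynamic_learning'] == 'No', :]
yes_dynamic_learning = df.loc[df['dynamic_learning'] == 'Yes', :]

print(no_dynamic_learning)
print(yes_dynamic_learning)
```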
When we run both of these, we see that we have two separate data frames: one without dynamic learning, and the other containing all the rows and columns with dynamic learning.
So, those are just a few of the basic functions available to you in the Pandas library. As you move through this course, I highly encourage you to take some time to go through all three user guides to gain a more in-depth understanding of their functionality and how to take advantage of what these libraries do best.
Anaconda Package List
NumPy User Guide
Matplotlib User Guide
Pandas User Guide
FINAL CODE:
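The original code attachment isn't included here, so below is a hedged reconstruction of the Matplotlib program as described in the walkthrough. The title text ('Class Mean: ') and the bin-sizing parameter (rwidth) are assumptions, since neither is specified exactly:

```python
import matplotlib.pyplot as plt
import numpy as np

# Original (below-threshold) data set, commented out as in the walkthrough
# api_data = np.array([50, 54, 57, 58, 58, 60, 61, 61, 62, 63,
#                      65, 66, 67, 68, 68, 71, 72, 72, 76, 82])

# Replacement data set with a mean above the 80% threshold
api_data = np.array([75, 80, 82, 78, 69, 94, 87, 90, 72, 81,
                     75, 77, 100, 86, 88, 82, 81, 76, 83, 74])

mean = np.mean(api_data)

if mean >= 80:
    # Green signals "go": the class is ready to move on
    plt.hist(api_data, range=(0, 100), bins=20, rwidth=0.8, color='green')
else:
    # Red signals "stop": more time is needed on the topic
    plt.hist(api_data, range=(0, 100), bins=20, rwidth=0.8, color='red')

plt.title('Class Mean: ' + str(mean), fontsize=16, fontweight='bold')
plt.xlabel('Comprehension (%)', fontweight='bold')
plt.ylabel('Number of Students', fontweight='bold')
plt.xticks(np.arange(0, 101, step=10))
plt.show()
```

With the replacement data the mean is 81.5, so the histogram comes out green.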