In this module we’ll be discussing upcasting in NumPy. If you’re unfamiliar with casting, it simply means converting a value from one data type to another. There are two kinds of casting: upcasting and downcasting.
As you can see in the diagram, upcasting moves to a data type with a higher level of precision, while downcasting reduces precision and can cause you to lose data. In NumPy and in Python, upcasting is done implicitly. So, let’s take a quick look at an example. We’ll start with an array containing five integers and then create a second array with five float elements.
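Here is a minimal sketch of those two arrays. The values are hypothetical, since the walkthrough only says "five integers" and "five floats", and the names int_array and float_array are assumed for illustration.

```python
import numpy as np

# Hypothetical values: any five integers and five floats would do.
int_array = np.array([1, 2, 3, 4, 5])
float_array = np.array([1.5, 2.5, 3.5, 4.5, 5.5])

# The integer dtype is the platform default (int64 on most 64-bit systems).
print(int_array.dtype)
# Python floats are stored as float64 by default.
print(float_array.dtype)
```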
Now let’s run this and check the Variable explorer pane.
And as expected, we have one float64 array and one int64 array. Now, if we join the two, NumPy should automatically upcast. So let’s go ahead and use the concatenate function in NumPy.
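The join might look something like this. Again, the array values and the name joined_array are assumptions made for this sketch.

```python
import numpy as np

# Hypothetical values, matching the earlier description.
int_array = np.array([1, 2, 3, 4, 5])
float_array = np.array([1.5, 2.5, 3.5, 4.5, 5.5])

# concatenate takes a sequence of arrays; the result uses a dtype
# that can hold both inputs, which here is float64.
joined_array = np.concatenate((int_array, float_array))

print(joined_array.dtype)
print(joined_array)
```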
And we can run this again.
And so we still have our two arrays maintaining their original data type, but the data type of our new array is float64. And notice all the elements that were once integers have been upcast to floats.
As you may have expected, casting can also be done explicitly, and I’ll show you a couple of different ways of doing that. So let’s comment the first part out and keep the same integer array we started with, and then we’re going to leverage a NumPy array method called astype. And how this method works is that it has you pass in a data type as a required parameter. So we’ll say uint underscore array is equal to int underscore array dot astype, and inside the parens we’ll pass in quotation marks and then uint8 for the unsigned integer data type. One thing to keep in mind is that uint8 is actually a narrower type than int64, so strictly speaking this particular conversion is not an upcast, and it’s only safe when every value fits between 0 and 255.
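As a rough sketch of the explicit conversion, with hypothetical values in int_array. The float64 line is an extra illustration that is not part of the walkthrough, added to show a conversion that widens the type.

```python
import numpy as np

# Hypothetical values; all of them fit in the 0..255 range of uint8.
int_array = np.array([1, 2, 3, 4, 5])

# astype returns a new array and leaves the original untouched.
uint_array = int_array.astype("uint8")

# Illustrative addition: converting to float64 is an explicit upcast.
float_from_int = int_array.astype(np.float64)

print(int_array.dtype, uint_array.dtype, float_from_int.dtype)
```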
And so when we run this we’ll again have two arrays with different data types.
Now, this last method comes with a bit of a caveat, and as we move through the example I’ll explain what I mean. In a previous guide we discussed how to create a matrix by using nested arrays. Here we are going to be doing something similar, with a small change. So, let’s start by creating a new array, and we’re going to call it structured underscore array, then call the array function from the NumPy library and pass in our parens and square brackets like we would normally do for an array. Before, when we nested an array, we would pass in another set of square brackets. But instead we’re going to use regular parentheses in place of the brackets. And this is where our strange little condition happens. By passing in parens in place of square brackets we are nesting a tuple, which NumPy recognizes as a sequence. And so I’m going to create a matrix using three sequences. And each sequence is going to have three fields: the name of a professor, how many students were in their class, and the average class grade. So, I’m going to pass all this in, and when we run this I’m going to call for the dtype and shape as well.
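A sketch of what that might look like follows. The professor names, class sizes and grades are made up for illustration, since the walkthrough does not list the actual values.

```python
import numpy as np

# Hypothetical data: (professor name, class size, average grade).
structured_array = np.array([
    ("Smith", 25, 88.5),
    ("Jones", 30, 91.2),
    ("Brown", 28, 79.9),
])

# With no dtype given, NumPy picks one type for every element,
# so the numbers are converted to strings as well.
# The reported string length can vary with the NumPy version.
print(structured_array.dtype)
print(structured_array.shape)
```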
So, we have a three by three matrix and the data type is ‘U5’, which stands for a Unicode string with a maximum length of five characters. In other words, every element was implicitly upcast to a string and the longest string is five characters long.
We obviously don’t want to keep all our elements as strings, so what we’re going to do is create a structured array by assigning a name and data type to each field. So, we’ll start by saying structured underscore dtypes, and this is going to be equal to a list, and inside the list we’re going to nest three more sequences just like we did before. Now, we’re going to pass in the title of each column and the data type associated with it. The first column was the professor’s name, so we’re going to pass in professor first, followed by a comma and our desired data type. In the second sequence we’re going to pass in class_size for our field title, and let’s use an unsigned integer for this one since we know we can never have a negative or fractional class size. And for our last sequence the field title is average grade, and the data type will be a float. Now we’re going to go back to our original array, add a comma between the closing bracket and the closing parenthesis, and finally pass in dtype equals structured_dtypes.
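Putting that together might look roughly like this. The data values are hypothetical, and the specific type codes ('U5', 'uint8', 'float64') and the field name average_grade are assumptions for this sketch; the walkthrough only names the kinds of types.

```python
import numpy as np

# One (field name, data type) pair per column.
structured_dtypes = [
    ("professor", "U5"),          # assumed: string of up to five characters
    ("class_size", "uint8"),      # assumed: unsigned integer, 0..255
    ("average_grade", "float64"), # assumed: double-precision float
]

# Hypothetical data; each tuple becomes one record.
structured_array = np.array(
    [("Smith", 25, 88.5), ("Jones", 30, 91.2), ("Brown", 28, 79.9)],
    dtype=structured_dtypes,
)

print(structured_array.dtype)
# The result is one-dimensional: three records, each with three fields.
print(structured_array.shape)
```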
And now after we run this we’re going to take a look at the variable explorer pane again.
The data type we have returned for the structured array is ‘void400’ which in NumPy is essentially a flexible data type that allows us to handle fields with different data types. One of the benefits of a structured array is that it allows us to extract information by index or field. So we can pass in the zero index or the professor field depending on our task.
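To make the two access styles concrete, here is a small sketch using the same hypothetical data and assumed field names as before.

```python
import numpy as np

# Assumed field names and type codes; hypothetical data.
structured_dtypes = [
    ("professor", "U5"),
    ("class_size", "uint8"),
    ("average_grade", "float64"),
]
structured_array = np.array(
    [("Smith", 25, 88.5), ("Jones", 30, 91.2), ("Brown", 28, 79.9)],
    dtype=structured_dtypes,
)

# Access by index: returns the whole first record.
first_record = structured_array[0]
print(first_record)

# Access by field: returns that column for every record.
professors = structured_array["professor"]
print(professors)
```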
There’s a good chance that most of you will not be working extensively with structured arrays, as they are meant for interfacing with C code and for low-level manipulation. When working with tabular data, like a CSV file, pandas is still the recommended tool because of its high-level interface.