Over the next few guides we will be wrapping up the preprocessing section, so I’d like to take some time to dig into the NumPy library and give you a solid understanding of its functionality.
I’d like to start by discussing some of the different data type objects, or dtypes, available to you in the NumPy library. Not including the standard Python data types, I’ve included a list of all the NumPy dtypes found in the SciPy User Guide. As you run through the list you’ll notice that some of the data types, like int_, reference ‘C’. That is because, like many high-level languages, the standard Python interpreter is written in C, and so is much of NumPy. So, a few of the fundamental data types come directly from the C programming language.
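To make the idea of a dtype a little more concrete, here is a minimal sketch of how you might inspect one. The array names are just placeholders for illustration, and the width of the C-derived types depends on your platform, so no output is shown for those.

```python
import numpy as np

# Let NumPy infer a dtype, then request specific ones explicitly.
inferred = np.array([1, 2, 3])                    # platform default integer
explicit = np.array([1, 2, 3], dtype=np.int16)    # dtype given as a NumPy type
as_float = np.array([1, 2, 3], dtype="float32")   # dtype given as a string

for label, arr in [("inferred", inferred), ("explicit", explicit), ("as_float", as_float)]:
    # dtype.name is the readable name; itemsize is the number of bytes per element.
    print(label, arr.dtype.name, arr.dtype.itemsize)

# C-derived types: their width is whatever the platform's C compiler uses.
c_int_bytes = np.dtype(np.intc).itemsize
print("intc bytes:", c_int_bytes)
```

Every NumPy array carries exactly one dtype, which is why all of its elements take up the same number of bytes.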
You may also notice that among the main families in the list (Boolean, integer, unsigned integer, float, and complex), the Boolean is the only one that doesn’t come with a numeric size. The Boolean data type has one of two possible values: True or False. This is intended to represent the two truth values of Boolean algebra. So if you’ve ever taken a logic course, or have done mathematical proofs, you have essentially worked with Booleans.
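As a quick illustration, Boolean arrays most often show up as the result of a comparison. This is only a sketch, and the scores and the pass mark of 60 are made-up values.

```python
import numpy as np

scores = np.array([55, 72, 90, 48])   # hypothetical test scores

# Comparing an array to a value produces an array of Booleans.
passed = scores >= 60
print(passed.dtype)                   # bool
print(passed)                         # [False  True  True False]

# True counts as 1 and False as 0, so summing counts the True values.
n_passed = int(passed.sum())
print(n_passed)                       # 2

# A Boolean array can also be used as a mask to select elements.
passing_scores = scores[passed]
print(passing_scores)                 # [72 90]
```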
Next is the int data type, and as you know by now, it allows us to store a whole number, or integer, in a variable. The 8, 16, 32, or 64 that follows the int is the number of bits used to store each value. Which one you pick depends on the range of values you need and how much memory you’d like to allocate, while the size of the default int_ depends on your platform. Sometimes when choosing an integer data type, less is more. It won’t be an issue in this course, but in general I suggest trying to use the smallest data type that can correctly store and represent your data. Smaller data types are usually faster, because they use less space on the disk, in memory, and in the CPU cache. They also generally require fewer CPU cycles to process.
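To see what choosing a smaller integer type buys you, here is a small sketch that stores the same one million values as int64 and as int8. The data is made up so that every value fits comfortably in int8.

```python
import numpy as np

# One million values between 0 and 99, all small enough for int8.
values = np.arange(1_000_000) % 100

as_int64 = values.astype(np.int64)
as_int8 = values.astype(np.int8)

# nbytes is the memory used by the array's data.
print(as_int64.nbytes)   # 8000000
print(as_int8.nbytes)    # 1000000

# The stored numbers are identical; only the storage size differs.
same_values = np.array_equal(as_int64, as_int8)
print(same_values)       # True
```

The int8 copy takes one eighth of the memory, which is where the savings on disk, in memory, and in the CPU cache come from.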
The uint, or unsigned integer, types are pretty similar to regular integers except they can’t hold negative numbers, only zero and positive values. The main benefit of the uint data types is that, since they don’t need to handle negative numbers, you can possibly allocate a smaller amount of memory. What I mean by this is that if you’re working with a number range from 100 to 200 you could use the uint8 data type, because it covers all the integers from 0 to 255. But if you were to use a signed integer dtype you’d have to choose int16 or higher, because the range for int8 ends at 127. In that example the difference in performance would probably be negligible. But as you move on in your career you’ll be working with massive data sets, and those tenths or hundredths of a second can add up quickly.
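Here is a short sketch of the 100 to 200 example. It uses np.iinfo to look up each type's range, then shows what happens when values that don't fit are forced into int8. The readings array is a made-up example.

```python
import numpy as np

# np.iinfo reports the smallest and largest value an integer dtype can hold.
for dt in (np.int8, np.uint8, np.int16):
    info = np.iinfo(dt)
    print(np.dtype(dt).name, info.min, info.max)

readings = np.array([100, 150, 200])   # hypothetical values in the 100-200 range

# uint8 covers 0 to 255, so every value survives the conversion.
fits = readings.astype(np.uint8)
print(fits)      # [100 150 200]

# int8 stops at 127, so 150 and 200 silently wrap around to negative numbers.
wrapped = readings.astype(np.int8)
print(wrapped)   # [ 100 -106  -56]
```

Notice that the conversion to int8 doesn't raise an error, which is why it's worth checking the range of your data before picking a smaller type.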
We’ve already worked with floats quite a bit, and as you know floats contain two parts: the integer part and the fractional part, which are separated by a decimal point. Similarly to integers and unsigned integers, we have 16, 32, and 64 bit options. And I think the best way to understand the description of each is through a visual representation.
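Float16: sign (bit 15) | exponent (bits 14 to 10) | fraction (bits 9 to 0)
Float32: sign (bit 31) | exponent (bits 30 to 23) | fraction (bits 22 to 0)
Float64: sign (bit 63) | exponent (bits 62 to 52) | fraction (bits 51 to 0)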
So, let’s take a look at what we have and compare that to the SciPy description. Notice first that we’re dealing with a zero-based bit index that works from right to left. So we have either 10, 23, or 52 bits that are dedicated to the fractional portion of our float. If you look in the description, that corresponds exactly with the mantissa, which is just another way of saying fraction, or sometimes you’ll hear it referred to as the significand. To the left of the fraction portion we have 5, 8, or 11 bits for the exponent. The exponent doesn’t hold the integer part of the number; it sets the scale, meaning the power of two that the fraction gets multiplied by. Finally, a single bit is used to determine the positive or negative sign. By default NumPy uses the float64 data type because it offers a higher level of precision, in spite of its performance and bandwidth cost. Now, I almost exclusively stick with using float64 for the simple fact that I would rather lose a little performance than unnecessarily increase my chances of error propagation. Though I would say most of the time having 10 or more decimal digits is overkill, for me it’s more of a safeguard. After all, a slide rule got us to the moon and that was only good to about three significant digits.
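If you’d like to check those bit counts yourself, np.finfo reports them for each float type. This is a minimal sketch; the layout dictionary is just a name I picked to collect the results, and the one-third example is only there to show how precision changes what actually gets stored.

```python
import numpy as np

layout = {}
for dt in (np.float16, np.float32, np.float64):
    info = np.finfo(dt)
    name = np.dtype(dt).name
    # bits = total size, nexp = exponent bits, nmant = fraction (mantissa) bits.
    layout[name] = (info.bits, info.nexp, info.nmant)
    print(name, "bits:", info.bits, "exponent:", info.nexp,
          "fraction:", info.nmant, "approx. decimal digits:", info.precision)

# The same value stored at each precision keeps a different number of digits.
third = 1 / 3
for dt in (np.float16, np.float32, np.float64):
    print(np.dtype(dt).name, repr(float(dt(third))))
```

In each case one sign bit plus the exponent bits plus the fraction bits adds up to the total size of the type.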
The last data type we’ll talk about is the complex dtype, which comes in 64 and 128 bit options. The complex data type is made up of two separate float components: a real part and an imaginary part. Some of you may remember all the way back to your algebra two days, but as a reminder, an imaginary number is a number that gives a negative result when it is squared. While most of you will probably never have to use this data type, it is fairly common when we are trying to describe or predict movement along a wave, such as alternating current or the audio signals in voice recognition applications.
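Here is a small sketch of how the two float components show up in practice. The values are arbitrary, and Python writes the imaginary unit as j.

```python
import numpy as np

# complex64 stores each number as two 32-bit floats.
z = np.array([3 + 4j, 1j], dtype=np.complex64)

print(z.dtype.itemsize)              # 8 bytes: 4 for the real part, 4 for the imaginary part
print(z.real.dtype, z.imag.dtype)    # float32 float32

# Squaring the imaginary unit gives a negative real number.
i_squared = z[1] * z[1]
print(i_squared)

# The magnitude of 3 + 4j is the length of the 3-4-5 right triangle.
magnitude = np.abs(z[0])
print(magnitude)
```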
Well, that wraps up this guide. And I know it probably wasn’t one of the more exciting topics, but having a solid understanding of what types of data you can and will be using will play a major role in your success as a machine learning developer.
| DType | Description |
| --- | --- |
| bool_ | Boolean: True or False |
| int_ | Default integer type: Same as C long, normally either int32 or int64 |
| intc | Identical to C int: Normally int32 or int64 |
| intp | Integer used for indexing: Same as C ssize_t, normally int32 or int64 |
| int8 | Byte: Can store values from -128 to 127 |
| int16 | Integer: Can store values from -32,768 to 32,767 |
| int32 | Integer: Can store values from -2,147,483,648 to 2,147,483,647 |
| int64 | Integer: Can store values from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
| uint8 | Unsigned integer: Can store values from 0 to 255 |
| uint16 | Unsigned integer: Can store values from 0 to 65,535 |
| uint32 | Unsigned integer: Can store values from 0 to 4,294,967,295 |
| uint64 | Unsigned integer: Can store values from 0 to 18,446,744,073,709,551,615 |
| float | Short for float64 |
| float16 | Half precision float: Sign bit, 5 bits exponent, 10 bits mantissa |
| float32 | Single precision float: Sign bit, 8 bits exponent, 23 bits mantissa |
| float64 | Double precision float: Sign bit, 11 bits exponent, 52 bits mantissa |
| complex_ | Short for complex128 |
| complex64 | Complex number: Represented by two 32-bit floats (real and imaginary components) |
| complex128 | Complex number: Represented by two 64-bit floats (real and imaginary components) |