NumPy#
- NumPy stands for Numerical Python
- It is a Python library that provides tools for working with multidimensional array objects
- Important: a NumPy array is different from a Python list and from the array type in Python's standard library (the array module)!
- The standard library array can handle only one-dimensional arrays and offers much less functionality
- In this lecture we are going through the following topics related to NumPy:
- NumPy arrays
- Array indexing
- Array slicing
- Array shape
- Before we dive into these topics, it is essential to examine how a NumPy array differs from a regular Python list
- Python lists are resizable and one list can contain different types of elements
- For numerical data, NumPy arrays typically have higher performance and take up less space than Python lists
- NumPy arrays store their elements in one contiguous block of memory, whereas a Python list stores references to objects that can be spread around memory (random positions in memory)
- In addition, NumPy arrays have built-in optimized functions (for example linear algebra)
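The differences listed above can be seen in a few lines. This is a minimal sketch; the variable names `mixed` and `arr` are made up for illustration.

```python
import numpy as np

# A Python list may mix element types and can grow in place
mixed = [1, 2.5, "three"]
mixed.append(4)

# A NumPy array has one data type for all of its elements
arr = np.array([1, 2, 3])

print(mixed)
print(arr.dtype)                  # the single element type shared by all items
print(arr.flags["C_CONTIGUOUS"])  # whether the data sits in one contiguous block
```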
NumPy array vs Python list performance#
- Let's compare the performance of a NumPy array and a Python list
- Below is an example where a performance test is run for both with 1000000 elements
import time
import numpy as np

size_of_vec = 1000000

def python_list():
    # perf_counter is a high-resolution clock meant for measuring short durations
    t1 = time.perf_counter()
    # range() alone is not a list, so convert it explicitly
    X = list(range(size_of_vec))
    Y = list(range(size_of_vec))
    Z = [X[i] + Y[i] for i in range(len(X))]
    return time.perf_counter() - t1

def numpy_array():
    t1 = time.perf_counter()
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y
    return time.perf_counter() - t1

count1 = python_list()
count2 = numpy_array()
print("****** Test run with {} elements ******".format(size_of_vec))
print("Python list: {}".format(count1))
print("NumPy array: {}".format(count2))
print("NumPy array is " + str(count1/count2) + " times faster!")
- As can be seen from the results, the NumPy array easily outperforms the regular Python list
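A single timed run can be noisy. As a hedged alternative, the same comparison can be repeated with the standard library's `timeit` module. The element count `n` and the helper names `add_lists` and `add_arrays` are assumptions chosen for this sketch, and the data is created before timing so that only the addition is measured.

```python
import timeit
import numpy as np

n = 100_000  # smaller than in the test above to keep the repeated runs short
x_list, y_list = list(range(n)), list(range(n))
x_arr, y_arr = np.arange(n), np.arange(n)

def add_lists():
    # Element-wise addition with a plain Python loop
    return [a + b for a, b in zip(x_list, y_list)]

def add_arrays():
    # Element-wise addition done by NumPy in one vectorized step
    return x_arr + y_arr

# Run each function 10 times, repeat that 3 times, and keep the best total
list_time = min(timeit.repeat(add_lists, number=10, repeat=3))
array_time = min(timeit.repeat(add_arrays, number=10, repeat=3))

print("Python list: {:.4f} s".format(list_time))
print("NumPy array: {:.4f} s".format(array_time))
```

The exact numbers depend on the machine, so no particular ratio should be expected.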
NumPy array vs Python list space consumption#
- Earlier we saw that NumPy array performance is significantly higher than Python list performance
- In this section we check how they compare in the amount of memory they reserve
- Below we have an illustration of the memory consumption principles of a NumPy array and a Python list
- As mentioned earlier, an array reserves sequential memory slots for the length of the array
- Since a Python list offers the flexibility to store elements of different types (int, float, str...), it also consumes more memory
- Important: You may also store elements of different types in a NumPy array, but then it takes much more space as well!
- Each list element refers to a memory address, which in turn holds all the information required for the object (data, type, reference count)
- The reference count tells how many references currently point to this object's memory address
- Each new list element requires 8 more bytes of memory in the list itself for its reference (on a typical 64-bit Python build), in addition to the object it refers to
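import sys
# Python list (range() alone is not a list, so convert it explicitly)
arr = list(range(1000000))
# Size of single element (one int object)
print("Size of single element in Python list: {} bytes".format(sys.getsizeof(arr[-1])))
# Size of whole list: the list's own reference storage plus the int objects it refers to
print("Size of whole Python list: {} bytes".format(sys.getsizeof(arr) + sum(sys.getsizeof(i) for i in arr)))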
# NumPy array
numpy_arr = np.arange(1000000)
# Size of single element
print("Size of single element in NumPy array: {} bytes".format(numpy_arr.itemsize))
# Size of whole NumPy array
print("Size of whole NumPy array: {} bytes".format(numpy_arr.size * numpy_arr.itemsize))
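The earlier note about storing different types in a NumPy array can be checked with a small sketch. The names `int_arr` and `mixed_arr` are made up for illustration. When numbers and text are mixed, NumPy converts every element to one common string type, and each element then needs considerably more bytes.

```python
import numpy as np

int_arr = np.arange(5)                   # integers only
mixed_arr = np.array([1, 2.5, "three"])  # converted to one common string type

print(int_arr.dtype, int_arr.itemsize)      # element type and bytes per element
print(mixed_arr.dtype, mixed_arr.itemsize)  # string type with a larger item size
print(mixed_arr)                            # the numbers are now stored as text
```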
NumPy array basics#
- The data type of a NumPy array is ndarray
- The following attributes of NumPy array are discussed in this section:
- dtype
- size
- itemsize
- shape
- ndim
- Below is an example where we define a simple one-dimensional NumPy array
arr = np.array([1,2,3,4,5,6,7,8,9,10])
arr
- We can also use the arange routine, which produces evenly spaced values, to create the same array as above
- The following are the useful parameters for arange:
- start: start of the interval (default 0)
- stop: stop of the interval
- step: spacing between values (default 1)
- The number of parameters given is interpreted as follows:
- 1 parameter = stop
- 2 parameters = start and stop
- 3 parameters = start, stop and step
- Here we call the arange routine with the starting value (1) and the ending value (11), which is not included in the result
arr2 = np.arange(1,11)
arr2
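The three ways of calling arange described above can be compared side by side. This is a minimal sketch with arbitrarily chosen values.

```python
import numpy as np

print(np.arange(5))         # 1 parameter: stop, so values run from 0 to 4
print(np.arange(1, 11))     # 2 parameters: start and stop, so values run from 1 to 10
print(np.arange(1, 11, 2))  # 3 parameters: start, stop and step, so every second value
```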
Data type (dtype)#
- Now let's check the data type for the array and elements stored in the array
# Data type for the array
print(type(arr))
# Data type for the array elements
print(arr.dtype)
- As can be seen from the example above, the array type is ndarray and its elements are integers; the exact integer type depends on the platform and NumPy version (for example int32, a 32-bit signed integer, or int64)
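Because the default integer type can vary, the element type can also be requested explicitly. The sketch below uses made-up names `small` and `as_float` and is only one way to do it.

```python
import numpy as np

small = np.array([1, 2, 3], dtype=np.int32)  # request 32-bit integers explicitly
as_float = small.astype(np.float64)          # astype returns a converted copy

print(small.dtype, small.itemsize)
print(as_float.dtype, as_float.itemsize)
```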
Size of array and its elements#
- To examine the size of an array and its elements we use:
- size: get the size of the NumPy array (the number of elements in the array)
- itemsize: get the size of a single element in the NumPy array (in bytes)
- Next we create a simple one-dimensional array with the arange routine and get the measures
arr3 = np.arange(25)
arr3
print("Size of the array: {} items".format(arr3.size))
print("Size of a single element: {} bytes".format(arr3.itemsize))
print("Size of the array in bytes: {} bytes".format(arr3.size * arr3.itemsize))
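The product of size and itemsize is also available directly through the nbytes attribute. A short sketch:

```python
import numpy as np

arr3 = np.arange(25)

# nbytes reports the bytes used by the array elements
print(arr3.nbytes)
print(arr3.nbytes == arr3.size * arr3.itemsize)
```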
Shape#
- In a NumPy array, shape gives the number of elements in each dimension
- In this section we describe the following:
- shape: retrieve the array dimensions (returned as a tuple; for a two-dimensional array the format is (x, y), where x is the number of rows and y the number of columns)
- reshape: change the dimensions of an existing array (for a two-dimensional result it takes two parameters x, y, where x is the row count and y is the column count)
- Below is an example where we first create a one-dimensional array containing 15 elements and then transform it into a two-dimensional array
# One-dimensional array
numbers = np.arange(15)
numbers
- Now let's return the dimension information
numbers.shape
- As can be seen from the output, the shape is (15,): the array has a single dimension that contains all 15 elements
- Let's then transform it into a two-dimensional array with three rows and five columns
# Two-dimensional array (three rows, five columns)
numbers_two = numbers.reshape(3,5)
numbers_two
- Now we can recheck the shape information
numbers_two.shape
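Two details of reshape are worth a quick sketch: one dimension may be given as -1 so that NumPy works it out, and the new shape must hold exactly the same number of elements. The variable name `auto_cols` is made up for illustration.

```python
import numpy as np

numbers = np.arange(15)

# -1 lets NumPy work out the column count from the number of elements
auto_cols = numbers.reshape(3, -1)
print(auto_cols.shape)

# A shape with a different total number of elements is rejected
try:
    numbers.reshape(4, 4)  # 16 slots cannot hold 15 elements
except ValueError as err:
    print("Reshape failed:", err)
```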
Number of array dimensions (ndim)#
- The number of dimensions describes whether the NumPy array has one, two, three or n dimensions
- To retrieve this information we use ndim, as shown in the example below
# Let's build a two-dimensional array (25 elements in a shape of 5 rows and 5 columns)
new_arr = np.arange(25).reshape(5,5)
new_arr
new_arr.ndim
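To see how ndim relates to shape, the same 24 elements can be arranged in one, two and three dimensions. The names in this sketch are made up for illustration.

```python
import numpy as np

one_d = np.arange(24)
two_d = one_d.reshape(4, 6)
three_d = one_d.reshape(2, 3, 4)

# ndim always equals the number of entries in shape
for a in (one_d, two_d, three_d):
    print(a.ndim, a.shape)
```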
Mathematical operations#
- Mathematical operators work in a completely different way with NumPy arrays than with regular Python lists
- Below is an example of using mathematical operators with both
p1 = [1,2,3,4,5]
p2 = [6,7,8,9,10]
result_arr = p1 + p2
result_arr
- Adding two lists concatenates them, much like the extend method for lists
- Now let's run the same for NumPy arrays
n1 = np.array([1,2,3,4,5])
n2 = np.array([6,7,8,9,10])
result_arr = n1 + n2
result_arr
- As can be seen from the output, the elements in corresponding index positions were added together
- Important: If both one-dimensional arrays have more than one element, NumPy requires that they contain the same number of elements; otherwise you receive an error, as shown below
n3 = np.array([6,7,8,9])
result_arr = n1 + n3
result_arr
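The rule above can be sketched from both sides: an array with a single element is applied to every element of the other array, while a length mismatch raises a ValueError. The name `one` is made up for illustration.

```python
import numpy as np

n1 = np.array([1, 2, 3, 4, 5])
one = np.array([10])

# A single-element array is added to every element of n1
print(n1 + one)

# Five elements against four elements cannot be matched up
try:
    n1 + np.array([6, 7, 8, 9])
except ValueError as err:
    print("Addition failed:", err)
```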
- Mathematical operations can easily be applied to every element of an array
- Below is an example where we multiply each array element by 3
result_arr = n1*3
result_arr
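The other arithmetic operators behave the same way: they work element by element. A brief sketch with the arrays used earlier in this section:

```python
import numpy as np

n1 = np.array([1, 2, 3, 4, 5])
n2 = np.array([6, 7, 8, 9, 10])

print(n2 - n1)   # element-wise subtraction
print(n1 * n2)   # element-wise multiplication
print(n1 ** 2)   # every element squared
print(n2 / n1)   # element-wise division, the result contains floats
```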
Array conditions#
- Comparison operators can be used to retrieve either boolean information (True, False) for each element or the elements themselves
- In the example below we check which elements are 10 or greater
new_arr = np.array([1,4,6,9,11,12,15,16,17,21,22,27,30])
# Get boolean results
result = new_arr >= 10
result
# Get values as a result
result = new_arr[new_arr >= 10]
result
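If only the number of matching elements is needed, the boolean result can be counted directly. This is a minimal sketch; the names `mask` and `count` are made up for illustration.

```python
import numpy as np

new_arr = np.array([1, 4, 6, 9, 11, 12, 15, 16, 17, 21, 22, 27, 30])

mask = new_arr >= 10            # boolean array, True where the condition holds
count = np.count_nonzero(mask)  # True counts as 1, False as 0

print(count)
print(mask.sum())               # summing the booleans gives the same count
```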
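- When results need to be filtered with multiple conditions, the conditions can be combined with & (and) or | (or), each wrapped in parentheses; where then returns the indices of the matching elements
- Below is an example where all elements between 10 and 20 (inclusive) are filtered into the result set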
query = np.where((new_arr >= 10) & (new_arr <= 20))
result = new_arr[query]
result
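As a hedged aside, the combined condition can also be used directly as a boolean mask, and where accepts two extra arguments for choosing a value per element. The names `mask`, `in_range` and `replaced` are made up for this sketch.

```python
import numpy as np

new_arr = np.array([1, 4, 6, 9, 11, 12, 15, 16, 17, 21, 22, 27, 30])

# The combined condition is itself a boolean array
mask = (new_arr >= 10) & (new_arr <= 20)
in_range = new_arr[mask]
print(in_range)

# With three arguments, where keeps the value when the condition holds and uses 0 otherwise
replaced = np.where(mask, new_arr, 0)
print(replaced)
```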
Random numbers with NumPy#
- A random number can be defined as a number whose value cannot be predicted
- Random numbers are usually generated from a predefined range, for example between 10 and 100
- Python has several ways to generate random numbers, but in this section we focus on generating random numbers with NumPy library
- Random number generation can be done by using the following functions from NumPy library (we only cover a couple of them):
- rand(d0,d1,...dn): returns random floats from the interval [0, 1) in the given shape (dimensions)
- randint(low,high,size,dtype): returns random integers between low (inclusive) and high (exclusive)
- random(size): returns random floats from the interval [0, 1)
# Generate random floats for five rows and ten columns (two-dimensional array)
num_arr = np.random.rand(5,10)
num_arr
# Generate ten random integers from 20 (inclusive) to 50 (exclusive) into one-dimensional array
num_arr = np.random.randint(20,50,10)
num_arr
# Generate 20 random floats into one-dimensional array
num_arr = np.random.random(20)
num_arr
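NumPy also offers a newer Generator interface, created with np.random.default_rng, which is not covered in this lecture. The sketch below shows roughly equivalent calls; the seed value 42 is an arbitrary assumption that simply makes the output repeatable.

```python
import numpy as np

rng = np.random.default_rng(42)       # 42 is an arbitrary seed for repeatable results

floats = rng.random((5, 10))          # random floats in [0, 1), five rows and ten columns
ints = rng.integers(20, 50, size=10)  # ten integers from 20 (inclusive) to 50 (exclusive)

print(floats.shape)
print(ints)
```

Creating another generator with the same seed reproduces the same numbers, which is handy when results need to be repeatable.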
Managing outlier values#
- An outlier can be defined as a value that lies far from the other values
- We use the so-called Z-score to decide whether a value is an outlier or not
- The Z-score tells us how many standard deviations a value lies from the mean
- The idea behind standard deviation is presented below (mean value is presented in the middle)
Source: https://en.wikiversity.org/wiki/Normal_distribution
- Outlier values can appear for the following reasons:
- Mistake made during the data collection
- Greater variance in collected data
- In the upcoming visualization section we will cover the scatter plot which is a great tool for spotting outlier values
- In addition to visualization tools we present Z-score method here
- To check for outlier values using the Z-score, we first compute the mean value and the standard deviation for the whole dataset
target_values = [0.32829068, 0.91383474, 0.46572627, 1.00451282, 0.15292166, 0.15431924, 1.593784, 1.39104274, 1.37555429, 1.51761099, 1.27565507, 0.11552234, 0.38350334, 0.5234307, 1.37250195, 0.52892224, 0.71040854, 1.36889979, 1.04287721, 0.8102497, 0.22675369, 0.05733249, 0.16209903, 1.66554992, 1.00455368, 1.02013751, 0.14157334, 0.32883902, 0.48589071, 1.20919122, 1.75412208, 1.01631983, 1.02737197, 4.70497265, 0.91572539, 0.43262401, 0.05860171, 1.60976123, 0.89028701, 0.71612273, 5.0892402, 0.58976794, 1.01826693, 0.47667937, 1.24278374, 1.49998996, 1.00197125, 0.77574254, 0.50276422, 0.33406615, 0.87261939, 0.39189183, 0.91028125, 1.45755418, 1.57665023, 1.20949708, 1.11155048, 0.03730547, 1.21159563, 0.38000365, 0.87691568, 1.29910676, 1.45481045, 0.69755388, 1.25324359, 0.90087526, 1.2329156, 1.72759341, 0.27011248, 0.7556189, 0.0397281, 0.54531398, 1.530289, 0.31091256, 1.53493279, 0.28954487, 0.32347762, 1.42613795, 1.74062006, 0.70645862, 1.68256253, 0.65720741, 0.90250029, 1.50248087, 0.08352546, 1.31547595, 0.52514633, 1.45910302, 1.05784051, 0.60720322, 1.66990581, 0.65655435, 0.28681604, 0.21281041, 0.73920276, 0.45866529, 0.05322463, 1.6974098, 0.92246391, 1.24886213, 0.53357048, 0.20693448, 0.58927456, 0.78173771, 0.08512026, 0.4395528, 0.20247776, 1.38413612, 0.6866888, 0.82242901, 0.25653366, 1.77735088, 0.55459209, 0.06966439, 1.06397743, 1.29705384, 0.19517814, 1.23590563, 1.51629999, 0.77159672, 0.09963431, 0.5644203, 1.12804111, 0.83043601, 1.74946689, 0.73413819, 0.73622908, 0.40462713, 0.98179531, 0.33493697, 1.09067284, 1.02206675, 3.24059011, 1.52932619, 1.54737447, 0.03416107, 1.29228287, 0.97177852, 0.18573451, 0.4058918]
mean = np.mean(target_values)
std = np.std(target_values)
print("Mean value for the values is {} and standard deviation is {}".format(mean,std))
- Since there is a lot of data present, it is quite difficult to spot outlier values by eye
- Let's set a threshold value in order to spot values that lie more than three standard deviations from the mean (in normally distributed data about 99.7 % of all values fall within three standard deviations)
threshold = 3
outliers = []
for i in target_values:
    # Z-score: distance from the mean measured in standard deviations
    if abs(i - mean) / std > threshold:
        outliers.append(i)
print("Outlier values: {}".format(outliers))
- As can be seen from the results, three outlier values were found in the dataset
- These can now easily be filtered out of the dataset by using the threshold value
filtered_values = [i for i in target_values if abs(i - mean) / std <= threshold]
print(filtered_values)
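The same Z-score check can also be written without a Python loop by using NumPy array operations. This is a sketch on a small made-up sample (`values`), not on the dataset above, and it treats values far below the mean as outliers too by taking the absolute Z-score.

```python
import numpy as np

# Made-up sample: twenty ordinary values and one that is clearly distant
values = np.array([1.0] * 20 + [50.0])
threshold = 3

# Z-score of every element in one vectorized step
z_scores = (values - values.mean()) / values.std()

# Boolean mask marking the values more than three standard deviations from the mean
is_outlier = np.abs(z_scores) > threshold

outliers = values[is_outlier]
filtered = values[~is_outlier]  # ~ inverts the mask to keep the non-outliers

print("Outlier values: {}".format(outliers))
print("Remaining values: {}".format(filtered.size))
```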