NumPy#

NumPy stands for Numerical Python
It is a Python library that provides tools for working with multidimensional array objects
Important: A NumPy array is different from both a Python list and the array type of the Python standard library (the array module)!
The standard library array can handle only one-dimensional arrays and offers much less functionality
In this lecture we are going through the following topics related to NumPy:
- NumPy arrays
- Array indexing
- Array slicing
- Array shape
Before we dive into these topics, it is essential to examine how a NumPy array differs from a regular Python list
- Python lists are resizable and one list can contain elements of different types
- NumPy arrays have a fixed size and a single element type, which gives them higher performance and a smaller memory footprint than Python lists
- NumPy arrays store their elements in one sequential block of memory, whereas a Python list stores references to objects that can be spread around the memory area (random positions in memory)
- In addition, NumPy arrays come with built-in optimized functions (for example for linear algebra)
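
The differences listed above can be sketched with a few lines of code
The variable names below are made up for this illustration and are not used elsewhere in the lecture

```python
import numpy as np

# A Python list can mix element types and grow in place
mixed_list = [1, 2.5, "three"]
mixed_list.append(4)
print(mixed_list)

# A NumPy array has one dtype that is shared by all of its elements
int_array = np.array([1, 2, 3])
float_array = np.array([1, 2.5, 3])  # the integers are upcast to float
print(int_array.dtype, float_array.dtype)

# "Growing" an array creates a new array, the original keeps its size
longer_array = np.append(int_array, 4)
print(int_array.size, longer_array.size)
```

Note that np.append returns a copy, so repeatedly appending to an array in a loop is usually slower than appending to a list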

NumPy array vs Python list performance#

Let's compare the performance of a NumPy array and a Python list
Below is an example where the same element-wise addition is timed for both with 1000000 elements

import time
import numpy as np

size_of_vec = 1000000

def python_list():
    t1 = time.time()
    X = range(size_of_vec)
    Y = range(size_of_vec)
    Z = [X[i] + Y[i] for i in range(len(X))]
    return time.time() - t1

def numpy_array():
    t1 = time.time()
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y
    return time.time() - t1


count1 = python_list()
count2 = numpy_array()
print("****** Test run with {} elements ******".format(size_of_vec))
print("Python list: {}".format(count1))
print("NumPy array: {}".format(count2))
print("NumPy array is " + str(count1/count2) + " faster!")

****** Test run with 1000000 elements ******
Python list: 0.25601911544799805
NumPy array: 0.0039997100830078125
NumPy array is 64.00941821649977 faster!

As can be seen from the results, the NumPy array easily outperforms the regular Python list
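
A single time.time() measurement can vary a lot between runs, so the comparison can also be sketched with the timeit module from the standard library
This is only an illustrative variant, not the measurement used above: the size n, the helper names add_lists and add_arrays and the repeat counts are assumptions chosen to keep the run short, and the actual timings depend on the machine

```python
import timeit
import numpy as np

n = 100_000  # assumed size, smaller than above to keep the run short

# Build the data outside the timed functions so only the addition is measured
list_x = list(range(n))
list_y = list(range(n))
array_x = np.arange(n)
array_y = np.arange(n)

def add_lists():
    # Element-wise addition with a list comprehension
    return [a + b for a, b in zip(list_x, list_y)]

def add_arrays():
    # Element-wise addition with a vectorized NumPy operation
    return array_x + array_y

# Run each function 10 times per measurement and keep the best of 3 measurements
list_time = min(timeit.repeat(add_lists, number=10, repeat=3))
array_time = min(timeit.repeat(add_arrays, number=10, repeat=3))

print("Python list: {:.6f} s".format(list_time))
print("NumPy array: {:.6f} s".format(array_time))
```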

NumPy array vs Python list space consumption#

Earlier we showed that the performance of a NumPy array is significantly higher than that of a Python list
In this section we check how the two compare in the amount of memory they reserve
Below we have an illustration of the memory consumption principles of a NumPy array and a Python list
As mentioned earlier, a NumPy array reserves one block of sequential memory slots for the length of the array

NumPy array memory consumption

Since a Python list offers the flexibility to store elements of different types (int, float, str...), it also consumes more memory
Important: You can also store elements of different types in a NumPy array (using dtype=object), but then it also takes much more space!
A list stores references to memory addresses, and each referenced object contains all the information it requires (data, type, reference count)
The reference count tells how many references currently point to the object
Each additional list element requires 8 more bytes for its reference (on a 64-bit Python build), on top of the size of the object itself

Python list memory consumption

import sys

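# Note: range() returns a lazy range object, not a list, so sys.getsizeof(arr)
# below measures that range object (48 bytes in the output shown) and serves
# only as a rough per-element estimate
arr = range(1000000)

# Python list

# Estimated size of single element
print("Size of single element in Python list: {} bytes".format(sys.getsizeof(arr)))

# Estimated size of whole list (per-element estimate * number of elements)
print("Size of whole Python list: {} bytes".format(sys.getsizeof(arr)*len(arr)))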


# NumPy array

numpy_arr = np.arange(1000000)

# Size of single element
print("Size of single element in NumPy array: {} bytes".format(numpy_arr.itemsize))

# Size of whole NumPy array
print("Size of whole NumPy array: {} bytes".format(numpy_arr.size * numpy_arr.itemsize))

Size of single element in Python list: 48 bytes
Size of whole Python list: 48000000 bytes
Size of single element in NumPy array: 4 bytes
Size of whole NumPy array: 4000000 bytes
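
The figures above are rough estimates
A somewhat closer measurement for a real list can be sketched by adding the size of the list object (which holds the references) to the sizes of the element objects it refers to
The size n and the variable names are assumptions made for this sketch, and the exact byte counts depend on the Python build and platform, so no output is shown

```python
import sys
import numpy as np

n = 1000  # assumed small size to keep the example quick

# Python list: the list object stores references, the int objects live elsewhere
py_list = list(range(n))
list_container = sys.getsizeof(py_list)
list_elements = sum(sys.getsizeof(x) for x in py_list)
list_total = list_container + list_elements

# NumPy array: the raw data is size * itemsize bytes in one block
np_array = np.arange(n)
array_data = np_array.size * np_array.itemsize

print("Python list (container + elements): {} bytes".format(list_total))
print("NumPy array (data only): {} bytes".format(array_data))
```

The array object itself also has a small fixed overhead that is not included in array_data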

NumPy array basics#

The type of a NumPy array object is ndarray
The following attributes of NumPy array are discussed in this section:
- dtype
- size
- itemsize
- shape
- ndim
Below is an example where we define a simple one-dimensional NumPy array

arr = np.array([1,2,3,4,5,6,7,8,9,10])
arr

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

We can also use the arange routine to create the same array as above with evenly spaced values
The most useful parameters of arange are:
- start: start of the interval (inclusive, default 0)
- stop: end of the interval (exclusive)
- step: spacing between values (default 1)
The number of arguments determines how they are interpreted:
- 1 argument = stop
- 2 arguments = start and stop
- 3 arguments = start, stop and step
Here we call the arange routine with the starting value (1) and the ending value (11), which is not included in the result

arr2 = np.arange(1,11)
arr2

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
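
The one, two and three argument forms of arange can be compared side by side with a short sketch
The concrete values are chosen only for illustration

```python
import numpy as np

# One argument: stop (start defaults to 0, step to 1)
print(np.arange(5))          # [0 1 2 3 4]

# Two arguments: start and stop (stop is excluded)
print(np.arange(1, 11))      # [ 1  2  3  4  5  6  7  8  9 10]

# Three arguments: start, stop and step
print(np.arange(2, 11, 2))   # [ 2  4  6  8 10]

# A negative step counts downwards, stop is still excluded
print(np.arange(10, 0, -2))  # [10  8  6  4  2]
```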

Data type (dtype)#

Now let's check the data type for the array and elements stored in the array

# Data type for the array
print(type(arr))

# Data type for the array elements
print(arr.dtype)

<class 'numpy.ndarray'>
int32

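As can be seen from the example above, the array type is ndarray and its elements are of type int32 (32-bit signed integer); the default integer type depends on the platform and NumPy version, so other systems may show int64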
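
The element type does not have to be left to the default
As a small sketch, the dtype parameter sets it when the array is created and astype converts an existing array (the variable names are made up for this example)

```python
import numpy as np

# Request 64-bit integers explicitly instead of relying on the platform default
ints = np.array([1, 2, 3], dtype=np.int64)
print(ints.dtype)    # int64

# astype returns a converted copy, the original array keeps its dtype
floats = ints.astype(np.float32)
print(floats.dtype)  # float32
print(ints.dtype)    # int64
```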

Size of array and its elements#

To examine the size of an array and of its elements we use
- size: the number of elements in the NumPy array
- itemsize: the size of a single element in bytes
Next we create a simple one-dimensional array with the arange routine and read these attributes

arr3 = np.arange(25)
arr3

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])

print("Size of the array: {} items".format(arr3.size))
print("Size of a single element: {} bytes".format(arr3.itemsize))
print("Size of the array in bytes: {} bytes".format(arr3.size * arr3.itemsize))

Size of the array: 25 items
Size of a single element: 4 bytes
Size of the array in bytes: 100 bytes
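
The product size * itemsize is also available directly as the nbytes attribute
The sketch below fixes the dtype explicitly so that the numbers do not depend on the platform default (the variable names are illustrative)

```python
import numpy as np

# 25 elements stored as 32-bit integers: 4 bytes each
small = np.arange(25, dtype=np.int32)
print(small.size, small.itemsize, small.nbytes)  # 25 4 100

# The same values stored as 64-bit integers take twice the space
large = np.arange(25, dtype=np.int64)
print(large.size, large.itemsize, large.nbytes)  # 25 8 200
```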

Shape#

In a NumPy array, shape gives the number of elements along each dimension
In this section we describe the following:
- shape: retrieve the array dimensions as a tuple (for a two-dimensional array the format is (x, y), where x is the number of rows and y the number of columns)
- reshape: change the dimensions of an existing array without changing its data (for a two-dimensional result it takes x,y where x is the row count and y is the column count)
Below is an example where we first create a one-dimensional array containing 15 elements and then transform it into a two-dimensional array

# One-dimensional array
numbers = np.arange(15)
numbers

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

Now let's return the dimension information

numbers.shape

(15,)

As can be seen from the output, the array has a single dimension that holds all 15 elements
Let's then transform it into a two-dimensional array with three rows and five columns

# Two-dimensional array (three rows, five columns)
numbers_two = numbers.reshape(3,5)
numbers_two

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

Now we can recheck the shape information

numbers_two.shape

(3, 5)
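
Two details of reshape are worth a quick sketch: one dimension can be given as -1 so that NumPy infers it from the number of elements, and a shape that does not match the number of elements raises a ValueError
The variable name five_rows is made up for this example

```python
import numpy as np

numbers = np.arange(15)

# -1 lets NumPy work out the missing dimension: 15 elements / 5 rows = 3 columns
five_rows = numbers.reshape(5, -1)
print(five_rows.shape)  # (5, 3)

# 15 elements cannot fill a 4 x 4 array, so reshape raises an error
try:
    numbers.reshape(4, 4)
except ValueError as error:
    print("reshape failed:", error)
```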

Number of array dimensions (ndim)#

The number of dimensions describes whether the NumPy array has one, two, three or n dimensions
To retrieve this information we use ndim as shown in the example below

# Let's build a two-dimensional array (25 elements in a shape of 5 rows and 5 columns)
new_arr = np.arange(25).reshape(5,5)
new_arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

new_arr.ndim

2
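
As a short sketch of how ndim relates to shape, the same 24 elements can be arranged in one, two or three dimensions
The array names are chosen only for this illustration

```python
import numpy as np

one_d = np.arange(24)
two_d = one_d.reshape(4, 6)
three_d = one_d.reshape(2, 3, 4)

# ndim always equals the number of entries in the shape tuple
for arr in (one_d, two_d, three_d):
    print(arr.shape, arr.ndim)
```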

Mathematical operations#

Mathematical operators work very differently with NumPy arrays than with regular Python lists
Below is an example of using a couple of mathematical operators with both

p1 = [1,2,3,4,5]
p2 = [6,7,8,9,10]

result_arr = p1 + p2
result_arr

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Adding two lists concatenates them, much like the extend method of lists
Now let's run the same operation for NumPy arrays

n1 = np.array([1,2,3,4,5])
n2 = np.array([6,7,8,9,10])

result_arr = n1 + n2
result_arr

array([ 7,  9, 11, 13, 15])

As can be seen from the output, the elements at corresponding index positions were added together
Important: If both arrays have more than one element, NumPy requires that they have the same number of elements (more generally, shapes that can be broadcast together), otherwise you receive an error like the one shown below

n3 = np.array([6,7,8,9])

result_arr = n1 + n3
result_arr

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-38-abec1e02326e> in <module>
      1 n3 = np.array([6,7,8,9])
      2 
----> 3 result_arr = n1 + n3
      4 result_arr

ValueError: operands could not be broadcast together with shapes (5,) (4,)
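
The error message mentions broadcasting, which is the rule NumPy uses to combine arrays of different shapes
As a minimal sketch (the array grid is made up for this example), shapes are compatible when the dimensions match or one of them is 1

```python
import numpy as np

n1 = np.array([1, 2, 3, 4, 5])

# A one-element array is stretched to match the five elements of n1
print(n1 + np.array([10]))  # [11 12 13 14 15]

# A (2, 5) array and a (5,) array are compatible: n1 is added to every row
grid = np.arange(10).reshape(2, 5)
print(grid + n1)
# [[ 1  3  5  7  9]
#  [ 6  8 10 12 14]]
```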

Mathematical operations can easily be applied to every element of an array at once
Below is an example where we multiply each array element by 3

result_arr = n1*3
result_arr

array([ 3,  6,  9, 12, 15])
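
The other arithmetic operators work element by element in the same way
The sketch below reuses the values of n1 and n2 from above and also shows two aggregate methods

```python
import numpy as np

n1 = np.array([1, 2, 3, 4, 5])
n2 = np.array([6, 7, 8, 9, 10])

print(n2 - n1)    # [5 5 5 5 5]
print(n1 * n2)    # [ 6 14 24 36 50]
print(n1 ** 2)    # [ 1  4  9 16 25]

# Aggregates reduce the whole array to a single value
print(n1.sum())   # 15
print(n1.mean())  # 3.0
```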

Array conditions#

Comparison operators can be used to retrieve either a boolean value (True, False) for each element or the matching elements themselves
In the example below we check which elements are 10 or greater

new_arr = np.array([1,4,6,9,11,12,15,16,17,21,22,27,30])

# Get boolean results
result = new_arr >= 10
result

array([False, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True])

# Get values as a result
result = new_arr[new_arr >= 10]
result

array([11, 12, 15, 16, 17, 21, 22, 27, 30])

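When results need to be filtered with multiple conditions, the conditions can be combined with the element-wise operators & (and) and | (or), and where returns the indices of the matching elements
Below is an example where all elements between 10 and 20 (inclusive) are selected into the result set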

query = np.where((new_arr >= 10) & (new_arr <= 20))
result = new_arr[query]
result

array([11, 12, 15, 16, 17])
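
As an alternative sketch, the combined boolean mask can also be used for indexing directly, without going through where
The names mask and outside are made up for this example

```python
import numpy as np

new_arr = np.array([1, 4, 6, 9, 11, 12, 15, 16, 17, 21, 22, 27, 30])

# Each condition needs its own parentheses because & binds tighter than >=
mask = (new_arr >= 10) & (new_arr <= 20)
print(new_arr[mask])                    # [11 12 15 16 17]

# | combines conditions with "or"
outside = new_arr[(new_arr < 10) | (new_arr > 20)]
print(outside)                          # [ 1  4  6  9 21 22 27 30]

# Counting the True values tells how many elements are 10 or greater
print(np.count_nonzero(new_arr >= 10))  # 9
```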

Random numbers with NumPy#

A random number can be defined as a number whose value cannot be predicted
Random numbers are usually generated from a predefined range, for example between 10 and 100
Python has several ways to generate random numbers, but in this section we focus on generating them with the NumPy library
Random numbers can be generated with the following functions from the np.random module (we only cover a few of them):
- rand(d0,d1,...dn): returns random floats from the interval [0, 1) in the given shape (dimensions)
- randint(low,high,size,dtype): returns random integers between low (inclusive) and high (exclusive)
- random(size): returns random floats from the interval [0, 1)

# Generate random floats for five rows and ten columns (two-dimensional array)
num_arr = np.random.rand(5,10)

num_arr

array([[0.455673  , 0.34983164, 0.63719532, 0.62558773, 0.61995063,
        0.51437659, 0.0223511 , 0.83572927, 0.84123927, 0.97012701],
       [0.11513946, 0.32536628, 0.9601556 , 0.29094538, 0.18167436,
        0.18024063, 0.39693819, 0.43211972, 0.06853266, 0.6074172 ],
       [0.60791811, 0.10812248, 0.35966756, 0.5105656 , 0.64312894,
        0.62135376, 0.15636778, 0.68191576, 0.30785389, 0.3472406 ],
       [0.09753922, 0.36049267, 0.18898294, 0.32879129, 0.9529653 ,
        0.16027317, 0.35523737, 0.62981373, 0.43252121, 0.72846181],
       [0.96939488, 0.30253614, 0.59194535, 0.86588633, 0.14817406,
        0.26925622, 0.10871438, 0.41211992, 0.83596592, 0.81172356]])

# Generate ten random integers from 20 (inclusive) to 50 (exclusive) into one-dimensional array
num_arr = np.random.randint(20,50,10)

num_arr

array([42, 20, 48, 43, 26, 45, 36, 20, 27, 48])

# Generate 20 random floats into one-dimensional array
num_arr = np.random.random(20)

num_arr

array([0.96705059, 0.30560622, 0.14575832, 0.38066329, 0.95176933,
       0.76827112, 0.50100748, 0.84181271, 0.84347954, 0.0746048 ,
       0.62008804, 0.23632383, 0.50025907, 0.4426293 , 0.83751148,
       0.21059035, 0.76733982, 0.06196653, 0.45564205, 0.98696303])
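
The functions above produce different numbers on every run
When reproducible results are needed, one option is the Generator interface created with np.random.default_rng (available in NumPy 1.17 and later)
This is a sketch of an alternative, not the approach used in this lecture, and the seed value 42 is an arbitrary assumption

```python
import numpy as np

# A seeded generator produces the same sequence on every run
rng = np.random.default_rng(seed=42)

# Random floats from [0, 1) for five rows and ten columns
floats = rng.random((5, 10))

# Ten random integers from 20 (inclusive) to 50 (exclusive)
integers = rng.integers(20, 50, size=10)

# A second generator with the same seed repeats the same floats
repeat = np.random.default_rng(seed=42).random((5, 10))
print(np.array_equal(floats, repeat))  # True
```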

Managing outlier values#

An outlier can be defined as a value that lies far away from the other values
We use the so-called Z-score to decide whether a value is an outlier or not
The Z-score tells us how many standard deviations a value lies from the mean: z = (value - mean) / standard deviation
The idea behind the standard deviation is presented below (the mean value is in the middle)

Standard deviations Source: https://en.wikiversity.org/wiki/Normal_distribution

Outlier values can appear for the following reasons:
- A mistake made during data collection
- Genuinely large variance in the collected data
In the upcoming visualization section we will cover the scatter plot, which is a great tool for spotting outlier values
In addition to visualization tools, we present the Z-score method here
To check for outlier values using the Z-score, we first compute the mean and the standard deviation of the whole dataset

target_values = [0.32829068, 0.91383474, 0.46572627, 1.00451282, 0.15292166, 0.15431924, 1.593784, 1.39104274, 1.37555429, 1.51761099, 1.27565507, 0.11552234, 0.38350334, 0.5234307, 1.37250195, 0.52892224, 0.71040854, 1.36889979, 1.04287721, 0.8102497, 0.22675369, 0.05733249, 0.16209903, 1.66554992, 1.00455368, 1.02013751, 0.14157334, 0.32883902, 0.48589071, 1.20919122, 1.75412208, 1.01631983, 1.02737197, 4.70497265, 0.91572539, 0.43262401, 0.05860171, 1.60976123, 0.89028701, 0.71612273, 5.0892402, 0.58976794, 1.01826693, 0.47667937, 1.24278374, 1.49998996, 1.00197125, 0.77574254, 0.50276422, 0.33406615, 0.87261939, 0.39189183, 0.91028125, 1.45755418, 1.57665023, 1.20949708, 1.11155048, 0.03730547, 1.21159563, 0.38000365, 0.87691568, 1.29910676, 1.45481045, 0.69755388, 1.25324359, 0.90087526, 1.2329156, 1.72759341, 0.27011248, 0.7556189, 0.0397281, 0.54531398, 1.530289, 0.31091256, 1.53493279, 0.28954487, 0.32347762, 1.42613795, 1.74062006, 0.70645862, 1.68256253, 0.65720741, 0.90250029, 1.50248087, 0.08352546, 1.31547595, 0.52514633, 1.45910302, 1.05784051, 0.60720322, 1.66990581, 0.65655435, 0.28681604, 0.21281041, 0.73920276, 0.45866529, 0.05322463, 1.6974098, 0.92246391, 1.24886213, 0.53357048, 0.20693448, 0.58927456, 0.78173771, 0.08512026, 0.4395528, 0.20247776, 1.38413612, 0.6866888, 0.82242901, 0.25653366, 1.77735088, 0.55459209, 0.06966439, 1.06397743, 1.29705384, 0.19517814, 1.23590563, 1.51629999, 0.77159672, 0.09963431, 0.5644203, 1.12804111, 0.83043601, 1.74946689, 0.73413819, 0.73622908, 0.40462713, 0.98179531, 0.33493697, 1.09067284, 1.02206675, 3.24059011, 1.52932619, 1.54737447, 0.03416107, 1.29228287, 0.97177852, 0.18573451, 0.4058918]

mean = np.mean(target_values)
std = np.std(target_values)
print("Mean value for the values is {} and standard deviation is {}".format(mean,std))

Mean value for the values is 0.9154030034285715 and standard deviation is 0.7219952883477874

Since there is a lot of data, it is quite difficult to spot outlier values by eye
Let's set a threshold in order to spot values that lie more than three standard deviations from the mean (in a normal distribution about 99.7 % of all values fall within three standard deviations)

threshold = 3
outliers = []
for i in target_values:
    # abs() also catches values that lie far below the mean
    if abs(i - mean) / std > threshold:
        outliers.append(i)

print("Outlier values: {}".format(outliers))

Outlier values: [4.70497265, 5.0892402, 3.24059011]

As can be seen from the results, three outlier values were found in the dataset
These can now easily be filtered out of the dataset by using the threshold value

filtered_values = [i for i in target_values if abs(i - mean) / std <= threshold]

print(filtered_values)

[0.32829068, 0.91383474, 0.46572627, 1.00451282, 0.15292166, 0.15431924, 1.593784, 1.39104274, 1.37555429, 1.51761099, 1.27565507, 0.11552234, 0.38350334, 0.5234307, 1.37250195, 0.52892224, 0.71040854, 1.36889979, 1.04287721, 0.8102497, 0.22675369, 0.05733249, 0.16209903, 1.66554992, 1.00455368, 1.02013751, 0.14157334, 0.32883902, 0.48589071, 1.20919122, 1.75412208, 1.01631983, 1.02737197, 0.91572539, 0.43262401, 0.05860171, 1.60976123, 0.89028701, 0.71612273, 0.58976794, 1.01826693, 0.47667937, 1.24278374, 1.49998996, 1.00197125, 0.77574254, 0.50276422, 0.33406615, 0.87261939, 0.39189183, 0.91028125, 1.45755418, 1.57665023, 1.20949708, 1.11155048, 0.03730547, 1.21159563, 0.38000365, 0.87691568, 1.29910676, 1.45481045, 0.69755388, 1.25324359, 0.90087526, 1.2329156, 1.72759341, 0.27011248, 0.7556189, 0.0397281, 0.54531398, 1.530289, 0.31091256, 1.53493279, 0.28954487, 0.32347762, 1.42613795, 1.74062006, 0.70645862, 1.68256253, 0.65720741, 0.90250029, 1.50248087, 0.08352546, 1.31547595, 0.52514633, 1.45910302, 1.05784051, 0.60720322, 1.66990581, 0.65655435, 0.28681604, 0.21281041, 0.73920276, 0.45866529, 0.05322463, 1.6974098, 0.92246391, 1.24886213, 0.53357048, 0.20693448, 0.58927456, 0.78173771, 0.08512026, 0.4395528, 0.20247776, 1.38413612, 0.6866888, 0.82242901, 0.25653366, 1.77735088, 0.55459209, 0.06966439, 1.06397743, 1.29705384, 0.19517814, 1.23590563, 1.51629999, 0.77159672, 0.09963431, 0.5644203, 1.12804111, 0.83043601, 1.74946689, 0.73413819, 0.73622908, 0.40462713, 0.98179531, 0.33493697, 1.09067284, 1.02206675, 1.52932619, 1.54737447, 0.03416107, 1.29228287, 0.97177852, 0.18573451, 0.4058918]
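
The same filtering can also be sketched without an explicit loop by computing all Z-scores at once and using a boolean mask
The small dataset below is made up for this illustration (it is not the target_values list above), and the absolute value is used so that values far below the mean would be caught as well

```python
import numpy as np

# Hypothetical sample: 19 values close to 1.0 and one obvious outlier
values = np.array([0.9, 1.1, 1.0, 0.8, 1.2, 1.0, 0.9, 1.1, 1.0, 1.0,
                   0.95, 1.05, 1.0, 0.9, 1.1, 1.0, 1.0, 0.85, 1.15, 10.0])

threshold = 3

# Z-score of every element: distance from the mean in standard deviations
z_scores = (values - values.mean()) / values.std()

# Boolean mask that marks the outliers
is_outlier = np.abs(z_scores) > threshold

outliers = values[is_outlier]
filtered = values[~is_outlier]

print("Outlier values: {}".format(outliers))
print("Values kept: {}".format(filtered.size))
```

With very small datasets a threshold of three may never be exceeded, because a single extreme value also inflates the standard deviation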