NumPy#

  • NumPy stands for Numerical Python
  • It is a Python library with tools made for working with multidimensional array objects
  • Important: A NumPy array is different from both a Python list and the array type of the Python standard library (the array module)!
  • The standard library array can handle only one-dimensional data and offers much less functionality
  • In this lecture we are going through the following topics related to NumPy:
    • NumPy arrays
    • Array indexing
    • Array slicing
    • Array shape
  • Before we dive into these topics, it is essential to examine how a NumPy array differs from a regular Python list
    • Python lists are resizable and one list can contain different types of elements
    • NumPy arrays typically have higher performance and take up less space than Python lists
    • NumPy arrays reserve sequential memory slots, whereas the elements of a Python list are separate objects that can be spread around the memory area (random positions in memory)
    • In addition, NumPy arrays have built-in optimized functions (for example linear algebra)
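  • The differences listed above can be illustrated with a minimal sketch; the variable names mixed_list and float_arr are made up for this illustration

```python
import numpy as np

# A Python list may mix element types freely and can grow in place.
mixed_list = [1, 2.5, "three"]
mixed_list.append(4)

# A NumPy array stores a single element type (dtype); mixed numeric
# input is converted to one common type, here a floating-point type.
float_arr = np.array([1, 2.5, 3])

print(mixed_list)        # [1, 2.5, 'three', 4]
print(float_arr.dtype)   # float64
```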

NumPy array vs Python list performance#

  • Let's compare the performance of a NumPy array and a Python list
  • Below is an example where an element-wise addition is timed for both with 1000000 elements
import time
import numpy as np

size_of_vec = 1000000

def python_list():
    t1 = time.time()
    # Build real lists; range() alone returns a lazy range object, not a list
    X = list(range(size_of_vec))
    Y = list(range(size_of_vec))
    Z = [X[i] + Y[i] for i in range(len(X))]
    return time.time() - t1

def numpy_array():
    t1 = time.time()
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y
    return time.time() - t1


count1 = python_list()
count2 = numpy_array()
print("****** Test run with {} elements ******".format(size_of_vec))
print("Python list: {}".format(count1))
print("NumPy array: {}".format(count2))
print("NumPy array is " + str(count1/count2) + " times faster!")
****** Test run with 1000000 elements ******
Python list: 0.25601911544799805
NumPy array: 0.0039997100830078125
NumPy array is 64.00941821649977 times faster!

  • As can be seen from the results, the NumPy array easily outperforms the regular Python list
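  • A single time.time() measurement can be noisy. As a hedged sketch (not the benchmark used in this lecture), the same comparison could be repeated with the standard library timeit module; the size n and the helper names add_lists and add_arrays are assumptions made for this illustration, and the timings depend on the machine

```python
import timeit
import numpy as np

n = 100_000  # assumed size, smaller than above to keep the run short

py_x = list(range(n))
py_y = list(range(n))
np_x = np.arange(n)
np_y = np.arange(n)

def add_lists():
    # Element-wise addition of two Python lists
    return [a + b for a, b in zip(py_x, py_y)]

def add_arrays():
    # Element-wise addition of two NumPy arrays
    return np_x + np_y

# Each measurement runs the function 10 times; the best of 3 repeats
# is kept to reduce noise caused by other processes.
list_time = min(timeit.repeat(add_lists, number=10, repeat=3))
array_time = min(timeit.repeat(add_arrays, number=10, repeat=3))

print("Python list: {:.6f} s".format(list_time))
print("NumPy array: {:.6f} s".format(array_time))
```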

NumPy array vs Python list space consumption#

  • Earlier we showed that NumPy array performance is significantly higher than Python list performance
  • In this section we check how the two compare in the amount of memory they reserve
  • Below we have an illustration of the memory consumption principles of a NumPy array and a Python list
  • As mentioned earlier, a NumPy array reserves sequential memory slots for the whole length of the array

NumPy array memory consumption

  • Since a Python list offers the flexibility to save elements of different types (int, float, str...), it also consumes more memory
  • Important: You may also save different types of elements to a NumPy array (for example with dtype=object), but then it also takes much more space!
  • A list stores references (memory addresses) to its elements, and each element is a separate Python object that carries all the required information (data, type, reference count)
  • The reference count tells how many references currently point to the object
  • Each new list element requires 8 bytes more memory for its reference on a 64-bit system, in addition to the memory of the object itself

Python list memory consumption

import sys

arr = list(range(1000000))

# Python list

# Size of single element (one int object; sys.getsizeof(arr) would only measure the container)
print("Size of single element in Python list: {} bytes".format(sys.getsizeof(arr[-1])))

# Approximate size of whole list (element objects only, without the references stored in the list)
print("Size of whole Python list: {} bytes".format(sys.getsizeof(arr[-1])*len(arr)))


# NumPy array

numpy_arr = np.arange(1000000)

# Size of single element (depends on the dtype, here int32)
print("Size of single element in NumPy array: {} bytes".format(numpy_arr.itemsize))

# Size of whole NumPy array
print("Size of whole NumPy array: {} bytes".format(numpy_arr.size * numpy_arr.itemsize))
Size of single element in Python list: 28 bytes
Size of whole Python list: 28000000 bytes
Size of single element in NumPy array: 4 bytes
Size of whole NumPy array: 4000000 bytes
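
  • As a rough cross-check of the comparison above, the sketch below adds the size of the list container to the sizes of its elements and compares the result with the nbytes attribute of an array. The data set values is hypothetical, the list total is only an estimate (CPython may share small integer objects), and the exact numbers depend on the platform

```python
import sys
import numpy as np

values = list(range(1000))  # hypothetical small data set

# getsizeof on a list counts only the list object and its references,
# so the int objects are added separately for a rough total.
list_total = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

int_arr = np.array(values, dtype=np.int32)
obj_arr = np.array(values, dtype=object)   # stores references to Python objects

print("List (container + elements): {} bytes".format(list_total))
print("int32 array data: {} bytes".format(int_arr.nbytes))   # 4000
print("object array data (references only): {} bytes".format(obj_arr.nbytes))
```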

NumPy array basics#

  • NumPy array data type is ndarray
  • The following attributes of NumPy array are discussed in this section:
    • dtype
    • size
    • itemsize
    • shape
    • ndim
  • Below is an example where we define a simple one-dimensional NumPy array
arr = np.array([1,2,3,4,5,6,7,8,9,10])
arr
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
  • We can also use the arange routine, which returns evenly spaced values, to create the same array as above
  • The following are the useful parameters for arange:
    • start: start of the interval (default 0)
    • stop: stop of the interval
    • step: spacing between values (default 1)
  • The number of arguments passed determines how they are interpreted:
    • 1 parameter = stop
    • 2 parameters = start and stop
    • 3 parameters = start, stop and step
  • Basically we use the arange routine by giving it the starting value (1) and the ending value (11), which is not included in the result
arr2 = np.arange(1,11)
arr2
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
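  • The three ways of calling arange described above can be sketched as follows; the variable names are made up for this illustration

```python
import numpy as np

only_stop = np.arange(5)          # 1 argument: stop
start_stop = np.arange(2, 7)      # 2 arguments: start and stop
with_step = np.arange(1, 11, 3)   # 3 arguments: start, stop and step

print(only_stop)    # [0 1 2 3 4]
print(start_stop)   # [2 3 4 5 6]
print(with_step)    # [ 1  4  7 10]
```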

Data type (dtype)#

  • Now let's check the data type for the array and elements stored in the array
# Data type for the array
print(type(arr))

# Data type for the array elements
print(arr.dtype)
<class 'numpy.ndarray'>
int32

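  • As can be seen from the example above, the array type is ndarray and its elements are of type int32 (32-bit signed integer); the default integer type depends on the platform and NumPy version, so int64 may be shown instead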
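  • The element type can also be chosen explicitly instead of relying on the default. A minimal sketch, with the variable names small and as_float made up for this illustration:

```python
import numpy as np

# Request an 8-bit integer type when creating the array
small = np.array([1, 2, 3], dtype=np.int8)

# astype returns a converted copy; the original array is unchanged
as_float = small.astype(np.float64)

print(small.dtype, small.itemsize)         # int8 1
print(as_float.dtype, as_float.itemsize)   # float64 8
```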

Size of array and its elements#

  • To examine the size of an array and its elements we use
    • size: get the size of NumPy array (the number of elements in the array)
    • itemsize: get the size of a single element in NumPy array in bytes
  • Next we create a simple one-dimensional array with the arange routine and get the measures
arr3 = np.arange(25)
arr3
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])
print("Size of the array: {} items".format(arr3.size))
print("Size of a single element: {} bytes".format(arr3.itemsize))
print("Size of the array in bytes: {} bytes".format(arr3.size * arr3.itemsize))
Size of the array: 25 items
Size of a single element: 4 bytes
Size of the array in bytes: 100 bytes

Shape#

  • The shape of a NumPy array represents the number of elements in each dimension
  • In this section we describe the following:
    • shape: retrieve the array dimensions as a tuple (for a two-dimensional array the format is (x, y), where x is the number of rows and y the number of columns)
    • reshape: return the same data arranged in new dimensions (for a two-dimensional result it takes x,y where x is the row count and y is the column count; the total number of elements must stay the same)
  • Below is an example where we first create a one-dimensional array containing 15 elements and then transform it into a two-dimensional array
# One-dimensional array
numbers = np.arange(15)
numbers
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
  • Now let's return the dimension information
numbers.shape
(15,)
  • As can be seen from the output, the array has a single dimension that contains all 15 elements
  • Let's then transform it into a two-dimensional array with three rows and five columns
# Two-dimensional array (three rows, five columns)
numbers_two = numbers.reshape(3,5)
numbers_two
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
  • Now we can recheck the shape information
numbers_two.shape
(3, 5)
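  • As a small sketch of how reshape behaves, one dimension can be given as -1 so that NumPy infers it from the number of elements, and a shape that does not match the element count raises a ValueError. The name inferred is made up for this illustration

```python
import numpy as np

numbers = np.arange(15)

# -1 lets NumPy work out the missing dimension: 15 elements / 3 rows = 5 columns
inferred = numbers.reshape(3, -1)
print(inferred.shape)   # (3, 5)

# 15 elements cannot fill a 4 x 4 array, so reshape raises an error
try:
    numbers.reshape(4, 4)
except ValueError as err:
    print("Reshape failed:", err)
```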

Number of array dimensions (ndim)#

  • The number of dimensions describes whether the NumPy array has one, two, three or n dimensions
  • To retrieve this information we use the ndim attribute as shown in the example below
# Let's build a two-dimensional array (25 elements in a shape of 5 rows and 5 columns)
new_arr = np.arange(25).reshape(5,5)
new_arr
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
new_arr.ndim
2
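  • For comparison, the sketch below arranges the same 24 elements in one, two and three dimensions; the variable names are made up for this illustration

```python
import numpy as np

one_d = np.arange(24)
two_d = one_d.reshape(4, 6)
three_d = one_d.reshape(2, 3, 4)

# ndim always equals the length of the shape tuple
for a in (one_d, two_d, three_d):
    print(a.ndim, a.shape)
# 1 (24,)
# 2 (4, 6)
# 3 (2, 3, 4)
```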

Mathematical operations#

  • Mathematical operators work in a completely different way with NumPy compared to regular Python lists
  • Below is an example for using a couple of mathematical operators for both
p1 = [1,2,3,4,5]
p2 = [6,7,8,9,10]

result_arr = p1 + p2
result_arr
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  • Adding two lists with + concatenates them into a new list (similar to the extend method, but without modifying the original lists)
  • Now let's run the same for NumPy arrays
n1 = np.array([1,2,3,4,5])
n2 = np.array([6,7,8,9,10])

result_arr = n1 + n2
result_arr
array([ 7,  9, 11, 13, 15])
  • As can be seen from the output, the elements in corresponding index positions were added up
  • Important: If both arrays have more than one element, NumPy requires that they contain the same number of elements, otherwise you receive an error like the one shown below
n3 = np.array([6,7,8,9])

result_arr = n1 + n3
result_arr
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-38-abec1e02326e> in <module>
      1 n3 = np.array([6,7,8,9])
      2 
----> 3 result_arr = n1 + n3
      4 result_arr

ValueError: operands could not be broadcast together with shapes (5,) (4,) 
  • Mathematical operations can also easily be applied to every element of an array with a single value
  • Below is an example where we multiply each array element by 3
result_arr = n1*3
result_arr
array([ 3,  6,  9, 12, 15])
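  • Other operators work element by element in the same way, and arrays of different but compatible shapes are combined by broadcasting. A minimal sketch, with the names grid and row made up for this illustration:

```python
import numpy as np

n1 = np.array([1, 2, 3, 4, 5])
n2 = np.array([6, 7, 8, 9, 10])

print(n2 - n1)   # [5 5 5 5 5]
print(n1 * n2)   # [ 6 14 24 36 50]
print(n1 ** 2)   # [ 1  4  9 16 25]

# Broadcasting: the one-dimensional row is added to every row of the grid
grid = np.arange(6).reshape(2, 3)   # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)
print(grid + row)
# [[10 21 32]
#  [13 24 35]]
```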

Array conditions#

  • Comparison operators can be used to retrieve either boolean information (True, False) for each element or the elements themselves
  • In the example below we check which elements are 10 or greater
new_arr = np.array([1,4,6,9,11,12,15,16,17,21,22,27,30])

# Get boolean results
result = new_arr >= 10
result
array([False, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True])
# Get values as a result
result = new_arr[new_arr >= 10]
result
array([11, 12, 15, 16, 17, 21, 22, 27, 30])
  • When results need to be filtered with multiple conditions, the conditions are combined with & (and) or | (or), each wrapped in parentheses; where can then be used to get the indices of the matching elements
  • Below is an example where all elements between 10 and 20 (inclusive) are filtered to result set
query = np.where((new_arr >= 10) & (new_arr <= 20))
result = new_arr[query]
result
array([11, 12, 15, 16, 17])
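  • The combined condition is itself a boolean array, so as a sketch it can also be used directly as an index without where. The names mask and outside are made up for this illustration

```python
import numpy as np

new_arr = np.array([1, 4, 6, 9, 11, 12, 15, 16, 17, 21, 22, 27, 30])

# Boolean mask: True where the element is between 10 and 20 (inclusive)
mask = (new_arr >= 10) & (new_arr <= 20)
print(new_arr[mask])   # [11 12 15 16 17]
print(mask.sum())      # 5 (True counts as 1)

# | combines conditions with "or": elements outside the range
outside = new_arr[(new_arr < 10) | (new_arr > 20)]
print(outside)         # [ 1  4  6  9 21 22 27 30]
```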

Random numbers with NumPy#

  • A random number can be defined as a number whose value cannot be predicted
  • A random number is usually generated from a predefined range, for example between 10 and 100
  • Python has several ways to generate random numbers, but in this section we focus on generating random numbers with NumPy library
  • Random number generation can be done by using the following functions from NumPy library (we only cover a couple of them):
    • rand(d0,d1,...dn): returns random floats from the interval [0, 1) in a given shape (dimensions)
    • randint(low,high,size,dtype): returns random integers between low (inclusive) and high (exclusive)
    • random(size): returns random floats from the interval [0, 1)
# Generate random floats for five rows and ten columns (two-dimensional array)
num_arr = np.random.rand(5,10)

num_arr
array([[0.455673  , 0.34983164, 0.63719532, 0.62558773, 0.61995063,
        0.51437659, 0.0223511 , 0.83572927, 0.84123927, 0.97012701],
       [0.11513946, 0.32536628, 0.9601556 , 0.29094538, 0.18167436,
        0.18024063, 0.39693819, 0.43211972, 0.06853266, 0.6074172 ],
       [0.60791811, 0.10812248, 0.35966756, 0.5105656 , 0.64312894,
        0.62135376, 0.15636778, 0.68191576, 0.30785389, 0.3472406 ],
       [0.09753922, 0.36049267, 0.18898294, 0.32879129, 0.9529653 ,
        0.16027317, 0.35523737, 0.62981373, 0.43252121, 0.72846181],
       [0.96939488, 0.30253614, 0.59194535, 0.86588633, 0.14817406,
        0.26925622, 0.10871438, 0.41211992, 0.83596592, 0.81172356]])
# Generate ten random integers between 20 and 50 into one-dimensional array
num_arr = np.random.randint(20,50,10)

num_arr
array([42, 20, 48, 43, 26, 45, 36, 20, 27, 48])
# Generate 20 random floats into one-dimensional array
num_arr = np.random.random(20)

num_arr
array([0.96705059, 0.30560622, 0.14575832, 0.38066329, 0.95176933,
       0.76827112, 0.50100748, 0.84181271, 0.84347954, 0.0746048 ,
       0.62008804, 0.23632383, 0.50025907, 0.4426293 , 0.83751148,
       0.21059035, 0.76733982, 0.06196653, 0.45564205, 0.98696303])
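
  • Newer NumPy versions also provide a Generator interface through np.random.default_rng, which is not covered above. As a hedged sketch, a fixed seed makes the generated numbers repeatable; the seed value 42 and the variable names are arbitrary choices for this illustration

```python
import numpy as np

rng = np.random.default_rng(42)          # seeded generator

floats = rng.random((5, 10))             # floats in [0, 1), five rows and ten columns
ints = rng.integers(20, 50, size=10)     # integers from 20 (inclusive) to 50 (exclusive)

# A second generator with the same seed reproduces the same numbers
rng_again = np.random.default_rng(42)
floats_again = rng_again.random((5, 10))

print(floats.shape)                          # (5, 10)
print(np.array_equal(floats, floats_again))  # True
```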

Managing outlier values#

  • An outlier value can be defined as a value that lies far from the other values
  • We use the so-called Z-score to decide whether a value is an outlier or not
  • The Z-score tells us how many standard deviations a value lies from the mean
  • The idea behind standard deviation is presented below (mean value is presented in the middle)

Standard deviations Source: https://en.wikiversity.org/wiki/Normal_distribution

  • Outlier values can appear for the following reasons:
    • Mistake made during the data collection
    • Greater variance in collected data
  • In the upcoming visualization section we will cover the scatter plot, which is a great tool for spotting outlier values
  • In addition to visualization tools we present the Z-score method here
  • For checking outlier values using the Z-score, we first calculate the mean value and the standard deviation for the whole dataset
target_values = [0.32829068, 0.91383474, 0.46572627, 1.00451282, 0.15292166, 0.15431924, 1.593784, 1.39104274, 1.37555429, 1.51761099, 1.27565507, 0.11552234, 0.38350334, 0.5234307, 1.37250195, 0.52892224, 0.71040854, 1.36889979, 1.04287721, 0.8102497, 0.22675369, 0.05733249, 0.16209903, 1.66554992, 1.00455368, 1.02013751, 0.14157334, 0.32883902, 0.48589071, 1.20919122, 1.75412208, 1.01631983, 1.02737197, 4.70497265, 0.91572539, 0.43262401, 0.05860171, 1.60976123, 0.89028701, 0.71612273, 5.0892402, 0.58976794, 1.01826693, 0.47667937, 1.24278374, 1.49998996, 1.00197125, 0.77574254, 0.50276422, 0.33406615, 0.87261939, 0.39189183, 0.91028125, 1.45755418, 1.57665023, 1.20949708, 1.11155048, 0.03730547, 1.21159563, 0.38000365, 0.87691568, 1.29910676, 1.45481045, 0.69755388, 1.25324359, 0.90087526, 1.2329156, 1.72759341, 0.27011248, 0.7556189, 0.0397281, 0.54531398, 1.530289, 0.31091256, 1.53493279, 0.28954487, 0.32347762, 1.42613795, 1.74062006, 0.70645862, 1.68256253, 0.65720741, 0.90250029, 1.50248087, 0.08352546, 1.31547595, 0.52514633, 1.45910302, 1.05784051, 0.60720322, 1.66990581, 0.65655435, 0.28681604, 0.21281041, 0.73920276, 0.45866529, 0.05322463, 1.6974098, 0.92246391, 1.24886213, 0.53357048, 0.20693448, 0.58927456, 0.78173771, 0.08512026, 0.4395528, 0.20247776, 1.38413612, 0.6866888, 0.82242901, 0.25653366, 1.77735088, 0.55459209, 0.06966439, 1.06397743, 1.29705384, 0.19517814, 1.23590563, 1.51629999, 0.77159672, 0.09963431, 0.5644203, 1.12804111, 0.83043601, 1.74946689, 0.73413819, 0.73622908, 0.40462713, 0.98179531, 0.33493697, 1.09067284, 1.02206675, 3.24059011, 1.52932619, 1.54737447, 0.03416107, 1.29228287, 0.97177852, 0.18573451, 0.4058918]

mean = np.mean(target_values)
std = np.std(target_values)
print("Mean value for the values is {} and standard deviation is {}".format(mean,std))
Mean value for the values is 0.9154030034285715 and standard deviation is 0.7219952883477874

  • Since there is a lot of data present, it is quite difficult to spot outlier values by eye
  • Let's set a threshold value in order to spot values which are more than three standard deviations from the mean (for normally distributed data about 99.7 % of all values lie within three standard deviations)
threshold = 3
outliers = []
for i in target_values:
    # abs() also catches values far below the mean
    if abs((i-mean)/std) > threshold:
        outliers.append(i)

print("Outlier values: {}".format(outliers))
Outlier values: [4.70497265, 5.0892402, 3.24059011]

  • As can be seen from the results, three outlier values were found in the dataset
  • These could now easily be filtered out from the dataset by using the threshold value
filtered_values = [i for i in target_values if abs((i-mean)/std) <= threshold]

print(filtered_values)
[0.32829068, 0.91383474, 0.46572627, 1.00451282, 0.15292166, 0.15431924, 1.593784, 1.39104274, 1.37555429, 1.51761099, 1.27565507, 0.11552234, 0.38350334, 0.5234307, 1.37250195, 0.52892224, 0.71040854, 1.36889979, 1.04287721, 0.8102497, 0.22675369, 0.05733249, 0.16209903, 1.66554992, 1.00455368, 1.02013751, 0.14157334, 0.32883902, 0.48589071, 1.20919122, 1.75412208, 1.01631983, 1.02737197, 0.91572539, 0.43262401, 0.05860171, 1.60976123, 0.89028701, 0.71612273, 0.58976794, 1.01826693, 0.47667937, 1.24278374, 1.49998996, 1.00197125, 0.77574254, 0.50276422, 0.33406615, 0.87261939, 0.39189183, 0.91028125, 1.45755418, 1.57665023, 1.20949708, 1.11155048, 0.03730547, 1.21159563, 0.38000365, 0.87691568, 1.29910676, 1.45481045, 0.69755388, 1.25324359, 0.90087526, 1.2329156, 1.72759341, 0.27011248, 0.7556189, 0.0397281, 0.54531398, 1.530289, 0.31091256, 1.53493279, 0.28954487, 0.32347762, 1.42613795, 1.74062006, 0.70645862, 1.68256253, 0.65720741, 0.90250029, 1.50248087, 0.08352546, 1.31547595, 0.52514633, 1.45910302, 1.05784051, 0.60720322, 1.66990581, 0.65655435, 0.28681604, 0.21281041, 0.73920276, 0.45866529, 0.05322463, 1.6974098, 0.92246391, 1.24886213, 0.53357048, 0.20693448, 0.58927456, 0.78173771, 0.08512026, 0.4395528, 0.20247776, 1.38413612, 0.6866888, 0.82242901, 0.25653366, 1.77735088, 0.55459209, 0.06966439, 1.06397743, 1.29705384, 0.19517814, 1.23590563, 1.51629999, 0.77159672, 0.09963431, 0.5644203, 1.12804111, 0.83043601, 1.74946689, 0.73413819, 0.73622908, 0.40462713, 0.98179531, 0.33493697, 1.09067284, 1.02206675, 1.52932619, 1.54737447, 0.03416107, 1.29228287, 0.97177852, 0.18573451, 0.4058918]
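
  • The loop-based Z-score check above can also be sketched with NumPy array operations, where the Z-score of every value is computed in one expression and a boolean mask does the filtering. The data set values below is hypothetical (19 values near 1 and one injected outlier), and the other variable names are made up for this illustration

```python
import numpy as np

# Hypothetical data: 19 values close to 1 and one distant value (9.0)
values = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.1, 0.9, 1.0,
                   1.2, 0.7, 1.4, 1.0, 0.9, 1.1, 1.0, 1.2, 0.8, 9.0])

# Z-score of every element: distance from the mean in standard deviations
z_scores = (values - values.mean()) / values.std()

threshold = 3
outlier_mask = np.abs(z_scores) > threshold

outliers = values[outlier_mask]
filtered = values[~outlier_mask]   # ~ inverts the boolean mask

print("Outlier values: {}".format(outliers))   # Outlier values: [9.]
print("Values kept: {}".format(filtered.size)) # Values kept: 19
```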