Pandas part 2#

Data grouping#

  • Data grouping means combining results for target columns in DataFrame
  • This functionality is similar to database languages (for example SQL languages)
  • Below is an example where the average value (mean) of two player's scores in three round of disc golf are calculated and grouped
import pandas as pd

# DataFrame creation
disc_golf_data = pd.DataFrame({
    "name": ["Mike","Mike","Mike","Steve","Steve","Steve"],
    "hole_1": [3,2,3,2,2,2],
    "hole_2": [2,2,2,3,2,2],
    "hole_3": [3,4,5,5,4,4],
    "hole_4": [3,3,5,3,3,4],
    "hole_5": [5,6,4,5,3,4]
})

# Add total score column at the end
disc_golf_data["score"] = disc_golf_data.iloc[:,1:].sum(axis=1)

# Group the data by player name and calculate the mean value
disc_golf_data.groupby("name").mean()
hole_1 hole_2 hole_3 hole_4 hole_5 score
name
Mike 2.666667 2.000000 4.000000 3.666667 5.0 17.333333
Steve 2.000000 2.333333 4.333333 3.333333 4.0 16.000000

Combining DataFrames#

  • Usually we need to combine data from several sources (files, url) into DataFrames
  • Combining data into a single DataFrames gives us the ability to analyze it easier
  • concat() function can be used to add either columns or rows from one DataFrame to another
  • Below is an example where two DataFrames are combined together
# Initialize DataFrames
first = pd.DataFrame({
    "fruit": ["Banana","Apple","Mango","Kiwi","Pear"],
    "color": ["Yellow","Green","Orange","Brown","Green"],
    "price": [0.3,0.15,0.25,0.42,0.27]
    })

second = pd.DataFrame({
    "fruit": ["Grapefruit","Pineapple","Avocado","Pomegranate","Watermelon","Orange","Papaya"],
    "color": ["Yellow","Yellow","Green","Darkred","Green","Orange","Green"],
    "price": [0.15,1.5,0.18,0.45,1.25,0.31,1.15]
})

# Combine DataFrames
all_fruits = pd.concat([first,second])
all_fruits.reset_index(inplace=True,drop=True)
all_fruits
fruit color price
0 Banana Yellow 0.30
1 Apple Green 0.15
2 Mango Orange 0.25
3 Kiwi Brown 0.42
4 Pear Green 0.27
5 Grapefruit Yellow 0.15
6 Pineapple Yellow 1.50
7 Avocado Green 0.18
8 Pomegranate Darkred 0.45
9 Watermelon Green 1.25
10 Orange Orange 0.31
11 Papaya Green 1.15
  • As have been learned earlier in this course, we used inplace and drop parameters for reset_index method
    • inplace=True means that original DataFrame will be edited (same as reassigning it to the same variable)
    • drop=True means that we won't be using the old index column as a new column (will be removed)

Missing values in DataFrame#

  • Sometimes DataFrames can have missing values which are marked in either of the following two ways:
    • NaN (Not a Number)
    • None
  • Pandas has some useful functions for checking these:
    • isnull()
    • isna()
  • Let's utilize these for the example data presented below
import pandas as pd

# DataFrame initialization
# We have a set of wireless sensors which collect temperature and humidity information from the building

sensors = pd.DataFrame({
    "sensor": ["s1","s2","s3","s4","s5","s6","s7","s8","s9"],
    "temperature": [21.2,22.0,None,19.9,None,20.0,20.8,21.5,20.0],
    "humidity": [60.3,None,None,59.8,60.1,59.9,59.7,60.1,59.4],
    "pressure": [833,821,None,829,845,None,823,None,820],
    "last_read": ["2021-08-10 13:31:28","2021-08-10 13:30:59","2021-08-10 13:31:17",None,None,"2021-08-10 13:31:42",None,"2021-08-10 13:30.47",None]
})

sensors
sensor temperature humidity pressure last_read
0 s1 21.2 60.3 833.0 2021-08-10 13:31:28
1 s2 22.0 NaN 821.0 2021-08-10 13:30:59
2 s3 NaN NaN NaN 2021-08-10 13:31:17
3 s4 19.9 59.8 829.0 None
4 s5 NaN 60.1 845.0 None
5 s6 20.0 59.9 NaN 2021-08-10 13:31:42
6 s7 20.8 59.7 823.0 None
7 s8 21.5 60.1 NaN 2021-08-10 13:30.47
8 s9 20.0 59.4 820.0 None
# Check missing values from the DataFrame
sensors.isna()
sensor temperature humidity pressure last_read
0 False False False False False
1 False False True False False
2 False True True True False
3 False False False False True
4 False True False False True
5 False False False True False
6 False False False False True
7 False False False True False
8 False False False False True
  • As can be seen from the output of isna() method, missing values are spotted with True value
  • Next we will examine how these missing values can be processed
  • Missing values can be handled with either of the following ways:
    • Remove them with dropna() method
    • Fill them with either fillna() or interpolate() methods

Dropping rows or columns with missing values#

  • dropna() has the following available attributes:
    • axis: value 0 for rows, value 1 for columns (default is 0)
    • how: any for dropping row/column if any of the values are missing, all for dropping row/column if all values are missing
    • subset: rows/columns to look for missing values from
    • inplace: True will change the original DataFrame (default is False)
  • Let's test this by removing rows which have at least one missing value in temperature or humidity columns if these rows have any missing values
sensors.dropna(inplace=True,subset=['temperature','humidity'])
sensors
sensor temperature humidity pressure last_read
0 s1 21.2 60.3 833.0 2021-08-10 13:31:28
3 s4 19.9 59.8 829.0 None
5 s6 20.0 59.9 NaN 2021-08-10 13:31:42
6 s7 20.8 59.7 823.0 None
7 s8 21.5 60.1 NaN 2021-08-10 13:30.47
8 s9 20.0 59.4 820.0 None

Filling missing values#

  • Missing values can be filled with fillna() method
  • The following parameters can be utilized:
    • value: input value(s) which will be used for filling NaN values
    • method: this will be used with value of ffill or bfill if user won't input any values. This uses values from previous (ffill) or next cells (bfill)
    • axis: value 0 for rows, value 1 for columns (default is 0)
    • inplace: True will change the original DataFrame (default is False)
    • limit: limit how many rows/columns at maximum will be affected
  • Now that we still have two missing values in last_read column, we may try to fill these with ffill method
  • Important: column selection will be done before fillna method since fillna does not have subset attribute
# Fill the missing datetime values in last_read column
sensors["last_read"].fillna(inplace=True,method="ffill")
sensors
sensor temperature humidity pressure last_read
0 s1 21.2 60.3 833.0 2021-08-10 13:31:28
3 s4 19.9 59.8 829.0 2021-08-10 13:31:28
5 s6 20.0 59.9 NaN 2021-08-10 13:31:42
6 s7 20.8 59.7 823.0 2021-08-10 13:31:42
7 s8 21.5 60.1 NaN 2021-08-10 13:30.47
8 s9 20.0 59.4 820.0 2021-08-10 13:30.47
  • As can be seen from the output, datetime values were filled with previous values
  • interpolate() is another method for filling missing values
  • interpolate() method has several techniques that can be utilized
  • Here we are going to examine the linear technique
  • The idea behind the linear technique is that the distance to next and previous point is equal (see picture below)

Interpolate example

  • Interpolate method using linear technique will try to fill the missing points by following this formula (y1 - y0) / (x1 - x0)
  • Let's try to fill the missing pressure values using linear technique with interpolate() method
# Fill the missing pressure values
sensors["pressure"].interpolate(inplace=True,method="linear")
sensors
sensor temperature humidity pressure last_read
0 s1 21.2 60.3 833.0 2021-08-10 13:31:28
3 s4 19.9 59.8 829.0 2021-08-10 13:31:28
5 s6 20.0 59.9 826.0 2021-08-10 13:31:42
6 s7 20.8 59.7 823.0 2021-08-10 13:31:42
7 s8 21.5 60.1 821.5 2021-08-10 13:30.47
8 s9 20.0 59.4 820.0 2021-08-10 13:30.47
  • As can be seen from the output, the pressure values were filled with the following way:
    • sensor s6 was filled with value 826 (value between 823 and 829)
    • sensor s8 was filled with value 821.5 (value between 820 and 823)