Pandas part 2#

Data grouping#

Data grouping means combining results for target columns in DataFrame
This functionality is similar to database languages (for example SQL languages)
Below is an example where the average value (mean) of two player's scores in three round of disc golf are calculated and grouped

import pandas as pd

# DataFrame creation
disc_golf_data = pd.DataFrame({
    "name": ["Mike","Mike","Mike","Steve","Steve","Steve"],
    "hole_1": [3,2,3,2,2,2],
    "hole_2": [2,2,2,3,2,2],
    "hole_3": [3,4,5,5,4,4],
    "hole_4": [3,3,5,3,3,4],
    "hole_5": [5,6,4,5,3,4]
})

# Add total score column at the end
disc_golf_data["score"] = disc_golf_data.iloc[:,1:].sum(axis=1)

# Group the data by player name and calculate the mean value
disc_golf_data.groupby("name").mean()

	hole_1	hole_2	hole_3	hole_4	hole_5	score
name
Mike	2.666667	2.000000	4.000000	3.666667	5.0	17.333333
Steve	2.000000	2.333333	4.333333	3.333333	4.0	16.000000

Combining DataFrames#

Usually we need to combine data from several sources (files, url) into DataFrames
Combining data into a single DataFrames gives us the ability to analyze it easier
concat() function can be used to add either columns or rows from one DataFrame to another
Below is an example where two DataFrames are combined together

# Initialize DataFrames
first = pd.DataFrame({
    "fruit": ["Banana","Apple","Mango","Kiwi","Pear"],
    "color": ["Yellow","Green","Orange","Brown","Green"],
    "price": [0.3,0.15,0.25,0.42,0.27]
    })

second = pd.DataFrame({
    "fruit": ["Grapefruit","Pineapple","Avocado","Pomegranate","Watermelon","Orange","Papaya"],
    "color": ["Yellow","Yellow","Green","Darkred","Green","Orange","Green"],
    "price": [0.15,1.5,0.18,0.45,1.25,0.31,1.15]
})

# Combine DataFrames
all_fruits = pd.concat([first,second])
all_fruits.reset_index(inplace=True,drop=True)
all_fruits

	fruit	color	price
0	Banana	Yellow	0.30
1	Apple	Green	0.15
2	Mango	Orange	0.25
3	Kiwi	Brown	0.42
4	Pear	Green	0.27
5	Grapefruit	Yellow	0.15
6	Pineapple	Yellow	1.50
7	Avocado	Green	0.18
8	Pomegranate	Darkred	0.45
9	Watermelon	Green	1.25
10	Orange	Orange	0.31
11	Papaya	Green	1.15

As have been learned earlier in this course, we used inplace and drop parameters for reset_index method
- inplace=True means that original DataFrame will be edited (same as reassigning it to the same variable)
- drop=True means that we won't be using the old index column as a new column (will be removed)

Missing values in DataFrame#

Sometimes DataFrames can have missing values which are marked in either of the following two ways:
- NaN (Not a Number)
- None
Pandas has some useful functions for checking these:
- isnull()
- isna()
Let's utilize these for the example data presented below

import pandas as pd

# DataFrame initialization
# We have a set of wireless sensors which collect temperature and humidity information from the building

sensors = pd.DataFrame({
    "sensor": ["s1","s2","s3","s4","s5","s6","s7","s8","s9"],
    "temperature": [21.2,22.0,None,19.9,None,20.0,20.8,21.5,20.0],
    "humidity": [60.3,None,None,59.8,60.1,59.9,59.7,60.1,59.4],
    "pressure": [833,821,None,829,845,None,823,None,820],
    "last_read": ["2021-08-10 13:31:28","2021-08-10 13:30:59","2021-08-10 13:31:17",None,None,"2021-08-10 13:31:42",None,"2021-08-10 13:30.47",None]
})

sensors

	sensor	temperature	humidity	pressure	last_read
0	s1	21.2	60.3	833.0	2021-08-10 13:31:28
1	s2	22.0	NaN	821.0	2021-08-10 13:30:59
2	s3	NaN	NaN	NaN	2021-08-10 13:31:17
3	s4	19.9	59.8	829.0	None
4	s5	NaN	60.1	845.0	None
5	s6	20.0	59.9	NaN	2021-08-10 13:31:42
6	s7	20.8	59.7	823.0	None
7	s8	21.5	60.1	NaN	2021-08-10 13:30.47
8	s9	20.0	59.4	820.0	None

# Check missing values from the DataFrame
sensors.isna()

	sensor	temperature	humidity	pressure	last_read
0	False	False	False	False	False
1	False	False	True	False	False
2	False	True	True	True	False
3	False	False	False	False	True
4	False	True	False	False	True
5	False	False	False	True	False
6	False	False	False	False	True
7	False	False	False	True	False
8	False	False	False	False	True

As can be seen from the output of isna() method, missing values are spotted with True value
Next we will examine how these missing values can be processed
Missing values can be handled with either of the following ways:
- Remove them with dropna() method
- Fill them with either fillna() or interpolate() methods

Dropping rows or columns with missing values#

dropna() has the following available attributes:
- axis: value 0 for rows, value 1 for columns (default is 0)
- how: any for dropping row/column if any of the values are missing, all for dropping row/column if all values are missing
- subset: rows/columns to look for missing values from
- inplace: True will change the original DataFrame (default is False)
Let's test this by removing rows which have at least one missing value in temperature or humidity columns if these rows have any missing values

sensors.dropna(inplace=True,subset=['temperature','humidity'])
sensors

	sensor	temperature	humidity	pressure	last_read
0	s1	21.2	60.3	833.0	2021-08-10 13:31:28
3	s4	19.9	59.8	829.0	None
5	s6	20.0	59.9	NaN	2021-08-10 13:31:42
6	s7	20.8	59.7	823.0	None
7	s8	21.5	60.1	NaN	2021-08-10 13:30.47
8	s9	20.0	59.4	820.0	None

Filling missing values#

Missing values can be filled with fillna() method
The following parameters can be utilized:
- value: input value(s) which will be used for filling NaN values
- method: this will be used with value of ffill or bfill if user won't input any values. This uses values from previous (ffill) or next cells (bfill)
- axis: value 0 for rows, value 1 for columns (default is 0)
- inplace: True will change the original DataFrame (default is False)
- limit: limit how many rows/columns at maximum will be affected
Now that we still have two missing values in last_read column, we may try to fill these with ffill method
Important: column selection will be done before fillna method since fillna does not have subset attribute

# Fill the missing datetime values in last_read column
sensors["last_read"].fillna(inplace=True,method="ffill")
sensors

	sensor	temperature	humidity	pressure	last_read
0	s1	21.2	60.3	833.0	2021-08-10 13:31:28
3	s4	19.9	59.8	829.0	2021-08-10 13:31:28
5	s6	20.0	59.9	NaN	2021-08-10 13:31:42
6	s7	20.8	59.7	823.0	2021-08-10 13:31:42
7	s8	21.5	60.1	NaN	2021-08-10 13:30.47
8	s9	20.0	59.4	820.0	2021-08-10 13:30.47

As can be seen from the output, datetime values were filled with previous values
interpolate() is another method for filling missing values
interpolate() method has several techniques that can be utilized
Here we are going to examine the linear technique
The idea behind the linear technique is that the distance to next and previous point is equal (see picture below)

Interpolate example

Interpolate method using linear technique will try to fill the missing points by following this formula (y1 - y0) / (x1 - x0)
Let's try to fill the missing pressure values using linear technique with interpolate() method

# Fill the missing pressure values
sensors["pressure"].interpolate(inplace=True,method="linear")
sensors

	sensor	temperature	humidity	pressure	last_read
0	s1	21.2	60.3	833.0	2021-08-10 13:31:28
3	s4	19.9	59.8	829.0	2021-08-10 13:31:28
5	s6	20.0	59.9	826.0	2021-08-10 13:31:42
6	s7	20.8	59.7	823.0	2021-08-10 13:31:42
7	s8	21.5	60.1	821.5	2021-08-10 13:30.47
8	s9	20.0	59.4	820.0	2021-08-10 13:30.47

As can be seen from the output, the pressure values were filled with the following way:
- sensor s6 was filled with value 826 (value between 823 and 829)
- sensor s8 was filled with value 821.5 (value between 820 and 823)