Pandas part 2#
Data grouping#
- Data grouping means combining results for target columns in DataFrame
- This functionality is similar to database languages (for example SQL languages)
- Below is an example where the average value (mean) of two player's scores in three round of disc golf are calculated and grouped
import pandas as pd
# DataFrame creation
disc_golf_data = pd.DataFrame({
"name": ["Mike","Mike","Mike","Steve","Steve","Steve"],
"hole_1": [3,2,3,2,2,2],
"hole_2": [2,2,2,3,2,2],
"hole_3": [3,4,5,5,4,4],
"hole_4": [3,3,5,3,3,4],
"hole_5": [5,6,4,5,3,4]
})
# Add total score column at the end
disc_golf_data["score"] = disc_golf_data.iloc[:,1:].sum(axis=1)
# Group the data by player name and calculate the mean value
disc_golf_data.groupby("name").mean()
Combining DataFrames#
- Usually we need to combine data from several sources (files, url) into DataFrames
- Combining data into a single DataFrames gives us the ability to analyze it easier
- concat() function can be used to add either columns or rows from one DataFrame to another
- Below is an example where two DataFrames are combined together
# Initialize DataFrames
first = pd.DataFrame({
"fruit": ["Banana","Apple","Mango","Kiwi","Pear"],
"color": ["Yellow","Green","Orange","Brown","Green"],
"price": [0.3,0.15,0.25,0.42,0.27]
})
second = pd.DataFrame({
"fruit": ["Grapefruit","Pineapple","Avocado","Pomegranate","Watermelon","Orange","Papaya"],
"color": ["Yellow","Yellow","Green","Darkred","Green","Orange","Green"],
"price": [0.15,1.5,0.18,0.45,1.25,0.31,1.15]
})
# Combine DataFrames
all_fruits = pd.concat([first,second])
all_fruits.reset_index(inplace=True,drop=True)
all_fruits
- As have been learned earlier in this course, we used inplace and drop parameters for reset_index method
- inplace=True means that original DataFrame will be edited (same as reassigning it to the same variable)
- drop=True means that we won't be using the old index column as a new column (will be removed)
Missing values in DataFrame#
- Sometimes DataFrames can have missing values which are marked in either of the following two ways:
- NaN (Not a Number)
- None
- Pandas has some useful functions for checking these:
- isnull()
- isna()
- Let's utilize these for the example data presented below
import pandas as pd
# DataFrame initialization
# We have a set of wireless sensors which collect temperature and humidity information from the building
sensors = pd.DataFrame({
"sensor": ["s1","s2","s3","s4","s5","s6","s7","s8","s9"],
"temperature": [21.2,22.0,None,19.9,None,20.0,20.8,21.5,20.0],
"humidity": [60.3,None,None,59.8,60.1,59.9,59.7,60.1,59.4],
"pressure": [833,821,None,829,845,None,823,None,820],
"last_read": ["2021-08-10 13:31:28","2021-08-10 13:30:59","2021-08-10 13:31:17",None,None,"2021-08-10 13:31:42",None,"2021-08-10 13:30.47",None]
})
sensors
# Check missing values from the DataFrame
sensors.isna()
- As can be seen from the output of isna() method, missing values are spotted with True value
- Next we will examine how these missing values can be processed
- Missing values can be handled with either of the following ways:
- Remove them with dropna() method
- Fill them with either fillna() or interpolate() methods
Dropping rows or columns with missing values#
- dropna() has the following available attributes:
- axis: value 0 for rows, value 1 for columns (default is 0)
- how: any for dropping row/column if any of the values are missing, all for dropping row/column if all values are missing
- subset: rows/columns to look for missing values from
- inplace: True will change the original DataFrame (default is False)
- Let's test this by removing rows which have at least one missing value in temperature or humidity columns if these rows have any missing values
sensors.dropna(inplace=True,subset=['temperature','humidity'])
sensors
Filling missing values#
- Missing values can be filled with fillna() method
- The following parameters can be utilized:
- value: input value(s) which will be used for filling NaN values
- method: this will be used with value of ffill or bfill if user won't input any values. This uses values from previous (ffill) or next cells (bfill)
- axis: value 0 for rows, value 1 for columns (default is 0)
- inplace: True will change the original DataFrame (default is False)
- limit: limit how many rows/columns at maximum will be affected
- Now that we still have two missing values in last_read column, we may try to fill these with ffill method
- Important: column selection will be done before fillna method since fillna does not have subset attribute
# Fill the missing datetime values in last_read column
sensors["last_read"].fillna(inplace=True,method="ffill")
sensors
- As can be seen from the output, datetime values were filled with previous values
- interpolate() is another method for filling missing values
- interpolate() method has several techniques that can be utilized
- Here we are going to examine the linear technique
- The idea behind the linear technique is that the distance to next and previous point is equal (see picture below)
- Interpolate method using linear technique will try to fill the missing points by following this formula (y1 - y0) / (x1 - x0)
- Let's try to fill the missing pressure values using linear technique with interpolate() method
# Fill the missing pressure values
sensors["pressure"].interpolate(inplace=True,method="linear")
sensors
- As can be seen from the output, the pressure values were filled with the following way:
- sensor s6 was filled with value 826 (value between 823 and 829)
- sensor s8 was filled with value 821.5 (value between 820 and 823)