Pandas part 1#
- Pandas is a Python library used for data analytics
- Pandas introduces a new data structure called DataFrame
- DataFrame structure is tabular similar to SQL for example so it has rows and columns
- Below is an example where employee data is stored in dictionary and this data is then converted to DataFrame
import pandas as pd
dict = [
{"firstname":"Jennifer","lastname":"Davis","phone":"0613512441","email":"jennifer.davis@company.com"},
{"firstname":"Mike","lastname":"Johnson","phone":"0513643123","email":"mike.johnson@company.com"},
{"firstname":"Steve","lastname":"Mclaud","phone":"0561253343","email":"steve.mclaud@company.com"},
{"firstname":"Lucy","lastname":"Brown","phone":"0613412353","email":"lucy.brown@company.com"},
{"firstname":"Gary","lastname":"Stone","phone":"0612343521","email":"gary.stone@company.com"}
]
employees = pd.DataFrame(dict)
employees
- As can be seen from the example above each row will be given a unique key value starting from number 0
- User defined values can also be used as index values
- Let's use employee ID numbers as index numbers for the existing DataFrame
employees.index = ["JD111","MJ222","SM333","LB444","GS555"]
employees
Creating DataFrame from imported JSON file#
- Like we showed above, reading CSV files is quick and easy with pandas
- Same goes also with JSON files when we read JSON data with read_json() function
stockdata = pd.read_json("stocks.json")
stockdata.head()
Creating DataFrame from imported CSV file#
- Earlier in this course we learned how CSV file can be read with using either Python or Pandas
- Next we will read the cars.csv file using read_csv() function from Pandas library and convert it to DataFrame
import pandas as pd
car_data = pd.read_csv("cars.csv",delimiter=";")
car_data
Printing n rows of DataFrame#
- head() and tail() methods can be used to print first or last n rows
- The default value is 5 if no value is provided with these methods
# Show the first five rows
car_data.head()
# Show the last three rows
car_data.tail(3)
- While head() and tail() provide the method of presenting n rows from the beginning and the end of the DataFrame, slicing allows more flexible presentation of rows
- Slicing is done using [] operator
- This is similar to slicing string or list like shown previously in Python fundamentals
# Show the rows from 3 to 8 (inclusive) of car data
car_data[3:9]
# Show the last five rows of car data in reverse order
# [-5:] chooses the rows starting from the fifth last row till the end and [::-1] reverses the output
car_data[-5:][::-1]
Selecting data from the DataFrame#
- When we want to select certain data from the DataFrame, loc and iloc are great tools
- loc means that rows and columns are selected based on their labels (loc[row_label,column_label])
- iloc means that rows and columns are selected based on their index numbers (iloc[row_position,column_position])
- Let's go through a couple of examples with the car data
# Select Horsepower information of Ford Torino using .loc
print(car_data.loc[13,"Horsepower"])
# Select Horsepower information of Ford Torino using .iloc
print(car_data.iloc[13,4])
- Like shown in the example above we could retrieve the data from the single cell
- If we would like to select a greater portion of the data or all of the data we could utilize slicing technique with .loc and iloc
# Select all car data from the car DataFrame
car_data.loc[:,:]
- In order to select multiple columns from the DataFrame, columns can be given in a list
# Select car, cylinder, displacement and weight information and show six rows starting from index number 1
car_data.loc[1:6,["Car","Cylinders","Displacement","Weight"]]
Sorting data by column#
- Data sorting is done with sort_values() method
- This method takes the target column or columns inside a list and the sort direction (ascending or descending) as an input
- Below is an example where car data is sorted by weight in descending order
car_data.sort_values(["Weight"], ascending=False)
- Sorting can be done to multiple columns
- Direction of sort can be set to each sorted column
- In the example below car data is first sorted by MPG in descending order and then with Weight column in ascending order
car_data.sort_values(["MPG","Weight"], ascending=[False,True])
- As can be seen from the output, data is sorted primarily with MPG column and each row with equal MPG value will then be sorted with Weight value in ascending order
Row deletion#
- DataFrame rows could be removed with drop() method
- By default, when removing rows or columns from the DataFrame, new DataFrame is always created and changes are not made to the original
- If we want to make changes to the original, we should use inplace=True definition with drop() method
- Using inplace=True is the same as reassigning the DataFrame to the same variable again (overwrites the existing variable content)
- Car data still has extra row with row index number of 0
- Let's remove this row and save the modified DataFrame
# Remove the first row (index number 0) from the DataFrame and save the changes for the new DataFrame
# This would be the same as car_data = car_data.drop(0)
car_data.drop(0,inplace=True)
car_data
- Multiple rows can be deleted with one drop method by presenting the index labels or numbers in a list inside the drop() method
# Drop rows with index number 6, 8 and 12
car_data.drop([6,8,12],inplace=True)
car_data
- Now we see that dropped index numbers are missing
- reset_index() method can be used to correct index numbering
- We use drop=True here so the old index won't be added as a new column
car_data.reset_index(inplace=True,drop=True)
car_data
Adding a new column to DataFrame#
- There are many ways to insert new column to existing DataFrame:
- assign method (adds new column to the end of the DataFrame)
- insert method (adds new column to the DataFrame using the position of choice)
- with column name as a reference
- loc with column name as label
- Below are couple of examples for adding a new column
- In first example we add color column for cars using the assign method
- This method takes column name and value or valueset as an input
# Add color column for car DataFrame using assign method
colors = ["red","lightblue","green","red","blue","yellow","green","red","red","black"]
car_data = car_data.assign(Color=colors)
car_data
- In second example we add rating column for cars using insert method
- The benefit of insert method compared to assign is that you may freely choose the position of the new column
- Insert method takes four input values:
- position
- name
- values for the column
- check if column with the same name already exists in the DataFrame (True / False)
- In the example below we add our rating column to column index position 5 (sixth column in DataFrame)
# Add rating column for car DataFrame using loc
ratings = [4.1,3.8,4.2,4.0,3.9,3.9,4.4,4.2,4.1,3.7]
car_data.insert(5,"Rating",ratings,True)
car_data
Column deletion#
- drop() method can be used for removing columns also from the DataFrame
- drop() method takes axis parameter with either value of 0 or 1
- axis=0 (default value) removes rows
- axis=1 removes columns
- Let's remove the Model column from the DataFrame
car_data.drop(["Model"],inplace=True,axis=1)
car_data
Conversion between datatypes#
- Like in basic Python, with Pandas we sometimes need to convert from datatype to another
- Below is a table containing the possible datatypes in Pandas
Pandas dtype | Python type | Usage |
---|---|---|
object | str or mixed | Text or mixed numeric and non-numeric values |
int64 | int | Whole numbers |
float64 | float | Floating point numbers |
bool | bool | Boolean values (True/False) |
datetime64 | datetime | Date and time value |
timedelta | N/A | Difference between two datetime values |
category | N/A | Finite list of text values |
- Current datatypes used in DataFrame can be retrieved with dtypes method
# List datatypes for DataFrame
car_data.dtypes
- As can be seen from the output, we currently have all columns marked as objects
- In order to ease our future data filtering, let's set our datatypes to match the column values
- Important: String values will be treated as objects
car_data = car_data.astype({"Car":"string","MPG":"float","Cylinders":"int","Displacement":"float","Horsepower":"float","Weight":"float","Acceleration":"float","Origin":"string"})
car_data.dtypes
Data filtering based on values#
- With Pandas library we can also filter data using fundamentals we have acquired earlier in this course with Python
- Below is an example where car data is further filtered so that it matches the following criteria:
- Acceleration value must be 12.0 or less
- Horsepower value should be between 100 and 250 (inclusive)
- Car name should start with letter C
car_data = car_data[(car_data["Acceleration"] <= 12.0) & ((car_data["Horsepower"] >= 100) & (car_data["Horsepower"] <= 250)) & (car_data["Car"].str.startswith("C"))]
- Let's analyze what we did here
- All three conditions were combined with & (AND)
- Car name in Car column is still handled as series so we use str method in order to use string methods from Python for series data
- Changes are stored in current DataFrame by reassigning the variable
Writing DataFrame to CSV file#
- DataFrame can be written back to CSV file using .to_csv() method for DataFrame
- Below is an example where current modified DataFrame containing information of two cars is written into the csv file called cars-filtered.csv
car_data.to_csv("cars-filtered.csv")
- Finally, let's read the created file in order to verify that it includes the correct data
new_car_data = pd.read_csv("cars-filtered.csv")
new_car_data