Data visualization with Pandas#
- In this section we dive into the data visualization options of the Pandas library
- Data visualization gives a better understanding of the underlying data, especially for those who don't actively work with data
- Another benefit is that deviations are easier to spot in a chart than in data stored in a table
- Pandas visualization is based on Matplotlib library
Important: the Matplotlib package must be installed before plotting with the Pandas library! In Anaconda, this package is preinstalled in the base environment.
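- Because Pandas plotting is built on Matplotlib, the plot() method returns a Matplotlib Axes object that can be customized further
- Below is a minimal sketch of this; the sample values, title, and axis labels are made up for illustration only

```python
import pandas as pd

# Hypothetical sample values, used only to have something to draw
df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 8]})

# plot() returns a Matplotlib Axes object
ax = df.plot(x="x", y="y", figsize=(6, 5))

# The Axes object gives access to the usual Matplotlib methods
ax.set_title("Sample line chart")
ax.set_xlabel("x value")
ax.set_ylabel("y value")
```

- The same pattern applies to the chart types presented below, since they are all created with plot()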
Bar chart#
- Below is an example where a DataFrame containing temperature readings from six sensors is visualized
- The plot() method is used for visualization
- The following arguments are given:
- kind: defines the chart type
- x: sets the column whose values are used as labels on the x-axis
- figsize: sets the width and height of the chart area (in inches)
import pandas as pd
sensordata = pd.DataFrame({
"ID": ["S1","S2","S3","S4","S5","S6"],
"temp": [21.2,20.3,20.9,21.3,21.0,20.8]
})
sensordata.plot(kind="bar",x="ID",figsize=(6,5))
- There is also a horizontal version of the bar chart, called barh
- Below is the same data as above, this time visualized with horizontal bars
sensordata.plot(kind="barh",x="ID",figsize=(6,5))
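- The sensor temperatures differ by about one degree only, so the bars above look almost equal in length
- As a sketch of one possible adjustment, the rows can be sorted and the value axis narrowed with the ylim argument; the chosen limits (20 to 21.5) are an assumption made for this data, and a truncated axis should be used with care because it exaggerates differences

```python
import pandas as pd

sensordata = pd.DataFrame({
    "ID": ["S1", "S2", "S3", "S4", "S5", "S6"],
    "temp": [21.2, 20.3, 20.9, 21.3, 21.0, 20.8]
})

# Sort the sensors from coldest to warmest before plotting
sorted_data = sensordata.sort_values("temp")

# ylim narrows the value axis, rot=0 keeps the x-axis labels horizontal
ax = sorted_data.plot(kind="bar", x="ID", y="temp", ylim=(20, 21.5),
                      rot=0, legend=False, figsize=(6, 5))
ax.set_ylabel("temp")
```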
Line chart#
- Line chart is the default chart type of the plot() method when the kind argument is not provided
- Below is an example where data is loaded from the stocks.json file
- In this example the y argument is also given: it lists the four value columns plotted for the stocks (open, close, high and low)
stockdata = pd.read_json("stocks.json")
stockdata.plot(kind="line",x="date",y=["open","close","high","low"],figsize=(6,5))
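- The stocks.json file is not included here, so below is a self-contained sketch with a small in-memory DataFrame
- The column names follow the example above, but the dates and prices are hypothetical values made up for illustration and the real file may be structured differently

```python
import pandas as pd

# Hypothetical values standing in for the contents of stocks.json
stockdata = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"],
    "open": [100.0, 101.5, 102.0, 101.0],
    "close": [101.5, 102.0, 101.0, 103.0],
    "high": [102.0, 103.0, 102.5, 103.5],
    "low": [99.5, 101.0, 100.5, 100.8]
})

# Convert the date strings to datetime values so the x-axis is a real time axis
stockdata["date"] = pd.to_datetime(stockdata["date"])

# One line is drawn for each column listed in y
ax = stockdata.plot(kind="line", x="date", y=["open", "close", "high", "low"],
                    figsize=(6, 5))
```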
Pie chart#
- Pie chart is a useful visualization style when you want to present the numerical proportions of items
- Each slice of the pie chart represents the proportion of a single item
- Below is an example where the proportions of eight items are illustrated
- A list of labels containing the percentage of each item is prepared and passed in with the labels argument
- The item names are shown as a legend by calling the legend() method on the chart returned by plot()
item_collection = pd.DataFrame({
"name": ["item 1","item 2","item 3","item 4","item 5","item 6","item 7","item 8"],
"temp": [120,132,29,23,146,92,111,18]
})
# Build one percentage label per item; the total is computed once up front
total = item_collection["temp"].sum()
label_list = [f"{round(100 * value / total, 1)} %" for value in item_collection["temp"]]
item_collection.plot(kind="pie",y="temp",labels=label_list,figsize=(10,5)).legend(item_collection["name"])
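- As an alternative sketch, the percentages can be left to Matplotlib: the autopct argument is passed through plot() to Matplotlib's pie function, which formats each slice's share with the given format string
- Here the item names are used as slice labels by setting them as the index, so no separate legend is needed; this layout is one possible choice, not the only one

```python
import pandas as pd

item_collection = pd.DataFrame({
    "name": ["item 1", "item 2", "item 3", "item 4",
             "item 5", "item 6", "item 7", "item 8"],
    "temp": [120, 132, 29, 23, 146, 92, 111, 18]
})

# The index values become the slice labels,
# autopct writes the percentage inside each slice
ax = item_collection.set_index("name").plot(
    kind="pie", y="temp", autopct="%1.1f%%", legend=False, figsize=(10, 5)
)

# Remove the column name that is shown as the y-axis label by default
ax.set_ylabel("")
```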
Area plot#
- The idea of the area chart is to combine the line chart and the bar chart: it shows the progression of numeric values and how they compare to other groups of values
- Area chart can be presented in either overlapped or stacked mode
- Below is an example of the stacked mode (stacked is the default)
areadata = pd.DataFrame([[13,21,23],[12,17,20],[14,16,19]], columns=["item 1","item 2","item 3"])
areadata.plot(kind="area",figsize=(6,5))
- Area chart can also be presented in overlapped mode by setting the stacked argument to False
areadata.plot(kind="area",figsize=(6,5),stacked=False)
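- In stacked mode each area is drawn on top of the previous one, so the upper edge of an area is the running total of the columns up to that point
- The sketch below computes these running totals with cumsum() to show which values the stacked chart actually draws; the variable name stack_tops is only illustrative

```python
import pandas as pd

areadata = pd.DataFrame([[13, 21, 23], [12, 17, 20], [14, 16, 19]],
                        columns=["item 1", "item 2", "item 3"])

# Running total across the columns of each row:
# these are the heights of the upper edges in the stacked chart
stack_tops = areadata.cumsum(axis=1)
print(stack_tops)

# The stacked chart itself (stacked=True is the default for area charts)
ax = areadata.plot(kind="area", stacked=True, figsize=(6, 5))
```

- In overlapped mode (stacked=False) the areas start from zero instead, so the original column values are drawn as they are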
Scatter chart#
- In a scatter chart each data point is presented as a dot that is placed according to the values of two variables, usually numeric ones
- Scatter chart is great for examining relationships between variables
- The x and y arguments define which columns are used for the x- and y-axis of the scatter chart
dataset = pd.DataFrame({
"acceleration": [2.2,1.3,1.8,1.5,3.2,2.8,1.3,2.2,2.0,1.9],
"weight": [100,110,108,102,106,111,109,100,101,107],
"name": ["a1","a2","a3","a4","a5","a6","a7","a8","a9","a10"]
})
dataset.plot(kind="scatter",x="acceleration",y="name",figsize=(6,5))
- Scatter chart can be made more informative by adding a colormap when a third variable is available in the data
- The available colormap values are listed in the Matplotlib colormap reference
- Let's pass the weight column with the c argument and bind it to the colormap
dataset = pd.DataFrame({
"acceleration": [2.2,1.3,1.8,1.5,3.2,2.8,1.3,2.2,2.0,1.9],
"weight": [100,110,108,102,106,111,109,100,101,107],
"name": ["a1","a2","a3","a4","a5","a6","a7","a8","a9","a10"]
})
dataset.plot(kind="scatter",x="acceleration",y="name",figsize=(6,5),c="weight",colormap="gist_heat")
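- A relationship seen in a scatter chart can be backed up with a number: the corr() method returns the correlation coefficient between two numeric columns
- The sketch below plots the two numeric columns of the same data against each other and computes their correlation; pairing acceleration with weight is a choice made for this illustration

```python
import pandas as pd

dataset = pd.DataFrame({
    "acceleration": [2.2, 1.3, 1.8, 1.5, 3.2, 2.8, 1.3, 2.2, 2.0, 1.9],
    "weight": [100, 110, 108, 102, 106, 111, 109, 100, 101, 107],
    "name": ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "a10"]
})

# Scatter chart of the two numeric columns
ax = dataset.plot(kind="scatter", x="acceleration", y="weight", figsize=(6, 5))

# Pearson correlation coefficient: values near 1 or -1 indicate a strong
# linear relationship, values near 0 a weak one
r = dataset["acceleration"].corr(dataset["weight"])
print(round(r, 2))
```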
Outlier value spotting with scatter chart#
- As presented in the numpy lecture, outlier values are values that lie far from the rest of the data (for example, more than a couple of standard deviations away from the mean)
- Scatter chart is useful for spotting outlier values
dataset = pd.DataFrame({
"temp": [22.2,21.3,21.8,40.1,21.5,23.2,22.8,21.3,22.2,22.0,21.9,1.5,23.1,22.2,21.8,20.9,22.8],
"hum": [100,110,108,130,102,106,111,109,100,101,107,101,102,108,107,104,106]
})
dataset.plot(kind="scatter",x="temp",y="hum",figsize=(6,5))
- As can be seen from the scatter chart, two points lie far from the others: (temp 40.1, hum 130) and (temp 1.5, hum 101)
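- The visual finding can be checked with a calculation: the sketch below computes how many standard deviations each value lies from the mean of its column (the z-score) and flags the rows that exceed a threshold
- The threshold of 2 standard deviations is an assumption chosen for this small data set, not a fixed rule; other thresholds and other outlier detection methods are also commonly used

```python
import pandas as pd

dataset = pd.DataFrame({
    "temp": [22.2, 21.3, 21.8, 40.1, 21.5, 23.2, 22.8, 21.3, 22.2,
             22.0, 21.9, 1.5, 23.1, 22.2, 21.8, 20.9, 22.8],
    "hum": [100, 110, 108, 130, 102, 106, 111, 109, 100,
            101, 107, 101, 102, 108, 107, 104, 106]
})

# z-score: distance from the column mean in units of standard deviation
z_scores = (dataset - dataset.mean()) / dataset.std()

# Assumed threshold: a row is an outlier if either column is
# more than 2 standard deviations away from its mean
is_outlier = (z_scores.abs() > 2).any(axis=1)
print(dataset[is_outlier])

# Draw normal points and outliers on the same chart with different colors
ax = dataset[~is_outlier].plot(kind="scatter", x="temp", y="hum",
                               color="tab:blue", label="normal", figsize=(6, 5))
dataset[is_outlier].plot(kind="scatter", x="temp", y="hum",
                         color="tab:red", label="outlier", ax=ax)
```

- With this threshold the flagged rows are the same two points that stand out in the scatter chart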