Data visualization with Pandas#

In this section we dive into the data visualization options of the Pandas library
Data visualization gives a better understanding of the underlying data, especially for those who don't actively work with data
Another benefit is that deviations are easier to spot in a chart than in data stored in a table
Pandas visualization is based on the Matplotlib library

Important: the Matplotlib package must be installed in order to use plotting with the Pandas library! In Anaconda, this package is preinstalled in the base environment.
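A quick way to confirm that Matplotlib is available before plotting is sketched below. It is only an illustrative check built on the Python standard library, and the variable name has_matplotlib is made up for this example.

```python
import importlib.util

# find_spec returns None when the package cannot be located in the current environment
has_matplotlib = importlib.util.find_spec("matplotlib") is not None

if has_matplotlib:
    print("Matplotlib found, Pandas plotting should work")
else:
    print("Matplotlib missing, install it (for example with pip or conda) before plotting")
```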

Bar chart#

Below is an example where a DataFrame containing temperature readings from six sensors is visualized
The plot() method is used for visualization
The following arguments are given:
- kind: defines the chart type
- x: sets the column whose values label the x-axis
- figsize: sets the width and height of the chart area (in inches)

import pandas as pd

sensordata = pd.DataFrame({
    "ID": ["S1","S2","S3","S4","S5","S6"],
    "temp": [21.2,20.3,20.9,21.3,21.0,20.8]
    })

sensordata.plot(kind="bar",x="ID",figsize=(6,5))

<AxesSubplot:xlabel='ID'>
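The output above shows that plot() returns a Matplotlib Axes object. A minimal sketch of how that object could be used to refine the chart is given below. The title, the axis label and the file name are made up for illustration.

```python
import pandas as pd

sensordata = pd.DataFrame({
    "ID": ["S1", "S2", "S3", "S4", "S5", "S6"],
    "temp": [21.2, 20.3, 20.9, 21.3, 21.0, 20.8]
})

# plot() returns the Matplotlib Axes that the bars were drawn on
ax = sensordata.plot(kind="bar", x="ID", y="temp", figsize=(6, 5))

# Standard Matplotlib Axes methods can then be used to adjust the chart
ax.set_title("Sensor temperatures")  # illustrative title
ax.set_ylabel("Temperature")         # illustrative axis label

# The figure could also be saved to a file (hypothetical file name)
# ax.get_figure().savefig("sensor_temperatures.png")
```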

There is also a horizontal version of the bar chart called barh
Below is the same data as above, this time visualized with horizontal bars

sensordata.plot(kind="barh",x="ID",figsize=(6,5))

<AxesSubplot:ylabel='ID'>
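Horizontal bars are often easier to compare when they are ordered by value. The sketch below sorts the same sensor data before plotting. The sorting step is an addition for illustration and the name sorted_data is made up here.

```python
import pandas as pd

sensordata = pd.DataFrame({
    "ID": ["S1", "S2", "S3", "S4", "S5", "S6"],
    "temp": [21.2, 20.3, 20.9, 21.3, 21.0, 20.8]
})

# Sort by temperature so the bars grow in length from bottom to top
sorted_data = sensordata.sort_values("temp")

ax = sorted_data.plot(kind="barh", x="ID", y="temp", figsize=(6, 5))
```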

Line chart#

A line chart is the default chart type for the plot() method if the kind argument is not provided
Below is an example where data is loaded from the stocks.json file
In this example the y-axis is also defined, and four value types are presented for the stocks (open, close, high and low)

stockdata = pd.read_json("stocks.json")
stockdata.plot(kind="line",x="date",y=["open","close","high","low"],figsize=(6,5))

<AxesSubplot:xlabel='date'>
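The contents of stocks.json are not shown here, so the sketch below uses a small in-memory DataFrame with invented values in the same column layout (treating date as a datetime column is an assumption). It leaves out the kind argument to show that a line chart is drawn by default.

```python
import pandas as pd

# Invented values, only to mimic the column layout used above
stockdata = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07"]),
    "open": [10.0, 10.4, 10.1, 10.6],
    "close": [10.3, 10.2, 10.5, 10.8],
    "high": [10.5, 10.6, 10.7, 11.0],
    "low": [9.9, 10.0, 10.0, 10.5]
})

# No kind argument: plot() falls back to a line chart, one line per y column
ax = stockdata.plot(x="date", y=["open", "close", "high", "low"], figsize=(6, 5))
```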

Pie chart#

A pie chart is a useful visualization style when you want to present the numerical proportions of items
Each slice in a pie chart presents the proportion of a single item
Below is an example where the proportions of eight items are illustrated
A list of labels containing the percentage for each item is prepared as the input
The item names are shown as a legend by calling the legend() method on the object returned by plot()

item_collection = pd.DataFrame({
    "name": ["item 1","item 2","item 3","item 4","item 5","item 6","item 7","item 8"],
    "temp": [120,132,29,23,146,92,111,18]
    })

label_list = [str(round(100 * (i / sum(item_collection["temp"])),1)) + ' %' for i in item_collection["temp"]]
item_collection.plot(kind="pie",y="temp",labels=label_list,figsize=(10,5)).legend(item_collection["name"])

<matplotlib.legend.Legend at 0x1e87488d5e0>
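As an alternative sketch, the percentage texts can be left for Matplotlib to format through the autopct argument of its pie function, which Pandas passes along. Using the name column as the index, so that the names label the slices and the legend, is a choice made for this example, and the name shares is made up here.

```python
import pandas as pd

item_collection = pd.DataFrame({
    "name": ["item 1", "item 2", "item 3", "item 4", "item 5", "item 6", "item 7", "item 8"],
    "temp": [120, 132, 29, 23, 146, 92, 111, 18]
})

# Share of each item in percent, computed with vectorized Pandas operations
shares = 100 * item_collection["temp"] / item_collection["temp"].sum()
print(shares.round(1))

# With the names as the index they become the slice labels;
# autopct formats the percentage text drawn inside each slice
ax = item_collection.set_index("name").plot(
    kind="pie", y="temp", autopct="%1.1f%%", figsize=(10, 5)
)
```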

Area plot#

The idea of an area chart is to combine the line and bar chart to visualize the progression of numeric values and their comparison to other grouped values
An area chart can be presented in either overlapped or stacked mode
Below is an example of the stacked mode (stacked is the default)

areadata = pd.DataFrame([[13,21,23],[12,17,20],[14,16,19]], columns=["item 1","item 2","item 3"])

areadata.plot(kind="area",figsize=(6,5))

<AxesSubplot:>

An area chart can also be presented in overlapped mode by setting the stacked parameter to False

areadata.plot(kind="area",figsize=(6,5),stacked=False)

<AxesSubplot:>
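In stacked mode each area is drawn on top of the previous one, so the upper edge of an area is the running total of the columns up to and including it. A small sketch of that arithmetic with the same data follows. The name stacked_tops is made up for this example.

```python
import pandas as pd

areadata = pd.DataFrame([[13, 21, 23], [12, 17, 20], [14, 16, 19]],
                        columns=["item 1", "item 2", "item 3"])

# Cumulative sum across the columns: the height of each area's upper edge in stacked mode
stacked_tops = areadata.cumsum(axis=1)
print(stacked_tops)

# The stacked chart itself, for comparison with the printed values
ax = areadata.plot(kind="area", figsize=(6, 5))
```

In overlapped mode no such summing takes place and every area starts from zero.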

Scatter chart#

In a scatter chart, data is presented as dots which are placed according to two different numeric variables
A scatter chart is great for examining relationships between variables
The x and y arguments define the columns used for the x- and y-axis of the scatter chart

dataset = pd.DataFrame({
    "acceleration": [2.2,1.3,1.8,1.5,3.2,2.8,1.3,2.2,2.0,1.9],
    "weight": [100,110,108,102,106,111,109,100,101,107],
    "name": ["a1","a2","a3","a4","a5","a6","a7","a8","a9","a10"]
})

dataset.plot(kind="scatter",x="acceleration",y="name",figsize=(6,5))

<AxesSubplot:xlabel='acceleration', ylabel='name'>
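The example above places the name column on the y-axis. To examine the relationship between two numeric variables, a sketch along the following lines could be used instead, with weight on the y-axis. Computing the Pearson correlation coefficient is an addition for illustration and the variable name r is made up here.

```python
import pandas as pd

dataset = pd.DataFrame({
    "acceleration": [2.2, 1.3, 1.8, 1.5, 3.2, 2.8, 1.3, 2.2, 2.0, 1.9],
    "weight": [100, 110, 108, 102, 106, 111, 109, 100, 101, 107],
    "name": ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "a10"]
})

# Both axes are numeric here, so the dots show how weight varies with acceleration
ax = dataset.plot(kind="scatter", x="acceleration", y="weight", figsize=(6, 5))

# Pearson correlation coefficient as a numeric summary of the same relationship
r = dataset["acceleration"].corr(dataset["weight"])
print(round(r, 2))
```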

A scatter chart can be made more informative by adding a colormap when a third variable is available in the data
The available colormap values are listed in the Matplotlib colormap documentation
Let's pass the weight column with the c parameter and bind it to the colormap

dataset = pd.DataFrame({
    "acceleration": [2.2,1.3,1.8,1.5,3.2,2.8,1.3,2.2,2.0,1.9],
    "weight": [100,110,108,102,106,111,109,100,101,107],
    "name": ["a1","a2","a3","a4","a5","a6","a7","a8","a9","a10"]
})

dataset.plot(kind="scatter",x="acceleration",y="name",figsize=(6,5),c="weight",colormap="gist_heat")

<AxesSubplot:xlabel='acceleration', ylabel='name'>

Outlier value spotting with scatter chart#

As presented in the NumPy lecture, outlier values are values that are distant from the rest of the data (for example, several standard deviations away from the mean)
A scatter chart is useful for spotting outlier values

dataset = pd.DataFrame({
    "temp": [22.2,21.3,21.8,40.1,21.5,23.2,22.8,21.3,22.2,22.0,21.9,1.5,23.1,22.2,21.8,20.9,22.8],
    "hum": [100,110,108,130,102,106,111,109,100,101,107,101,102,108,107,104,106]
})

dataset.plot(kind="scatter",x="temp",y="hum",figsize=(6,5))

<AxesSubplot:xlabel='temp', ylabel='hum'>

As can be seen from the scatter chart, two points lie far from the others: (40.1, 130) and (1.5, 101)
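The visual impression can be cross-checked numerically. The sketch below flags temperatures that lie more than 2 standard deviations from the mean and draws them in red on the same chart. The threshold of 2 and the names z and is_outlier are assumptions made for this example, not a fixed rule.

```python
import pandas as pd

dataset = pd.DataFrame({
    "temp": [22.2, 21.3, 21.8, 40.1, 21.5, 23.2, 22.8, 21.3, 22.2, 22.0, 21.9, 1.5, 23.1, 22.2, 21.8, 20.9, 22.8],
    "hum": [100, 110, 108, 130, 102, 106, 111, 109, 100, 101, 107, 101, 102, 108, 107, 104, 106]
})

# Distance of each temperature from the mean, measured in standard deviations
z = (dataset["temp"] - dataset["temp"].mean()) / dataset["temp"].std()

# Assumed threshold: more than 2 standard deviations counts as an outlier
is_outlier = z.abs() > 2
print(dataset[is_outlier])

# Draw the regular points first, then the flagged points in red on the same Axes
ax = dataset[~is_outlier].plot(kind="scatter", x="temp", y="hum", figsize=(6, 5))
dataset[is_outlier].plot(kind="scatter", x="temp", y="hum", color="red", ax=ax)
```

With only 17 points, the extreme values themselves inflate the standard deviation, so a stricter threshold may fail to flag them.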