Data visualization with Pandas#

  • In this section we dive into the data visualization options of the Pandas library
  • Visualization gives a better understanding of the underlying data, especially for those who don't actively work with data
  • Another benefit is that deviations are easier to spot in a chart than in data stored in a table
  • Pandas visualization is based on the Matplotlib library
Important: the Matplotlib package must be installed to use plotting with the Pandas library! In Anaconda, this package is preinstalled in the base environment.
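  • As a minimal sketch, the installation requirement above can be checked from Python before plotting. The variable name matplotlib_available and the printed messages are illustrative, not part of the original material.

```python
import importlib.util

# find_spec returns None when the package cannot be found in the current environment
matplotlib_available = importlib.util.find_spec("matplotlib") is not None

if matplotlib_available:
    print("Matplotlib is installed, DataFrame.plot() can be used")
else:
    print("Matplotlib is missing, install it e.g. with: pip install matplotlib")
```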

Bar chart#

  • Below is an example where a DataFrame containing temperature readings from six sensors is visualized
  • The plot() method is used for visualization
  • The following arguments are given:
    • kind: defines the chart type
    • x: sets the column used for the x-axis labels
    • figsize: sets the width and height of the figure in inches
import pandas as pd

sensordata = pd.DataFrame({
    "ID": ["S1","S2","S3","S4","S5","S6"],
    "temp": [21.2,20.3,20.9,21.3,21.0,20.8]
    })

sensordata.plot(kind="bar",x="ID",figsize=(6,5))
<AxesSubplot:xlabel='ID'>
  • There is also a horizontal version of the bar chart, called barh
  • Below is the same data as above, this time visualized with horizontal bars
sensordata.plot(kind="barh",x="ID",figsize=(6,5))
<AxesSubplot:ylabel='ID'>
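  • Because Pandas plotting is built on Matplotlib, plot() returns a Matplotlib Axes object (the AxesSubplot shown in the cell output). The sketch below, under the assumption that you want to adjust the chart afterwards, uses that object to add a y-axis label and a title. The title text is a hypothetical example.

```python
import pandas as pd

sensordata = pd.DataFrame({
    "ID": ["S1", "S2", "S3", "S4", "S5", "S6"],
    "temp": [21.2, 20.3, 20.9, 21.3, 21.0, 20.8]
})

# plot() returns the Matplotlib Axes that the bars were drawn on
ax = sensordata.plot(kind="bar", x="ID", figsize=(6, 5))

# The Axes can be customized with regular Matplotlib methods
ax.set_ylabel("temp")
ax.set_title("Sensor temperatures")  # hypothetical title
```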

Line chart#

  • Line chart is the default chart type of the plot() method when the kind argument is not provided
  • Below is an example where data is loaded from the stocks.json file
  • In this example the y argument is also defined: four value types (open, close, high and low) are plotted for the stocks
stockdata = pd.read_json("stocks.json")
stockdata.plot(kind="line",x="date",y=["open","close","high","low"],figsize=(6,5))
<AxesSubplot:xlabel='date'>
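  • The stocks.json file is not included here, so the sketch below uses a few made-up rows with the same column names purely as a stand-in. It shows that leaving out the kind argument still produces a line chart with one line per y column.

```python
import pandas as pd

# Hypothetical placeholder values standing in for the contents of stocks.json
stockdata = pd.DataFrame({
    "date": pd.to_datetime(["2021-06-01", "2021-06-02", "2021-06-03"]),
    "open": [10.0, 10.4, 10.1],
    "close": [10.3, 10.2, 10.6],
    "high": [10.5, 10.6, 10.7],
    "low": [9.9, 10.0, 10.0],
})

# No kind argument is given, so plot() falls back to a line chart
ax = stockdata.plot(x="date", y=["open", "close", "high", "low"], figsize=(6, 5))
```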

Pie chart#

  • Pie chart is a useful chart type when you want to present the numerical proportions of items
  • Each slice in the pie chart represents the proportion of a single item
  • Below is an example where the proportions of eight items are illustrated
  • A list of labels containing the percentage of each item is prepared and passed in with the labels argument
  • The item names are shown as a legend by calling the legend() method on the object returned by plot()
item_collection = pd.DataFrame({
    "name": ["item 1","item 2","item 3","item 4","item 5","item 6","item 7","item 8"],
    "temp": [120,132,29,23,146,92,111,18]
    })

label_list = [str(round(100 * (i / sum(item_collection["temp"])),1)) + ' %' for i in item_collection["temp"]]
item_collection.plot(kind="pie",y="temp",labels=label_list,figsize=(10,5)).legend(item_collection["name"])
<matplotlib.legend.Legend at 0x1e87488d5e0>
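  • As an alternative sketch, the percentage labels can also be computed with vectorized Pandas operations instead of a list comprehension. The intermediate names values and percent are illustrative.

```python
import pandas as pd

item_collection = pd.DataFrame({
    "name": ["item 1", "item 2", "item 3", "item 4", "item 5", "item 6", "item 7", "item 8"],
    "temp": [120, 132, 29, 23, 146, 92, 111, 18]
})

values = item_collection["temp"]

# Share of each item of the column total, in percent, rounded to one decimal
percent = (values / values.sum() * 100).round(1)

# Turn the numbers into label strings such as "17.9 %"
label_list = (percent.astype(str) + " %").tolist()

ax = item_collection.plot(kind="pie", y="temp", labels=label_list, figsize=(10, 5))
ax.legend(item_collection["name"])
```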

Area plot#

  • The idea of an area chart is to combine the line and bar chart to visualize the progression of numeric values and compare them to other grouped values
  • An area chart can be presented in either overlapped or stacked mode
  • Below is an example of stacked mode (stacked=True is the default)
areadata = pd.DataFrame([[13,21,23],[12,17,20],[14,16,19]], columns=["item 1","item 2","item 3"])

areadata.plot(kind="area",figsize=(6,5))
<AxesSubplot:>
  • Area chart can also be presented in overlapped mode by setting the stacked parameter to False
areadata.plot(kind="area",figsize=(6,5),stacked=False)
<AxesSubplot:>
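  • To make the difference between the two modes concrete, the sketch below computes where the upper edge of each area ends up in stacked mode: it is the running total across the columns. In overlapped mode each area is drawn from zero up to its own value instead. The name stacked_tops is illustrative.

```python
import pandas as pd

areadata = pd.DataFrame([[13, 21, 23], [12, 17, 20], [14, 16, 19]], columns=["item 1", "item 2", "item 3"])

# In stacked mode each area is drawn on top of the previous ones,
# so the upper edge of each area is the cumulative sum along the row
stacked_tops = areadata.cumsum(axis=1)
print(stacked_tops)
```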

Scatter chart#

  • In a scatter chart, data is presented as dots that are placed according to two different numeric variables
  • Scatter chart is great for examining relationships between variables
  • The x and y arguments define which columns are used for the x- and y-axis of the scatter chart
dataset = pd.DataFrame({
    "acceleration": [2.2,1.3,1.8,1.5,3.2,2.8,1.3,2.2,2.0,1.9],
    "weight": [100,110,108,102,106,111,109,100,101,107],
    "name": ["a1","a2","a3","a4","a5","a6","a7","a8","a9","a10"]
})

dataset.plot(kind="scatter",x="acceleration",y="name",figsize=(6,5))
<AxesSubplot:xlabel='acceleration', ylabel='name'>
  • A scatter chart can be made more informative by adding a colormap when a third variable is available in the data
  • The available colormap names are listed in the Matplotlib colormap reference
  • Let's pass the weight column with the c parameter and bind it to a colormap
dataset = pd.DataFrame({
    "acceleration": [2.2,1.3,1.8,1.5,3.2,2.8,1.3,2.2,2.0,1.9],
    "weight": [100,110,108,102,106,111,109,100,101,107],
    "name": ["a1","a2","a3","a4","a5","a6","a7","a8","a9","a10"]
})

dataset.plot(kind="scatter",x="acceleration",y="name",figsize=(6,5),c="weight",colormap="gist_heat")
<AxesSubplot:xlabel='acceleration', ylabel='name'>
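  • As a further sketch, the relationship between two numeric columns can be examined by plotting them against each other and, as an assumption about what one might want next, quantified with the Pearson correlation coefficient from Series.corr(). The variable name r is illustrative.

```python
import pandas as pd

dataset = pd.DataFrame({
    "acceleration": [2.2, 1.3, 1.8, 1.5, 3.2, 2.8, 1.3, 2.2, 2.0, 1.9],
    "weight": [100, 110, 108, 102, 106, 111, 109, 100, 101, 107],
    "name": ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "a10"]
})

# Two numeric columns on the axes
ax = dataset.plot(kind="scatter", x="acceleration", y="weight", figsize=(6, 5))

# Pearson correlation: close to 1 or -1 suggests a linear relationship, close to 0 suggests none
r = dataset["acceleration"].corr(dataset["weight"])
print(r)
```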

Outlier value spotting with scatter chart#

  • As presented in the NumPy lecture, outlier values are values that lie far from the rest of the data (for example, more than a few standard deviations from the mean)
  • A scatter chart is useful for spotting outlier values
dataset = pd.DataFrame({
    "temp": [22.2,21.3,21.8,40.1,21.5,23.2,22.8,21.3,22.2,22.0,21.9,1.5,23.1,22.2,21.8,20.9,22.8],
    "hum": [100,110,108,130,102,106,111,109,100,101,107,101,102,108,107,104,106]
})

dataset.plot(kind="scatter",x="temp",y="hum",figsize=(6,5))
<AxesSubplot:xlabel='temp', ylabel='hum'>
  • As can be seen from the scatter chart, two points lie far from the others: temp 40.1 (hum 130) and temp 1.5 (hum 101)
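  • The visual check can be backed up numerically. The sketch below flags temp values by their z-score, that is, their distance from the mean measured in standard deviations. The threshold of 2 is an assumption chosen for this small dataset, not a fixed rule.

```python
import pandas as pd

dataset = pd.DataFrame({
    "temp": [22.2, 21.3, 21.8, 40.1, 21.5, 23.2, 22.8, 21.3, 22.2, 22.0, 21.9, 1.5, 23.1, 22.2, 21.8, 20.9, 22.8],
    "hum": [100, 110, 108, 130, 102, 106, 111, 109, 100, 101, 107, 101, 102, 108, 107, 104, 106]
})

# z-score: how many standard deviations each value is from the mean
z_scores = (dataset["temp"] - dataset["temp"].mean()) / dataset["temp"].std()

# Assumption: values more than 2 standard deviations from the mean count as outliers
threshold = 2
outliers = dataset[z_scores.abs() > threshold]
print(outliers)
```

  • With this threshold the two rows with temp 40.1 and temp 1.5 are flagged, matching the points that stand out in the chart.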