Data sources#
- In this lecture we examine different data sources and how we could read datasets in different formats available through these sources with Python
- Datasets usually don't have one common data format so it's essential to familiarize ourselves with these
CSV#
- CSV (Comma Separated Values) is a file type where data is stored in plain text
- It is tabular data where columns are usually separated with commas (,)
- CSV file can also have other separator types like tab (\t), colon (:) and semi-colon (;)
- Later in this course we are going to use Pandas for data manipulation
- Pandas has CSV file reading capability included so we could use either it or import csv library to Python
- Both of these approaches are presented below
# Method 1: Using read_csv method in Pandas
import pandas as pd
# Read CSV file containing car data
cars_csv_pandas = pd.read_csv("cars.csv",delimiter=";")
cars_csv_pandas
# Method 2: Using csv library
import csv
with open("cars.csv","r") as f:
cars_csv = csv.reader(f)
for row in cars_csv:
print(row)
- As can be seen from the two examples above, the data output is a bit different on both of these
- Pandas we will be using later on this course when we convert the data into the dataframe
- At this point let's modify the csv data read with csv library to easier to read format
# Modify csv library example output
with open("cars.csv","r") as f:
cars_csv = csv.reader(f)
car_data = []
for row in cars_csv:
# Each row is inside a list as a string (one element in each list) so we will access it with [0]
# In addition, each row will be converted from a string datatype into a list using split() method
element = row[0].split(";")
# After splitting, each row will be appended to car_data list
car_data.append(element)
print(car_data)
- At this point data can be more easily accessed since each field can be referred with list index
- This collection of car data could now be transferred into the dictionary like shown below
# First element in the car data collection contains titles which will be used as keys in dictionary
titles = car_data[0]
# Second element contains the information of datatypes so it will be left out from the actual car data
cars = car_data[2:]
# Now we can create a set of dictionaries which we will store into a new list
car_collection = []
# Outer loop will go through sublists which each contains the data for one car
for car in cars:
# Empty object will be created for each iteration
obj = {}
# Inner loop will go through index numbers from 0 to 8 (same amount for titles and cars)
for j in range(len(car)):
# Create key-value pairs for the object (match index numbers in titles and cars)
obj[titles[j]] = car[j]
# Insert created object to collection
car_collection.append(obj)
print(car_collection)
- Now we could present a single car information like shown in the example below
print(car_collection[10])
- Another example would be to show all cars with acceleration value 10 or less
# Show cars with acceleration value 10 or less
for i in car_collection:
if float(i["Acceleration"]) <= 10:
print(i)
JSON#
- Json (JavaScript Object Notation) is usually used as a data format when data is transferred between client and server
- Data structure in json is quite similar to dictionary presented earlier in this course
- However whereas dictionaries are usually used for working with data in your program locally, json is used for sending it between programs (for example between client and server like mentioned above)
- And to be precise, json is a serialization format and dictionary is a data structure
- Below is an example where json containing stock price data is read using json library in Python
import json
with open('stocks.json') as stockdata:
stocks = json.load(stockdata)
print(stocks)
- We could now calculate retrieve the maximum and minimum values from stock prices in the following manner:
- get the maximum number from the high key
- get the minimum value from the low key
# Initialize
high_values = []
low_values = []
# All daily high and low stock values are recorded in appropriate lists
for i in stocks:
high_values.append(i["high"])
low_values.append(i["low"])
# max() and min() functions are used for retrieving the lowest and highest values from lists
print("Highest stock price was {}".format(max(high_values)))
print("Lowest stock price was {}".format(min(low_values)))
XML#
- XML (Extensible Markup Language) is used to store and transfer data between systems
- Similarly to HTML, data in XML is presented between tags
- Below is an example where XML file containing person data is read
- XML data can be accessed with the following attributes:
- .tag = Name of the element
- .text = Text content between tags
import xml.etree.ElementTree as e
tree = e.parse("persons.xml")
root = tree.getroot()
person_collection = []
# Gather titles from one element
titles = []
for i in root[0]:
titles.append(i.tag)
for elem in root:
person = []
for subelem in elem:
person.append(subelem.text)
person_collection.append(person)
print(person_collection)
- Now that we have all our XML data moved to a list containing each person information as a sublist, we could clear the structure of the data by creating dictionaries from each person's information and storing them in the list
persons = []
for lst in person_collection:
person_obj = {}
for i in range(len(lst)):
person_obj[titles[i]] = lst[i]
persons.append(person_obj)
print(persons)
- We will filter our data so that we only print persons who:
- have brackets () in their phone number and
- have their email address ending with either co.uk or com
# Let's loop each sublist and check the value in second and the last field
for i in persons:
# Check that BOTH brackets will be found from the string in the second index position
# Check that either '.com' OR '.co.uk' substring is found from the string in the last index position
if ('(' and ')' in i["Phone"]) and ('.com' or '.co.uk' in i["Email"]):
print(i)
Rest API#
- Rest APIs (Representational state transfer Application Programming Interface) are built for data retrieval, update, delete and insertion of a new data
- API queries are run on top of the HTTP protocol and thus data manipulation is done using HTTP methods
- Data can be retrieved straight with web browser (although not very effective way especially with larger data sets)
- The following HTTP methods can be used for data manipulation:
- POST (data insertion)
- GET (data retrieval)
- PUT (data update)
- DELETE (data deletion)
- Data manipulation is based on
- url paths,
- parameters included in url and
- data included in HTTP header (usually authentication data)
- The basic idea of the Rest API is presented in the image below
- User sends Rest API request to the server
- Request includes HTTP method (What to do with the data) and URI (the location and name of the resource)
-
Server responds with the appropriate data, usually in XML or JSON format
-
Below is an example which presents the cat facts API
- Query is sent using Python's requests library
import requests
endpoint = "/facts"
user_id = "5894af975cdc7400113ef7f9"
url = "https://cat-fact.herokuapp.com{0}/{1}".format(endpoint, user_id)
req = requests.get(url=url)
req.status_code
- We got status code 200 as a response from the server, which means a successful HTTP request
- Since the data is available in JSON format, we could save it to a variable for further manipulation
cat_data = req.json()
cat_data
- API returns one cat fact based on given ID at the end of the url
- We can now pick the actual text from the response
cat_data["text"]