Database in Machine Learning (2): Beyond Database

Technical Notes in Construction

1. Introduction

In machine learning projects, a database is not the only medium we need. We need other file formats as well, for two reasons:

In this article, we go through the different file formats we often encounter in machine learning projects.

2. Configuration File Formats

2.1 JSON

What is JSON?

JSON stands for “JavaScript Object Notation”. It is used to store and exchange data, as an alternative to XML.

Many well-known machine learning datasets use JSON to organize annotation and label information. One example is COCO, the well-known computer vision dataset for segmentation, detection, and body keypoints.

JSON Syntax

JSON syntax is based on two main structures: objects and arrays. An object is a collection of name/value pairs, and an array is an ordered list of values.

A JSON value can be: 1) a string in double quotes; 2) a number; 3) a boolean (true or false); 4) null; 5) an object; 6) an array.
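To make the value types above concrete, here is a minimal sketch that parses a small JSON document with Python's built-in json module and prints the Python type each value becomes. The document itself (the keys and numbers) is made up for illustration.

```python
import json

# A made-up config document that uses every JSON value type.
text = """
{
  "name": "resnet",
  "learning_rate": 0.001,
  "use_gpu": true,
  "checkpoint": null,
  "optimizer": {"type": "sgd", "momentum": 0.9},
  "layers": [64, 128, 256]
}
"""

config = json.loads(text)

# Each JSON type maps onto a Python type:
# string -> str, number -> int/float, true/false -> bool,
# null -> None, object -> dict, array -> list.
for key, value in config.items():
    print(f"{key}: {type(value).__name__}")
```

Note that the top level here is an object, so json.loads returns a dict; a top-level array would come back as a list.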

JSON Python

Python provides the json library for reading and writing JSON files; please check json_demo.ipynb for more details. Here are some highlights:

import json

class Params():
    """Class that loads hyperparameters from a json file.

    Example:
        params = Params(json_path)
        print(params.learning_rate)
        params.learning_rate = 0.5
    """

    def __init__(self, json_path):
        with open(json_path) as f:
            params = json.load(f)
        self.__dict__.update(params)

    def save(self, json_path):
        with open(json_path, 'w') as f:
            json.dump(self.__dict__, f, indent=4)

    def update(self, json_path):
        """Loads parameters from json file"""
        with open(json_path) as f:
            params = json.load(f)
            self.__dict__.update(params)

    @property
    def dict(self):
        """Gives dict-like access to Params instance by params.dict['learning_rate']"""
        return self.__dict__

2.2 YAML & YML

What’s YAML?

YAML is a human-friendly data serialization standard for all programming languages. It is commonly used for configuration files.

YAML Syntax

We use - to represent a list item, while : represents an associative relation (like a dictionary): the left side of : is the key and the right side is the value. Please check yaml_demo.ipynb to see how dicts and lists correspond to the contents of the file.

The basic data types supported by YAML include scalars (numbers, strings), lists, and dicts. Check YAML Data types for more details.
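As a small illustration of this syntax, the sketch below parses a made-up YAML snippet and shows how - items become a Python list and key: value pairs become a dict. It assumes the third-party PyYAML package is installed, and uses yaml.safe_load, which is enough for plain data like this.

```python
import yaml  # PyYAML, a third-party package

# A made-up configuration: scalars, a list, and a nested mapping.
text = """
model: resnet
learning_rate: 0.001
layers:
  - 64
  - 128
augmentation:
  flip: true
  crop_size: 224
"""

config = yaml.safe_load(text)

print(config["layers"])        # the "-" items, as a Python list
print(config["augmentation"])  # the nested "key: value" pairs, as a dict
```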

YAML Python

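import pandas as pd
import yaml

# load the file into a dictionary (yaml_path is a placeholder for your file's path)
with open(yaml_path) as f:
    data = yaml.load(f, Loader=yaml.FullLoader)
# flatten the (possibly nested) dict into a dataframe
df = pd.json_normalize(data)
# write a dictionary (a_dict) to a YAML file
with open("ab.yaml", "w") as fid:
    yaml.dump(a_dict, fid)

Check yaml_demo.ipynb for more details.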

2.3 XML

What is XML?

XML stands for Extensible Markup Language. It is very similar to HTML at least in terms of formatting. The main difference between the two is that HTML has pre-defined tags that are standardized. In XML, tags can be tailored to the data set.

Pascal VOC, another famous computer vision dataset, saves its annotations as XML files.

XML in Python

In Python, Beautiful Soup is commonly used to parse and manipulate XML files.
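For a dependency-free illustration, the sketch below uses the standard library's xml.etree.ElementTree rather than Beautiful Soup. The annotation is made up and only loosely modeled on the Pascal VOC layout (file name, image size, and one bounding box per object); real annotation files contain more fields.

```python
import xml.etree.ElementTree as ET

# Made-up annotation, loosely modeled on the Pascal VOC layout.
text = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>500</width><height>375</height></size>
  <object>
    <name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>370</ymax></bndbox>
  </object>
</annotation>
"""

root = ET.fromstring(text.strip())

# Collect (label, [xmin, ymin, xmax, ymax]) for every annotated object.
boxes = []
for obj in root.findall("object"):
    label = obj.findtext("name")
    bndbox = obj.find("bndbox")
    coords = [int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
    boxes.append((label, coords))

print(root.findtext("filename"), boxes)
```

Because the tags are tailored to the data set, the parsing code has to know the tag names in advance; here they are hard-coded to match the sample.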

3. Storage File Formats

3.1 CSV

What’s CSV?

CSV stands for comma-separated values.
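As a minimal sketch of what this looks like in practice, the example below reads and writes a small made-up training log with Python's built-in csv module. An in-memory buffer stands in for a file on disk, and the column names are hypothetical.

```python
import csv
import io

# Made-up training log: a header row followed by one row per epoch.
text = "epoch,train_loss,val_loss\n1,0.92,1.01\n2,0.71,0.85\n"

# DictReader uses the header row as the keys of each record.
reader = csv.DictReader(io.StringIO(text))
rows = list(reader)

# CSV carries no type information: every value is read as a string.
losses = [float(row["val_loss"]) for row in rows]
print(rows[0])
print(losses)

# Write the same records back out with DictWriter.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=reader.fieldnames)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The string-only values are the main practical difference from JSON and YAML, where numbers and booleans are typed in the file itself.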

3.2 HDF

3.3 URL

Read from URL

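import requests
import pandas as pd

# World Bank API: total population (SP.POP.TOTL) for br, cn, us, and de
url = 'http://api.worldbank.org/v2/countries/br;cn;us;de/indicators/SP.POP.TOTL/?format=json&per_page=1000'
r = requests.get(url)
r.raise_for_status()  # fail early on an HTTP error
payload = r.json()    # parse the JSON body once
print(payload)
# the second element of the response holds the data records
df = pd.DataFrame(payload[1])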