Database in Machine Learning (2): Beyond Database

Dec 31, 2020

Technical Notes in Construction

1. Introduction

In machine learning projects, a database is not the only storage medium we need. Other file formats matter as well, for two reasons:

We may need to read ordinary files as inputs to machine learning libraries. For example, many projects feed .csv files into the training pipeline.
Even when we use a database to train, evaluate, and test models, we still need files to support the pipeline. For example, we may need a JSON file to define the project’s settings.

In this article, we go through the file formats we most often meet in machine learning projects.

2. Configuration File Formats

2.1 JSON

What is JSON?

JSON stands for “JavaScript Object Notation”. It is used to store and exchange data, often as an alternative to XML.

Many well-known machine learning datasets use JSON to organize their annotations. One example is COCO, the computer vision dataset for segmentation, detection, and body key-point estimation.

JSON Syntax

JSON syntax is based on two main structures: objects and arrays. An object is a collection of name/value pairs, and an array is an ordered list of values.

JSON values include: 1) a string in double quotes; 2) a number; 3) boolean (true or false); 4) null; 5) an object; 6) an array.
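As a rough illustration of the six value types above, the sketch below parses a small JSON document with Python's json module and prints the Python type each value becomes. The document itself (keys such as name and classes) is made up for this example.

```python
import json

# A made-up JSON document that uses every value type at least once.
text = """
{
    "name": "coco-subset",
    "num_images": 5000,
    "learning_rate": 0.001,
    "shuffle": true,
    "checkpoint": null,
    "optimizer": {"type": "sgd", "momentum": 0.9},
    "classes": ["person", "dog", "cat"]
}
"""

config = json.loads(text)

# Show which Python type each JSON value was converted to.
for key, value in config.items():
    print(f"{key!r}: {type(value).__name__}")
```

After parsing, a JSON object becomes a dict, an array becomes a list, a string becomes a str, a number becomes an int or float, true/false become True/False, and null becomes None.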

JSON Python

Python provides the json library for reading and writing JSON files; please check json_demo.ipynb for more details. Here are some highlights:

json.load deserializes a file, and json.loads deserializes a string.
The object passed to json.loads may be a bytestring rather than a string. Since Python 3.6, json.loads accepts bytes and bytearray input directly; on older versions, decode it first with json.loads(data.decode('utf-8')). Conversely, if you want a bytearray rather than a string, use bytearray(json.dumps(obj), 'utf-8').
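The highlights above can be sketched as one round trip. This is a minimal example, assuming a made-up settings dictionary and a temporary file named settings.json; neither comes from the demo notebook.

```python
import json
import os
import tempfile

settings = {"batch_size": 32, "model": "resnet"}  # hypothetical settings

# json.dump / json.load work on file objects.
path = os.path.join(tempfile.mkdtemp(), "settings.json")
with open(path, "w") as f:
    json.dump(settings, f, indent=4)
with open(path) as f:
    from_file = json.load(f)

# json.dumps / json.loads work on strings.
as_text = json.dumps(settings)
from_text = json.loads(as_text)

# Going through bytes: encode to UTF-8, then decode before parsing.
as_bytes = bytearray(as_text, "utf-8")
from_bytes = json.loads(as_bytes.decode("utf-8"))

print(from_file == from_text == from_bytes == settings)
```

All three paths should give back a dictionary equal to the one we started with.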

import json


class Params():
    """Class that loads hyperparameters from a json file.

    Example:
    ```
    params = Params(json_path)
    print(params.learning_rate)
    params.learning_rate = 0.5
    ```
    """

    def __init__(self, json_path):
        with open(json_path) as f:
            params = json.load(f)
        self.__dict__.update(params)

    def save(self, json_path):
        """Saves parameters to json file"""
        with open(json_path, 'w') as f:
            json.dump(self.__dict__, f, indent=4)

    def update(self, json_path):
        """Loads parameters from json file"""
        with open(json_path) as f:
            params = json.load(f)
        self.__dict__.update(params)

    @property
    def dict(self):
        """Gives dict-like access to Params instance by params.dict['learning_rate']"""
        return self.__dict__

2.2 YAML & YML

What’s YAML?

YAML is a human-friendly data serialization standard for all programming languages. It is commonly used for configuration files.

YAML Syntax

We use - to represent a list item, while : represents an associative relation, as in a dictionary: the left side of : is the key and the right side is the value. Please check yaml_demo.ipynb to see how dicts and lists map to the contents of the file.

The basic data types supported by YAML include scalars (numbers, strings), lists, and dicts. Check YAML Data types for more details.
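To make the syntax concrete, here is a minimal sketch that parses a small YAML document into Python objects. It assumes the third-party PyYAML package is installed, and the configuration keys (model, learning_rate, augmentations) are invented for illustration. It uses yaml.safe_load, which only builds plain Python types.

```python
import yaml  # third-party PyYAML package

# A made-up configuration: "key: value" builds a dict, "- item" builds a list.
text = """
model:
  name: resnet
  layers: 18
learning_rate: 0.01
augmentations:
  - flip
  - crop
"""

config = yaml.safe_load(text)
print(config)
```

Indentation carries the nesting here: model becomes a nested dict, augmentations becomes a list of strings, and the numbers are parsed as int and float scalars.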

YAML Python

The third-party PyYAML package, imported as yaml, reads and writes .yaml files.
Pandas also has a json_normalize function, which can help transform a dictionary read from a .yaml file into a DataFrame.

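import pandas as pd
import yaml

# load the file into a dictionary
with open("ab.yaml") as f:
    data = yaml.load(f, Loader=yaml.FullLoader)
# flatten the (possibly nested) dict into a DataFrame
df = pd.json_normalize(data)
# write a dictionary back to a .yaml file
with open("ab.yaml", "w") as fid:
    yaml.dump(data, fid)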

Check yaml_demo.ipynb for more details.
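As a small sketch of what json_normalize does with nested settings, the example below flattens a hand-written dictionary. It assumes pandas is installed; the dictionary stands in for whatever yaml.load would return and is not taken from the demo notebook.

```python
import pandas as pd

# Hypothetical nested settings, as they might come back from yaml.load.
data = {"model": {"name": "resnet", "layers": 18}, "learning_rate": 0.01}

# Nested keys are flattened into dotted column names; the result has one row.
df = pd.json_normalize(data)
print(df.columns.tolist())
print(df.loc[0, "model.name"])
```

A single dictionary gives a one-row DataFrame; a list of such dictionaries would give one row per item.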

2.3 XML

What is XML?

XML stands for Extensible Markup Language. It is very similar to HTML at least in terms of formatting. The main difference between the two is that HTML has pre-defined tags that are standardized. In XML, tags can be tailored to the data set.

Pascal VOC, another well-known computer vision dataset, stores its annotations as XML files.

XML in Python

In Python, Beautiful Soup is a common choice for parsing and manipulating XML files.
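For a quick look at reading such an annotation, here is a minimal sketch. Note that it uses the standard library's xml.etree.ElementTree rather than Beautiful Soup, so that it runs without extra packages. The XML snippet is made up in the style of a Pascal VOC annotation and is not copied from the dataset.

```python
import xml.etree.ElementTree as ET

# A made-up annotation in the style of a Pascal VOC file.
text = """
<annotation>
    <filename>example.jpg</filename>
    <object>
        <name>dog</name>
        <bndbox>
            <xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax>
        </bndbox>
    </object>
    <object>
        <name>person</name>
        <bndbox>
            <xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>498</ymax>
        </bndbox>
    </object>
</annotation>
"""

root = ET.fromstring(text.strip())

# Collect (label, (xmin, ymin, xmax, ymax)) for every annotated object.
boxes = []
for obj in root.findall("object"):
    box = obj.find("bndbox")
    coords = tuple(int(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax"))
    boxes.append((obj.find("name").text, coords))

print(root.find("filename").text, boxes)
```

Because XML tags are free-form, the code has to know the tag names (object, bndbox, and so on) in advance; element text is always a string, so the coordinates are converted with int.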

3. Storage File Formats

3.1 CSV

What’s CSV?

CSV stands for comma-separated values.
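As a minimal sketch of reading comma-separated values, the example below uses Python's built-in csv module. The column names and rows are invented, and io.StringIO stands in for an open .csv file so the example needs nothing on disk.

```python
import csv
import io

# Made-up training data; io.StringIO stands in for an open .csv file.
text = "image,label,score\ncat_01.jpg,cat,0.93\ndog_07.jpg,dog,0.88\n"

# DictReader uses the first row as the header and yields one dict per row.
rows = list(csv.DictReader(io.StringIO(text)))

# Every field is read as a string, so numeric columns need converting.
scores = [float(row["score"]) for row in rows]
print(rows[0]["label"], scores)
```

Unlike JSON or YAML, a CSV file carries no type information, which is why the score column has to be converted explicitly.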

3.2 HDF

3.3 URL

Read from URL

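import requests
import pandas as pd

url = 'http://api.worldbank.org/v2/countries/br;cn;us;de/indicators/SP.POP.TOTL/?format=json&per_page=1000'
r = requests.get(url)
r.raise_for_status()  # stop early if the request failed
# the response is a two-item list: paging metadata, then the records
payload = r.json()
df = pd.DataFrame(payload[1])
print(df.head())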