Database in Machine Learning (2): Beyond Database
Technical Notes in Construction
In machine learning projects, a database is not the only medium we need. We also need other file formats, for two reasons:
- We may need to read ordinary files as inputs to machine learning libraries. For example, in many machine learning projects, a .csv file is used as the input of the training pipeline.
- Even when we use a database to train, evaluate, and test models, we still need files to support the pipeline. For example, we may need a JSON file to define the project’s settings.
In this article, we detail the different file formats we often meet in machine learning projects.
2. Configuration File Formats
What is JSON?
Many famous machine learning databases use JSON to organize annotation and label information. One example is COCO, the famous computer vision database for segmentation, detection, and body keypoints.
JSON syntax is based on two main structures: objects and arrays. An object is a collection of name/value pairs, and an array is an ordered list of values.
JSON values include: 1) a string in double quotes; 2) a number; 3) boolean (true or false); 4) null; 5) an object; 6) an array.
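To make the six value types concrete, here is a minimal sketch using Python's standard json module. The field names and values are hypothetical and chosen only so that each value type appears once.

```python
import json

# Hypothetical JSON text covering every value type listed above.
text = """
{
  "name": "demo",
  "epochs": 10,
  "learning_rate": 0.001,
  "shuffle": true,
  "checkpoint": null,
  "optimizer": {"type": "adam"},
  "layers": [64, 32]
}
"""

config = json.loads(text)

# Each JSON value type maps to a Python type.
for key, value in config.items():
    print(f"{key}: {type(value).__name__}")
```

A JSON string becomes a str, a number becomes an int or a float, true/false become bool, null becomes None, an object becomes a dict, and an array becomes a list.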
Python provides the json library for reading and writing JSON files; please check json_demo.ipynb for more details. Here are some highlights:
- json.load deserializes a file, and json.loads deserializes a string.
- The object that json.loads works on may be a bytestring rather than a string. In this case, transforming the bytestring to a string is necessary: json.loads(data.decode('utf-8')). On the contrary, if you want to have a bytearray rather than a string, then you should use
import json

class Params():
    """Class that loads hyperparameters from a json file.

    Example:
    ```
    params = Params(json_path)
    print(params.learning_rate)
    params.learning_rate = 0.5
    ```
    """

    def __init__(self, json_path):
        with open(json_path) as f:
            params = json.load(f)
            self.__dict__.update(params)

    def save(self, json_path):
        with open(json_path, 'w') as f:
            json.dump(self.__dict__, f, indent=4)

    def update(self, json_path):
        """Loads parameters from json file"""
        with open(json_path) as f:
            params = json.load(f)
            self.__dict__.update(params)

    @property
    def dict(self):
        """Gives dict-like access to Params instance by params.dict['learning_rate']"""
        return self.__dict__
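The difference between json.load, json.loads, and byte input mentioned in the highlights can be sketched as follows. The settings dictionary is hypothetical, io.StringIO stands in for a real file, and the encode step is only an assumption about one way to go from a string back to bytes.

```python
import io
import json

# Hypothetical settings used only for illustration.
settings = {"learning_rate": 0.5, "batch_size": 32}

# json.dumps serializes to a string; json.loads reads a string back.
as_text = json.dumps(settings)
from_text = json.loads(as_text)

# json.load reads from a file-like object; StringIO stands in for an open file.
from_file = json.load(io.StringIO(as_text))

# Bytes (for example from a network response) can be decoded to str before parsing.
# Assumption: the reverse direction is done here with str.encode.
as_bytes = as_text.encode("utf-8")
from_bytes = json.loads(as_bytes.decode("utf-8"))

print(from_text == from_file == from_bytes == settings)
```

All three routes should give back the same dictionary, so which one to use depends only on whether the data arrives as a file, a string, or bytes.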
2.2 YAML & YML
YAML is a human-friendly data serialization standard for all programming languages. It is commonly used for configuration files.
YAML uses - to represent a list, while : represents an associative relation (like a dictionary): the left of : is the key, while the right of : is the value. Please check yaml_demo.ipynb to see how dictionaries and lists correspond to the contents of the file.
The basic data types supported by YAML are scalars (numbers, strings), lists, dicts, etc. Check YAML Data types for more details.
- The PyYAML package provides the yaml module for reading and writing .yaml files.
- Pandas also has a json_normalize function, which can help transform a dictionary read from a .yaml file into a DataFrame.
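import pandas as pd
import yaml
# load file to be a dictionary (input file name assumed to be ab.yaml)
with open("ab.yaml") as f:
    data = yaml.load(f, Loader=yaml.FullLoader)
# transform dict to dataframe
df = pd.json_normalize(data)
# write the dictionary back to a .yaml file (assumed: via yaml.dump)
with open("ab.yaml", "w") as fid:
    yaml.dump(data, fid)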
Check yaml_demo.ipynb for more details.
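To make the link between YAML syntax and Python objects concrete, here is a minimal sketch. It assumes the PyYAML package is installed, and the configuration text is hypothetical rather than taken from yaml_demo.ipynb. It uses yaml.safe_load, which only builds plain Python objects.

```python
import yaml  # provided by the third-party PyYAML package

# Hypothetical configuration text, not taken from yaml_demo.ipynb.
text = """
model:
  name: resnet
  layers:
    - 64
    - 32
learning_rate: 0.001
augment: true
"""

# safe_load only builds plain Python objects (dict, list, str, int, float, bool, ...).
config = yaml.safe_load(text)

print(config["model"]["layers"])
print(type(config["learning_rate"]).__name__)
```

Each key: value pair ends up as a dictionary entry, and each run of - items ends up as a list, so the nested structure of the file is mirrored by nested dicts and lists.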
What is XML?
XML stands for Extensible Markup Language. It is very similar to HTML at least in terms of formatting. The main difference between the two is that HTML has pre-defined tags that are standardized. In XML, tags can be tailored to the data set.
In Pascal VOC, another famous computer vision database, annotations are saved as XML files.
XML in Python
In Python, Beautiful Soup is commonly used to parse and manipulate XML files.
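As a dependency-free illustration, the sketch below uses the standard library's xml.etree.ElementTree instead of Beautiful Soup. The annotation text is hypothetical: it imitates the style of a Pascal VOC file, and its values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the style of a Pascal VOC file (values invented).
text = """
<annotation>
  <filename>example.jpg</filename>
  <object>
    <name>dog</name>
    <bndbox>
      <xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax>
    </bndbox>
  </object>
</annotation>
"""

root = ET.fromstring(text.strip())

# Collect (label, [xmin, ymin, xmax, ymax]) for every annotated object.
boxes = []
for obj in root.findall("object"):
    label = obj.findtext("name")
    bndbox = obj.find("bndbox")
    coords = [int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
    boxes.append((label, coords))

print(root.findtext("filename"), boxes)
```

Because the tags are tailored to the data set, the code has to know the tag names (object, name, bndbox) in advance. XML element text is always a string, so the coordinates are converted with int.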
3. Storage File Formats
CSV stands for comma-separated values.
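As a minimal sketch of what this looks like in practice, the example below parses a small CSV with the standard library's csv module; pandas.read_csv is a common alternative. The column names and values are hypothetical, and io.StringIO stands in for an open .csv file.

```python
import csv
import io

# Hypothetical CSV content; StringIO stands in for an open .csv file.
text = "sepal_length,sepal_width,label\n5.1,3.5,setosa\n6.2,2.9,versicolor\n"

# DictReader uses the first row as column names.
rows = list(csv.DictReader(io.StringIO(text)))

# CSV stores everything as text, so numeric columns are converted explicitly.
features = [[float(r["sepal_length"]), float(r["sepal_width"])] for r in rows]
labels = [r["label"] for r in rows]

print(features, labels)
```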
Read from URL
import pandas as pd
import requests
url = 'http://api.worldbank.org/v2/countries/br;cn;us;de/indicators/SP.POP.TOTL/?format=json&per_page=1000'
r = requests.get(url)