Technical Notes in Construction
1. Introduction
In machine learning projects, a database is not the only storage medium we need. Other file formats matter as well, for two reasons:
- We often need to read ordinary files as inputs to machine learning libraries. For example, many projects feed .csv files into the training pipeline.
- Even when we use a database to train, evaluate, and test models, we still need files to support the pipeline. For example, we may need a JSON file to define the project's settings.
In this article, we go through the file formats that commonly appear in machine learning projects.
2. Configuration File Formats
2.1 JSON
What is JSON?
JSON stands for "JavaScript Object Notation". It is used to store and exchange data, and serves as an alternative to XML.
Many well-known machine learning datasets use JSON to organize annotation and label information. One example is COCO, the computer vision dataset for segmentation, detection, and body key-points.
JSON Syntax
JSON syntax is based on two main structures: objects and arrays. An object is a collection of name/value pairs, and an array is an ordered list of values.
JSON values include: 1) a string in double quotes; 2) a number; 3) boolean (true or false); 4) null; 5) an object; 6) an array.
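As a small illustration of the six value kinds listed above, the sketch below parses a JSON document with Python's standard json module and prints the Python type each value is decoded to. The document contents (keys and values) are hypothetical and chosen only to cover every kind.

```python
import json

# Hypothetical JSON document: one entry per JSON value kind.
text = """
{
  "name": "coco",
  "version": 1.5,
  "public": true,
  "license": null,
  "info": {"year": 2017},
  "categories": ["person", "dog"]
}
"""

doc = json.loads(text)

# Show which Python type each JSON value was decoded to.
for key, value in doc.items():
    print(key, type(value).__name__)
```

Objects become dict, arrays become list, and null becomes None, which is why JSON maps so naturally onto Python data structures.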
JSON Python
Python provides the json library for reading and writing JSON files; please check json_demo.ipynb for more details. Here are some highlights:
- json.load deserializes a file, and json.loads deserializes a string.
- The object that json.loads works on may be a bytestring rather than a string. In that case, decode it to a string first: json.loads(data.decode('utf-8')). This was required before Python 3.6; later versions also accept bytes directly. Conversely, if you want a bytearray rather than a string, use bytearray(json.dumps(dict), 'utf-8').
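The highlights above can be sketched as a round trip through a string, a bytestring, and a file. This is a minimal sketch: the settings dictionary and the file name settings.json are hypothetical, and the file is written to a temporary directory.

```python
import json
import os
import tempfile

settings = {"learning_rate": 0.001, "epochs": 10}  # hypothetical settings

# json.dumps / json.loads work on strings.
as_text = json.dumps(settings)
from_text = json.loads(as_text)

# A bytestring is decoded to str before being passed to json.loads.
as_bytes = as_text.encode("utf-8")
from_bytes = json.loads(as_bytes.decode("utf-8"))

# bytearray(str, encoding) produces a mutable byte sequence.
as_bytearray = bytearray(json.dumps(settings), "utf-8")

# json.dump / json.load work on file objects.
path = os.path.join(tempfile.mkdtemp(), "settings.json")
with open(path, "w") as f:
    json.dump(settings, f, indent=4)
with open(path) as f:
    from_file = json.load(f)
```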
import json


class Params():
    """Class that loads hyperparameters from a json file.

    Example:
    ```
    params = Params(json_path)
    print(params.learning_rate)
    params.learning_rate = 0.5
    ```
    """

    def __init__(self, json_path):
        with open(json_path) as f:
            params = json.load(f)
            self.__dict__.update(params)

    def save(self, json_path):
        with open(json_path, 'w') as f:
            json.dump(self.__dict__, f, indent=4)

    def update(self, json_path):
        """Loads parameters from json file"""
        with open(json_path) as f:
            params = json.load(f)
            self.__dict__.update(params)

    @property
    def dict(self):
        """Gives dict-like access to Params instance by params.dict['learning_rate']"""
        return self.__dict__
2.2 YAML & YML
What’s YAML?
YAML is a human-friendly data serialization standard for all programming languages. It is commonly used for configuration files.
YAML Syntax
A leading hyphen (-) marks an item of a list, while a colon (:) expresses an associative relation, as in a dictionary: what stands to the left of the colon is the key and what stands to the right is the value. Please check yaml_demo.ipynb to see how dict and list correspond to the contents of the file.
The basic data types supported by YAML include scalars (numbers, strings), lists, and dictionaries. Check YAML Data types for more details.
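To make the syntax concrete, here is a small sketch that places a YAML document next to the Python structure it corresponds to. The keys and values are hypothetical. Parsing assumes the third-party PyYAML package (imported as yaml) is installed, so the comparison is skipped when it is missing.

```python
# Hypothetical YAML document: ":" separates key and value, "-" marks list items.
yaml_text = """
name: demo
learning_rate: 0.001
layers:
  - 64
  - 32
"""

# The Python structure the document above corresponds to.
expected = {"name": "demo", "learning_rate": 0.001, "layers": [64, 32]}

try:
    import yaml  # PyYAML, a third-party package (assumed to be installed)
except ImportError:
    yaml = None

if yaml is not None:
    parsed = yaml.safe_load(yaml_text)
    print(parsed == expected)
```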
YAML Python
- The PyYAML package provides the yaml module for reading and writing .yaml files.
- Pandas also has a json_normalize function, which can help transform a dictionary read from a .yaml file into a DataFrame.
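import pandas as pd
import yaml

# load file to be a dictionary (the file name here is illustrative)
with open("ab.yaml") as f:
    data = yaml.load(f, Loader=yaml.FullLoader)
# transform dict to dataframe
df = pd.json_normalize(data)
# writing (a_dict is any dictionary you want to serialize)
with open("ab.yaml", "w") as fid:
    yaml.dump(a_dict, fid)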
Check yaml_demo.ipynb for more details.
2.3 XML
What is XML?
XML stands for Extensible Markup Language. It is very similar to HTML at least in terms of formatting. The main difference between the two is that HTML has pre-defined tags that are standardized. In XML, tags can be tailored to the data set.
Pascal VOC, another well-known computer vision dataset, saves its annotations as XML files.
XML in Python
In Python, people use Beautiful Soup to manipulate XML files.
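As a rough illustration of reading such an annotation, the sketch below parses a small XML snippet. Note that it uses the standard library's xml.etree.ElementTree rather than Beautiful Soup, so that it runs without extra packages. The snippet is hypothetical: it is modeled on the Pascal VOC annotation layout, and the file name and box coordinates are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation modeled on the Pascal VOC layout.
voc_xml = """
<annotation>
  <filename>000001.jpg</filename>
  <object>
    <name>dog</name>
    <bndbox>
      <xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax>
    </bndbox>
  </object>
  <object>
    <name>person</name>
    <bndbox>
      <xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>498</ymax>
    </bndbox>
  </object>
</annotation>
""".strip()

root = ET.fromstring(voc_xml)

# Collect (label, (xmin, ymin, xmax, ymax)) for every annotated object.
boxes = []
for obj in root.iter("object"):
    box = obj.find("bndbox")
    coords = tuple(int(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax"))
    boxes.append((obj.find("name").text, coords))

print(root.find("filename").text, boxes)
```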
3. Storage File Formats
3.1 CSV
What’s CSV?
CSV stands for comma-separated values.
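A minimal sketch of reading CSV data with Python's standard csv module follows. The column names and rows are hypothetical, and the data is read from an in-memory string instead of a file so the example is self-contained.

```python
import csv
import io

# Hypothetical CSV content: a header line followed by two records.
csv_text = "id,label,score\n1,cat,0.9\n2,dog,0.75\n"

# DictReader uses the header line as keys; every value is read as a string.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# Convert a column to numbers explicitly when needed.
scores = [float(row["score"]) for row in rows]
```

Because CSV carries no type information, numeric columns have to be converted by the reader, which is one reason libraries such as pandas are often used for this step.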
3.2 HDF
3.3 URL
Read from URL
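import requests
import pandas as pd

# World Bank API: total population (SP.POP.TOTL) for Brazil, China, the US and Germany
url = 'http://api.worldbank.org/v2/countries/br;cn;us;de/indicators/SP.POP.TOTL/?format=json&per_page=1000'
r = requests.get(url)
# parse the JSON response body once and reuse it
data = r.json()
print(data)
# the second element of the response holds the records
df = pd.DataFrame(data[1])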