Database in Machine Learning (2): Beyond Database

1. Introduction

  • Machine learning libraries often need to read ordinary files as input. For example, many machine learning projects use .csv files as the input to the training pipeline.
  • Even when we use a database to train, evaluate, and test models, we still need files to support the pipeline. For example, a JSON file can define the project's settings.

2. Configuration File Formats

2.1 JSON

  • json.load deserializes a file object, while json.loads deserializes a string.
  • Sometimes the object passed to json.loads is a bytestring rather than a string. Decoding it first always works: json.loads(data.decode('utf-8')). (Since Python 3.6, json.loads also accepts bytes directly.) Conversely, if you want a bytearray rather than a string, use bytearray(json.dumps(d), 'utf-8'), where d is the dictionary to serialize.
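The two points above can be sketched with the standard library alone. The settings dictionary is a made-up example, and io.StringIO stands in for a real file so the snippet runs without touching disk.

```python
import io
import json

# Hypothetical settings, used only for illustration
settings = {"learning_rate": 0.01, "batch_size": 32}

# json.dumps / json.loads work on strings
text = json.dumps(settings)
from_string = json.loads(text)

# json.dump / json.load work on file-like objects
buffer = io.StringIO()
json.dump(settings, buffer)
buffer.seek(0)  # rewind before reading back
from_file = json.load(buffer)

# bytes -> str: decode before loading
raw = text.encode("utf-8")
from_bytes = json.loads(raw.decode("utf-8"))

# str -> bytearray: encode after dumping
as_bytearray = bytearray(text, "utf-8")
```

All three loaded values compare equal to the original dictionary, and as_bytearray holds the same bytes as raw.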
import json


class Params():
    """Class that loads hyperparameters from a json file.

    Example:
        params = Params(json_path)
        print(params.learning_rate)
        params.learning_rate = 0.5
    """

    def __init__(self, json_path):
        with open(json_path) as f:
            params = json.load(f)
        self.__dict__.update(params)

    def save(self, json_path):
        with open(json_path, 'w') as f:
            json.dump(self.__dict__, f, indent=4)

    def update(self, json_path):
        """Loads parameters from json file"""
        with open(json_path) as f:
            params = json.load(f)
            self.__dict__.update(params)

    @property
    def dict(self):
        """Gives dict-like access to Params instance by params.dict['learning_rate']"""
        return self.__dict__

2.2 YAML & YML

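  • The third-party PyYAML package (imported as yaml) reads and writes .yaml files. It is not part of the Python standard library.
  • Pandas has a json_normalize function, which can turn a dictionary read from a .yaml file into a DataFrame, flattening nested keys into columns.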
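import pandas as pd
import yaml

# load the file into a dictionary
with open("ab.yaml") as f:
    data = yaml.load(f, Loader=yaml.FullLoader)
# transform the dict into a DataFrame
df = pd.json_normalize(data)
# write a dictionary (a_dict is whatever dict you want to save) to a .yaml file
with open("ab.yaml", "w") as fid:
    yaml.dump(a_dict, fid)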
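To see what flattening a nested configuration means, here is a minimal hand-written sketch. It is not pandas's implementation; it only mimics the dotted column names that json_normalize produces for nested dictionaries. The flatten helper and the config dictionary are hypothetical names introduced for illustration.

```python
def flatten(nested, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # recurse into sub-dictionaries, carrying the key prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical config, shaped like what a .yaml file might load into
config = {"model": {"name": "resnet", "layers": 18}, "lr": 0.1}
row = flatten(config)
print(row)
```

Each key of row corresponds to one column name of the DataFrame you would get from pd.json_normalize(config), such as model.name and model.layers.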

2.3 XML

3. Storage File Formats

3.1 CSV

3.2 HDF

3.3 URL

import requests
import pandas as pd

url = 'http://api.worldbank.org/v2/countries/br;cn;us;de/indicators/SP.POP.TOTL/?format=json&per_page=1000'
r = requests.get(url)
# parse the response body once and reuse it
data = r.json()
# the second element of the response holds the records
df = pd.DataFrame(data[1])
print(data)
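The index [1] in the code above suggests that the response is a two-element list: some metadata first, then the records. The sketch below assumes that shape and uses a small made-up payload; the field names and the numbers are placeholders for illustration, not real API output or real population figures.

```python
# Hypothetical payload shaped as [metadata, records]; values are placeholders
payload = [
    {"page": 1, "pages": 1, "per_page": 1000, "total": 2},
    [
        {"country": {"id": "BR", "value": "Brazil"}, "date": "2001", "value": 100},
        {"country": {"id": "CN", "value": "China"}, "date": "2001", "value": 200},
    ],
]

# split the response into its two parts
metadata, records = payload

# keep only the fields of interest, one flat dict per record
rows = [
    {
        "country": rec["country"]["value"],
        "year": int(rec["date"]),
        "population": rec["value"],
    }
    for rec in records
    if rec["value"] is not None  # skip records without a value
]
print(rows)
```

A list of flat dictionaries like rows can be passed straight to pd.DataFrame, which avoids nested dictionaries ending up inside DataFrame cells.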

ifeelfree
