Record-like Data Structure in Python

ifeelfree
4 min read · Feb 20, 2021

Part 1: Introduction

There are several ways to represent record-like data in Python:

  • write a class that organizes the data
  • use a named tuple or a dictionary, or their typed versions from typing
  • use a dataclass

We recommend dataclass, since it offers the most options for customization.

However, a dataclass used purely as a data holder can be a problem from a software engineering perspective:

These are classes that have fields, getting and setting methods for fields, and nothing else. Such classes are dumb data holders and are often being manipulated in far too much detail by other classes.

(from Refactoring, by Martin Fowler and Kent Beck)

The main idea of OOP is to place behavior and data together in the same code unit.

Part 2: NamedTuple/NamedDict

namedtuple

from collections import namedtuple
Coordinate = namedtuple('Coordinate', 'lat long')
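
To make the behavior of a namedtuple concrete, here is a minimal sketch. The variable names and coordinate values are made up for illustration; `_replace` and `_asdict` are standard namedtuple methods.

```python
from collections import namedtuple

Coordinate = namedtuple('Coordinate', 'lat long')

# Illustrative values only.
point = Coordinate(lat=10.5, long=20.25)

# Fields are readable by name or by position.
print(point.lat)
print(point[1])

# Instances are immutable: assigning to a field raises AttributeError.
try:
    point.lat = 0.0
    assignment_failed = False
except AttributeError:
    assignment_failed = True

# _replace returns a modified copy; _asdict converts the record to a dict.
shifted = point._replace(long=40.0)
as_dict = point._asdict()
print(shifted)
print(as_dict)
```

Because the record is a plain tuple underneath, it can also be unpacked with `lat, long = point`.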

NamedTuple from typing

from typing import NamedTuple

class Coordinate(NamedTuple):
    lat: float
    long: float

    def __str__(self):
        ns = 'N' if self.lat >= 0 else 'S'
        we = 'E' if self.long >= 0 else 'W'
        return f'{abs(self.lat):.1f}°{ns}, {abs(self.long):.1f}°{we}'

a = Coordinate(3.4, 5.6)

An example can be found on my GitHub.
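
The sketch below shows how the typed Coordinate from this part might be used. The second coordinate is an arbitrary value chosen to exercise the 'S' and 'W' branches.

```python
from typing import NamedTuple

class Coordinate(NamedTuple):
    lat: float
    long: float

    def __str__(self):
        ns = 'N' if self.lat >= 0 else 'S'
        we = 'E' if self.long >= 0 else 'W'
        return f'{abs(self.lat):.1f}°{ns}, {abs(self.long):.1f}°{we}'

a = Coordinate(3.4, 5.6)
b = Coordinate(-33.9, -58.4)   # arbitrary values for illustration

north_east = str(a)
south_west = str(b)
print(north_east)
print(south_west)

# A NamedTuple subclass is still a tuple, so unpacking works.
lat, long = a
print(isinstance(a, tuple))
```

Note that the type hints are not enforced at run time; they document intent and help static type checkers.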

Part 3: Dataclasses

from dataclasses import dataclass

@dataclass(frozen=True)
class Coordinate:
    lat: float
    long: float = 3.4

    def __str__(self):
        ns = 'N' if self.lat >= 0 else 'S'
        we = 'E' if self.long >= 0 else 'W'
        return f'{abs(self.lat):.1f}°{ns}, {abs(self.long):.1f}°{we}'
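
As a rough illustration of what frozen=True changes, the following sketch uses a reduced version of the class above without `__str__`. The variable names are made up for the example.

```python
from dataclasses import dataclass, FrozenInstanceError, replace

@dataclass(frozen=True)
class Coordinate:
    lat: float
    long: float = 3.4

origin = Coordinate(1.0)           # long falls back to its default, 3.4

# frozen=True blocks attribute assignment after construction.
try:
    origin.lat = 2.0
    was_blocked = False
except FrozenInstanceError:
    was_blocked = True

# dataclasses.replace builds a modified copy instead.
moved = replace(origin, lat=2.0)

# With eq=True (the default) and frozen=True, instances are hashable,
# so equal coordinates collapse to one entry in a set.
visited = {origin, Coordinate(1.0, 3.4)}
print(was_blocked, moved, len(visited))
```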

(1) The default settings of @dataclass are as follows:

@dataclass(*, init=True, repr=True, eq=True, order=False, unsafe_hash=False, frozen=False)
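
A small sketch of what these defaults mean in practice, using a hypothetical Point class that is not part of the original examples:

```python
from dataclasses import dataclass

@dataclass
class Point:          # hypothetical class using every default setting
    x: int
    y: int

p, q = Point(1, 2), Point(1, 2)

# eq=True generates __eq__, so instances compare by field values.
equal_by_value = (p == q)

# order=False: no __lt__ and friends are generated.
try:
    p < q
    supports_ordering = True
except TypeError:
    supports_ordering = False

# eq=True with frozen=False sets __hash__ to None, so instances are unhashable.
try:
    hash(p)
    is_hashable = True
except TypeError:
    is_hashable = False

# frozen=False: fields can be reassigned.
p.x = 10
print(equal_by_value, supports_ordering, is_hashable, p.x)
```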

Individual fields can be customized further with field():

from dataclasses import dataclass, field

@dataclass(order=True)
class PlayingCard:
    sort_index: int = field(init=False, repr=False)
    val: int

    def __post_init__(self):
        self.sort_index = self.val * 2

a = PlayingCard(3)
print(a)
print(a.sort_index)
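
The point of order=True is that instances can be compared and sorted. The generated comparison methods compare the fields as a tuple in definition order, so sort_index is compared first. A minimal sketch, with made-up card values:

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class PlayingCard:
    sort_index: int = field(init=False, repr=False)
    val: int

    def __post_init__(self):
        self.sort_index = self.val * 2

cards = [PlayingCard(7), PlayingCard(2), PlayingCard(5)]

# sorted() relies on the generated __lt__, which compares (sort_index, val).
ordered_values = [card.val for card in sorted(cards)]
print(ordered_values)

print(PlayingCard(2) < PlayingCard(7))

# repr=False keeps sort_index out of the printed representation.
print(PlayingCard(3))
```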

(2) @dataclass rejects mutable default values such as list, dict, or set, because a default object shared by every instance is a common source of bugs in Python. A field can still have a mutable type, as long as the value is passed in when the object is created.

@dataclass
class MyNumber:
    a: list

obj = MyNumber(['a', 'b'])
print(obj)

However, the following definition is rejected:

@dataclass
class MyNumber:
    a: list = ['a', 'b']  # ValueError is raised here, when the class is defined

obj = MyNumber()  # never reached
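
To see the failure without stopping a script, the class definition can be wrapped in try/except. This is only a sketch for inspection; the variable error_message is introduced here for illustration.

```python
from dataclasses import dataclass

error_message = None
try:
    @dataclass
    class MyNumber:
        a: list = ['a', 'b']   # mutable default: rejected by @dataclass
except ValueError as exc:
    # The decorator raises while the class is being created,
    # before any instance exists.
    error_message = str(exc)

print(error_message)
```

The message itself points to default_factory as the fix.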

To solve this problem, pass a default_factory to the field() function:

Example 1: empty list

from dataclasses import dataclass, field

@dataclass
class ClubMember:
    name: str
    guests: list = field(default_factory=list)

a = ClubMember("my")
a.guests.append('3')
a.guests.append('4')
print(a)

Example 2: list with initialization

import random
from dataclasses import dataclass, field

def get_random_marks():
    return [random.randint(1, 10) for _ in range(5)]

@dataclass
class Student:
    marks: list = field(default_factory=get_random_marks)

a = Student()
print(a)

Example 3: list with type annotation and initialization

import random
from dataclasses import dataclass, field
from typing import List

def get_random_marks():
    return [random.randint(1, 10) for _ in range(5)]

@dataclass
class Student:
    marks: List[int] = field(default_factory=get_random_marks)

b = Student()
print(b)
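
The reason default_factory avoids the shared-default bug is that the factory is called once per instance. A minimal sketch, reusing the ClubMember idea from Example 1 with made-up names:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ClubMember:
    name: str
    guests: List[str] = field(default_factory=list)

ann = ClubMember('Ann')
bob = ClubMember('Bob')

# Each instance received its own list from the factory,
# so appending to one does not touch the other.
ann.guests.append('Carl')

print(ann.guests)
print(bob.guests)
print(ann.guests is bob.guests)
```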

(3) Class attribute vs. instance field

@dataclass
class MyClass:
    all_set_as_class_attribute = {'p1', 'p2'}  # no annotation, so not a field
    all_set: set

a = MyClass({'a', 'b'})
b = MyClass({'aa', 'bb'})
print(a)
print(b)
print(a.all_set_as_class_attribute)
a.all_set_as_class_attribute.add('ppppppp')
print(b.all_set_as_class_attribute)
print(MyClass.all_set_as_class_attribute)

Its output is as follows:

MyClass(all_set={'a', 'b'})
MyClass(all_set={'aa', 'bb'})
{'p1', 'p2'}
{'p1', 'p2', 'ppppppp'}
{'p1', 'p2', 'ppppppp'}

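My understanding is that a name defined in the class body without a type annotation is not a dataclass field. It is an ordinary class attribute shared by all instances, so mutating it in place through one instance is visible through every other instance and through the class itself.

A field that has a default value is different: the default is kept as a class attribute, but every instance gets its own instance attribute.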

@dataclass
class MyClass:
    obj_name: str = 'ab'

a = MyClass()
print(a)
b = MyClass('ef')
print(b)
print(MyClass.obj_name)

Its output is:

MyClass(obj_name='ab')
MyClass(obj_name='ef')
ab
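
One way to check which names in item (3) are real fields is dataclasses.fields(). The sketch below also uses typing.ClassVar, which marks a class attribute explicitly while keeping a type annotation. The attribute names registry and untyped are made up for this example.

```python
from dataclasses import dataclass, fields
from typing import ClassVar, Set

@dataclass
class MyClass:
    registry: ClassVar[Set[str]] = set()   # annotated class attribute, not a field
    untyped = 'shared'                     # no annotation: also not a field
    obj_name: str = 'ab'                   # real field with a default

field_names = [f.name for f in fields(MyClass)]
print(field_names)

a = MyClass()
b = MyClass('ef')

# Rebinding the field on one instance leaves other instances
# and the class-level default untouched.
a.obj_name = 'changed'
print(a.obj_name, b.obj_name, MyClass.obj_name)
```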

(4) __post_init__ is used to post-process the initialized @dataclass object.

@dataclass
class MyClass:
    all_objects = set()  # all_objects: ClassVar[Set[str]] = set()
    obj_name: str = 'a'

    def __post_init__(self):
        cls = self.__class__
        if self.obj_name:
            cls.all_objects.add(self.obj_name)

a = MyClass('a')
b = MyClass('b')
c = MyClass('c')
d = MyClass('b')
print(a.all_objects)

Its output is:

{'a', 'b', 'c'}

Another good use of __post_init__ is looking up a missing value in a database:

from dataclasses import dataclass, InitVar

@dataclass
class C:
    i: int
    j: int = None
    database: InitVar[int] = None

    def __post_init__(self, database):
        if self.j is None and database is not None:
            self.j = database.lookup('j')

# my_database is assumed to be defined elsewhere and to provide lookup()
c = C(10, database=my_database)

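In this example, InitVar marks database as an init-only pseudo-field: it is accepted by the generated __init__ and passed on to __post_init__, but it is not stored as a field of the instance.

My understanding of the difference between InitVar and field(init=False, repr=False) is that an InitVar is a parameter of __init__ but not a field, whereas field(init=False) declares a real field that is left out of the __init__ parameters and is usually set in __post_init__.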

InitVar Example

from dataclasses import dataclass, InitVar

@dataclass
class C:
    i: int
    j: int = None
    database: InitVar[list] = None

    def __post_init__(self, database):
        if self.j is None and database is not None:
            self.j = self.i in database  # stands in for database.lookup('j')

c = C(10, database=["a", "b", "c"])
print(c)

field Example

from dataclasses import dataclass, field

@dataclass(order=True)
class PlayingCard:
    sort_index: int = field(init=False, repr=False)
    val: int
    val2: int

    def __post_init__(self):
        self.sort_index = self.val2 * 2

a = PlayingCard(3, 30)
print(a)
print(a.sort_index)
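
The two mechanisms can also be put side by side in a single class. The Rectangle class below is hypothetical and only meant as a sketch: scale is an InitVar (accepted by `__init__`, never stored), while area is a field(init=False) (stored, but not accepted by `__init__`).

```python
from dataclasses import dataclass, field, fields, InitVar

@dataclass
class Rectangle:                       # hypothetical class for illustration
    width: float
    height: float
    scale: InitVar[float]              # init-only: passed to __post_init__, not stored
    area: float = field(init=False, repr=False)   # stored, computed after init

    def __post_init__(self, scale):
        self.width *= scale
        self.height *= scale
        self.area = self.width * self.height

r = Rectangle(2.0, 3.0, scale=2.0)

# fields() lists real fields only, so scale is absent and area is present.
field_names = [f.name for f in fields(r)]
stored = vars(r)
print(field_names)
print(stored)
```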

Part 4: Struct

The struct module converts between Python values and C/C++ structs stored as packed bytes.

Structure in C++

struct MetroArea {
    int year;
    char name[12];
    char country[2];
    float population;
};

Read C++ structure in Python

from struct import iter_unpack

FORMAT = 'i12s2sf'  # int, char[12], char[2], float

def text(field: bytes) -> str:
    octets = field.split(b'\0', 1)[0]  # keep the bytes before the first null
    return octets.decode('cp437')

with open('metro_areas.bin', 'rb') as fp:
    data = fp.read()

for fields in iter_unpack(FORMAT, data):  # one tuple per packed record
    year, name, country, pop = fields
    place = text(name) + ', ' + text(country)
    print(f'{year}\t{place}\t{pop:,.0f}')

struct and memoryview are used to interpret bytes as packed binary data.
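
Since metro_areas.bin is not included here, the following sketch builds the bytes in memory with struct.pack and reads them back with the same format string. The records are made-up sample data, not real statistics.

```python
from struct import calcsize, iter_unpack, pack

FORMAT = 'i12s2sf'   # int, char[12], char[2], float (native byte order and alignment)

# Made-up sample records for illustration only.
records = [
    (2020, b'Alpha', b'AA', 1000.0),
    (2021, b'Beta', b'BB', 2500.0),
]

# pack() pads short byte strings with null bytes up to the declared width.
data = b''.join(pack(FORMAT, *rec) for rec in records)
record_size = calcsize(FORMAT)

decoded = []
for year, name, country, pop in iter_unpack(FORMAT, data):
    # Strip the null padding before decoding the fixed-width name.
    clean_name = name.split(b'\0', 1)[0].decode('cp437')
    decoded.append((year, clean_name, country.decode('cp437'), pop))

print(record_size, len(data))
print(decoded)
```

The record size depends on the platform's alignment rules, which is why the sketch asks calcsize() instead of hard-coding a number.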
