7 Evolution of Data Types

We often want to pass around multiple related values as arguments to functions and return values.

`dict` and `tuple`

When you first learn Python you encounter two common solutions to this problem:

def http_request() -> tuple[int, str]:
    ...
    return status, response_text

This implicit tuple return is OK for two or maybe three values, but any more than that risks hurting readability and introducing hard-to-spot bugs:

def http_request() -> tuple[int, str]:
    ...
    return status, response_text, headers  # third value added: the annotation and every caller need updating

# now this is broken!
status, response_text = http_request(...)

It’s also easy to mix up ordering:

# oops!
response_text, status = http_request(...)

One solution to this is to use a dict with named fields.

def http_request(...) -> dict:
    return {"content": content, "status": status, "headers": headers}

This removes the risk of order-based bugs, and makes it easier to add fields.

Warning

Remember that dicts are mutable, and accidental mutation is an incredibly common source of bugs in Python. A mutable default argument, for example, is shared between calls:

def mutable_default(k, v, d={}):
    d[k] = v
    return d

mutable_default("a", 1)
mutable_default("b", 2)

{'a': 1, 'b': 2}
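A common way to avoid the shared default is to use None as a sentinel and build a fresh dict inside the function. The sketch below is a variation on the example above; the name safe_default is only for illustration.

```python
def safe_default(k, v, d=None):
    # Create a fresh dict on each call unless the caller supplies one.
    if d is None:
        d = {}
    d[k] = v
    return d

print(safe_default("a", 1))  # {'a': 1}
print(safe_default("b", 2))  # {'b': 2}
```

Each call now gets its own dict, so the second call no longer sees the key added by the first.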

It is also easy to accidentally return dictionaries with different keys in different branches, requiring careful use of dict.get or similar.

def http_request(...) -> dict:
    if error:
        return {"status_code": 404}
    else:
        return {"status_code": 200, "body": "...", "headers": {...}}

response = http_request()
# now using response["body"] requires careful checks
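As a rough sketch of what those careful checks look like, the version below adds a hypothetical error parameter to stand in for the elided arguments, then reads a key that may be missing.

```python
def http_request(error: bool) -> dict:
    # Hypothetical stand-in for the function above: the keys differ by branch.
    if error:
        return {"status_code": 404}
    return {"status_code": 200, "body": "<html>...", "headers": {}}

response = http_request(error=True)

# response["body"] would raise KeyError here, so fall back to a default.
body = response.get("body", "")
print(repr(body))  # ''

# Alternatively, check for the key before using it.
if "headers" in response:
    print(response["headers"])
```

Every caller has to remember these checks, which is part of the motivation for the more structured types that follow.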

`namedtuple`

collections.namedtuple was introduced in Python 2.6. It has been superseded by typing.NamedTuple, which uses type annotation syntax to define a specialized tuple type whose elements have names as well as numeric positions.

from typing import NamedTuple

class Response(NamedTuple):
    status: int
    text: str
    headers: dict[str, str]

These tuple types can be constructed with positional or named args:

resp1 = Response(200, "<html>...", {"content-type": "text/html"})
resp2 = Response(status=404, text="{}", headers={"content-type": "application/json"})

They can also be accessed as attributes or by index:

print(resp1.status, resp1.text, resp1.headers)
print(resp2[0], resp2[1], resp2[2])

200 <html>... {'content-type': 'text/html'}
404 {} {'content-type': 'application/json'}

They are immutable, which works well for parameters, but sometimes one needs a mutable alternative.
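A short sketch of what immutability means in practice, reusing the Response type from above. Assigning to a field fails, and _replace builds a modified copy instead. Note that the immutability is shallow: a mutable value stored in a field can still be changed.

```python
from typing import NamedTuple

class Response(NamedTuple):
    status: int
    text: str
    headers: dict[str, str]

resp = Response(200, "ok", {})

try:
    resp.status = 404  # fields cannot be reassigned
except AttributeError as e:
    print("AttributeError:", e)

# _replace returns a new tuple with the given fields changed.
resp_404 = resp._replace(status=404)
print(resp.status, resp_404.status)  # 200 404

# The tuple is immutable, but the dict it holds is not.
resp.headers["content-type"] = "text/plain"
print(resp.headers)  # {'content-type': 'text/plain'}
```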

classes

An obvious alternative to using a tuple or dict for data that is frequently being passed around together is to write a class.

This comes with some overhead, but also has the advantage of offering the opportunity for custom behavior.

An application managing complex state with dozens of variables is almost always going to settle on one or more classes, but it is often unclear when it is appropriate to introduce a class as opposed to a dict or tuple.
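For comparison with the tuple and dict versions, here is a minimal sketch of the same response data as a hand-written class. The ok property is a hypothetical example of custom behavior, not something defined earlier in these notes.

```python
class Response:
    def __init__(self, status, text, headers=None):
        self.status = status
        self.text = text
        # Avoid a shared mutable default by creating the dict per instance.
        self.headers = headers if headers is not None else {}

    @property
    def ok(self):
        # Custom behavior: treat any 2xx status as success.
        return 200 <= self.status < 300

    def __repr__(self):
        return f"Response(status={self.status!r}, text={self.text!r})"

resp = Response(404, "{}")
print(resp, resp.ok)  # Response(status=404, text='{}') False

resp.status = 200     # instances are mutable by default
print(resp.ok)        # True
```

The overhead is visible: the constructor and repr are written by hand, and each new field has to be added in several places.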

One potential downside is the dynamic nature of classes. The ability to add attributes can lead to bugs, and that isn’t necessary if we know exactly what fields our class is going to have.

`slots`

For these data bundles, one option is to define the __slots__ attribute on a class:

class Point2:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class Point2S:
    # with slots
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x = x
        self.y = y

ptA = Point2(0, 0)
ptA.z = 0 # valid, but probably an error!

try:
    ptB = Point2S(5, 5)
    ptB.z = 0  # rejected: "z" is not listed in __slots__
except AttributeError as e:
    print('AttributeError', e)

AttributeError 'Point2S' object has no attribute 'z' and no __dict__ for setting new attributes

https://wiki.python.org/moin/UsingSlots

`@dataclass`

As type-checking became standardized, the attrs library introduced a new way to think about creating data container classes. This heavily influenced the design of dataclasses, added in Python 3.7.

In their simplest form, they resemble how we declared our NamedTuple before:

from dataclasses import dataclass

@dataclass
class Point2D:
    x: float
    y: float

ptD = Point2D(1, 2)
print(ptD)

ptD.z = 0  # still allowed: a plain dataclass is an ordinary class

Point2D(x=1, y=2)

By default, the dataclass decorator adds __init__, __eq__, and __repr__ methods, which covers the usual needs of a data container.
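To see what the generated methods buy us, the sketch below compares a dataclass with an equivalent hand-written class. PlainPoint is a name introduced here only for the comparison.

```python
from dataclasses import dataclass

@dataclass
class Point2D:
    x: float
    y: float

class PlainPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

a = Point2D(1, 2)
b = Point2D(1, 2)

# Generated __eq__ compares field values, not object identity.
print(a == b)   # True
print(a is b)   # False

# Generated __repr__ shows the field names and values.
print(repr(a))  # Point2D(x=1, y=2)

# Without a custom __eq__, equality falls back to identity.
print(PlainPoint(1, 2) == PlainPoint(1, 2))  # False
```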

The generated class can also be customized through the decorator's parameters. The full signature of the decorator is:

@dataclasses.dataclass(*, init=True, repr=True, eq=True, order=False, unsafe_hash=False, frozen=False, match_args=True, kw_only=False, slots=False, weakref_slot=False)

init - generate constructor
repr - generate repr method
eq - generate __eq__ method using == on all attributes
order - generate ordering methods (<, >, <=, >=)
unsafe_hash - generate hash method even if unsafe (will be generated by default if eq and frozen are true)
frozen - make instances immutable
match_args - generate __match_args__, a class attribute (a tuple of field names) that lets match statements use positional patterns
kw_only - make constructor parameters keyword only
slots - generate __slots__, ensuring no additional attributes are added
weakref_slot - add a __weakref__ slot so instances can be weak-referenced (only valid with slots=True; see documentation)

https://docs.python.org/3/library/dataclasses.html

Despite the name, dataclass creates ordinary classes by default. With frozen and/or slots, it produces classes better suited to packaging data.

`__post_init__`

Dataclasses can also define a “secondary constructor” named __post_init__. If it is defined, the generated __init__ calls self.__post_init__() after assigning the fields.
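One typical use is a validation step that runs once the fields have been assigned. The Rectangle class below is a hypothetical example, not part of the earlier code.

```python
from dataclasses import dataclass

@dataclass
class Rectangle:
    width: float
    height: float

    def __post_init__(self):
        # Runs at the end of the generated __init__, after width and
        # height have been assigned.
        if self.width < 0 or self.height < 0:
            raise ValueError("width and height must be non-negative")

print(Rectangle(2, 3))  # Rectangle(width=2, height=3)

try:
    Rectangle(-1, 3)
except ValueError as e:
    print("ValueError:", e)  # ValueError: width and height must be non-negative
```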

Field Options

Fields on a dataclass are typically just type annotations on class variables. It is also possible to assign a default that will be used in the generated __init__.

Sometimes it is desirable to control more about a field, in which case you assign it the result of calling dataclasses.field():

dataclasses.field(*, default=MISSING, default_factory=MISSING, init=True, repr=True, hash=None, compare=True, metadata=None, kw_only=MISSING, doc=None)

from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    age: int = -1
    aliases: list[str] = field(default_factory=list, repr=False)

f = User("Finn")
j = User("Jake", 30)
r = User("Robert", 30, aliases=["Bob"])
print(f)
print(j)
print(r)

User(name='Finn', age=-1)
User(name='Jake', age=30)
User(name='Robert', age=30)

Dataclasses: Under the Hood

Dataclasses are implemented using the existing metaprogramming machinery we’ve already seen.

Take a look at the implementation: https://github.com/python/cpython/blob/main/Lib/dataclasses.py
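To get a feel for the idea before reading the real source, here is a toy class decorator that builds __init__ and __repr__ from a class's annotations. It is only a sketch: simple_dataclass is a made-up name, it ignores defaults and inheritance, and it uses closures where the real implementation generates and executes function source code.

```python
def simple_dataclass(cls):
    """Toy decorator: derive __init__ and __repr__ from annotations."""
    names = list(getattr(cls, "__annotations__", {}))

    def __init__(self, *args, **kwargs):
        if len(args) > len(names):
            raise TypeError(f"expected at most {len(names)} arguments")
        # Match positional arguments to field names, then add keywords.
        values = dict(zip(names, args))
        for key, value in kwargs.items():
            if key not in names or key in values:
                raise TypeError(f"unexpected or duplicate argument: {key}")
            values[key] = value
        missing = [n for n in names if n not in values]
        if missing:
            raise TypeError(f"missing arguments: {missing}")
        for name in names:
            setattr(self, name, values[name])

    def __repr__(self):
        parts = ", ".join(f"{n}={getattr(self, n)!r}" for n in names)
        return f"{cls.__name__}({parts})"

    # Attach the generated methods to the class and hand it back.
    cls.__init__ = __init__
    cls.__repr__ = __repr__
    return cls

@simple_dataclass
class Point:
    x: float
    y: float

print(Point(1, y=2))  # Point(x=1, y=2)
```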

Pydantic

dataclasses do not validate input. This is in line with how Python typically handles types and type hints, which are not enforced at runtime.

That said, it isn’t uncommon to want validation, especially for data shared over a network. APIs, databases, and data pipelines can all benefit from data validation.

pydantic is a popular library which uses dataclass-like syntax to enable validation:

from datetime import datetime
from pydantic import BaseModel

class User(BaseModel):
    id: int
    name: str = 'John Doe'
    signup_ts: datetime | None = None


try:
    u = User(id="abc", name=123) # oops transposed arguments!
except Exception as e:
    print(e)

2 validation errors for User
id
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='abc', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/int_parsing
name
  Input should be a valid string [type=string_type, input_value=123, input_type=int]
    For further information visit https://errors.pydantic.dev/2.10/v/string_type

How to Choose

aside: performance test

$ uv run perftest.py
Type             Time (ms)   vs class
--------------------------------------
dict                1761.1  (0.93x)
namedtuple          3277.1  (1.73x)
class               1889.8 ◄ baseline
class+slots         2170.6  (1.15x)
dataclass           1855.1  (0.98x)

Note: These differences are minor. It took 10 million runs to see a notable, consistent difference.
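The perftest.py script itself is not shown here. A comparison of this kind could be sketched with timeit as below. The workload (construct one object, read one field), the type names, and the iteration count are all assumptions, so the numbers will not match the table above.

```python
import timeit
from collections import namedtuple
from dataclasses import dataclass

PointNT = namedtuple("PointNT", ["x", "y"])

class PointC:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointS:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x = x
        self.y = y

@dataclass
class PointD:
    x: int
    y: int

def use_dict():
    p = {"x": 1, "y": 2}
    return p["x"]

def use_namedtuple():
    return PointNT(1, 2).x

def use_class():
    return PointC(1, 2).x

def use_slots():
    return PointS(1, 2).x

def use_dataclass():
    return PointD(1, 2).x

candidates = {
    "dict": use_dict,
    "namedtuple": use_namedtuple,
    "class": use_class,
    "class+slots": use_slots,
    "dataclass": use_dataclass,
}

N = 100_000  # far fewer runs than the 10 million mentioned above
results = {name: timeit.timeit(fn, number=N) * 1000 for name, fn in candidates.items()}

baseline = results["class"]
for name, ms in results.items():
    print(f"{name:<12}{ms:>10.1f} ms  ({ms / baseline:.2f}x)")
```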

There should be one– and preferably only one– obvious way to do it. Although that way may not be obvious at first unless you’re Dutch.

Do you need validation? Pydantic
Do you want your type to have methods? dataclasses
Immutable? NamedTuple or dataclass(frozen=True)
Full set of fields not known? dict
Complex constructors not based on attributes? class

dict and tuple

namedtuple