4 Testing in Practice

Important

If you are newer to Python testing, you may want to start here: Testing in Python. This introduces some fundamental ideas behind testing as well as pytest specifics.

Why Testing is Hard

Typically most of the tests written for a piece of software are unit tests. Unit tests target the smallest “unit of behavior”. A single function can have multiple behaviors, for example:

def find_user(db: Connection, name: str) -> User | None:
    try:
        row = db.users.find(name=name)
        return User(row[0], row[1], row[2])
    except NotFound:
        return None

This function would need at least two tests:

test_find_user - the successful case
test_find_user_missing - the unsuccessful case

It is important for unit tests to be independent, repeatable, and readable.

In practice, following these rules is challenging:

What constitutes a behavior is not always clear-cut; functions often perform multiple related behaviors.
If a function contains lots of edge cases, it can require dozens of tests.
Called functions often take complex data structures, sometimes overshadowing the actual test code.
In applications, code often modifies state outside the function: files, databases, the network, etc. This requires setup (and teardown) and risks making tests fragile.

The difficulty of each of these is compounded when code is written without considering how it will be tested. This is one of the key motivations for test-driven development (TDD).

Let’s start by looking at some functions and considering how we’d test them:

Example 1: Random Avatar

def assign_random_avatar(user: User) -> None:
    """ generate a unique random avatar based on username and assign to user """

Tip

Tests: random avatar assigned (varies on username)

Challenges: randomness, no return value

Example 2: Login

def login(username: str, password: str) -> User:
    """ Return User if success or raises LoginError """

What behaviors should exist on this function?
What tests would we write?
What challenges will we encounter with test simplicity and independence?

Tip

Tests: success, bad_username, bad_password

Challenges: Database setup w/ valid user. (mock or fixture)

Example 3: Messaging

def send_message(from_user: str, to_user: str, message: str):
    """
    Uses messaging API service to add message to to_user's inbox from from_user.

    If the to_user has notifications on, they should receive a text message.

    If either user does not exist an error is raised.

    - Messages must be <=200 characters.
    - Only one message can be sent each second.
    """

Tip

Tests: success, bad user, bad message, rate limited, …

Challenges: Database setup again, testing message was sent (mocking!)

This is heading in the direction of an integration test, not a unit test. Do these interrelated systems work properly together? It will be useful to add unit tests of smaller components as well as this integration test.

Example 4: Cache

class FilesystemCache(Cache):
    def get_with_cache(self, url: str) -> Response:
        """
        If url has been seen, return old response, otherwise
        make a new request (using `self._request`) and add to cache.
        """

Tip

Tests: cache hit, cache miss

Challenges: filesystem (mock/helpers), relies on _request (mock)

Writing Testable Code

As we saw above, the signature of a function (and its dependencies) can have a major impact on our ability to test the code.

Avoid Unnecessary State

never mutate arguments
prefer return values
“do one thing”
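
A small illustration of the first two points, using a hypothetical tag-normalizing helper. Both versions do the same work, but only the second can be tested with a single assertion on its return value.

```python
def normalize_tags_in_place(tags: list[str]) -> None:
    """Harder to test: the caller's list is modified and nothing is returned."""
    for i, tag in enumerate(tags):
        tags[i] = tag.strip().lower()


def normalize_tags(tags: list[str]) -> list[str]:
    """Easier to test: the input is left alone and the result is returned."""
    return [tag.strip().lower() for tag in tags]


def test_normalize_tags():
    original = [" Python", "TESTING "]
    assert normalize_tags(original) == ["python", "testing"]
    # The argument is untouched, so the same data can be reused safely.
    assert original == [" Python", "TESTING "]
```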

Proper Decomposition

In the messaging example above, testing send_message is nearly impossible unless the text message is sent through a separate function such as send_sms.

With a send_sms function, we can test the sending logic independently and then mock it when testing send_message.

Dependency Injection

Another approach is to use dependency injection, where these functions (or their containing classes) are passed in by the caller instead of being created or imported internally:

from typing import Any, Protocol


class HTTPClient(Protocol):
    """Typed Protocol for a simple HTTP client."""

    def get(self, url: str, **kwargs: Any) -> "HTTPResponse": ...
    def post(self, url: str, body: Any = None, **kwargs: Any) -> "HTTPResponse": ...


class FilesystemCache(Cache):
    def __init__(self, http_client: HTTPClient):
        # the client is supplied by the caller instead of being created here
        self.http_client = http_client

FilesystemCache no longer provides its own _request, but instead dispatches to http_client.get. This makes testing easier as we can pass in a mocked HTTPClient that performs as we prefer for a given test.

This also makes our code more robust: we have decoupled the behavior of caching from that of making the request. We could now easily swap out requests for httpx or whatever our preferred library is.
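
A sketch of what such a test might look like. This `FilesystemCache` is a simplified stand-in (no `Cache` base class, response bodies as plain strings, an assumed `cache_dir` argument), and `FakeHTTPClient` is a hypothetical test double.

```python
import tempfile
from hashlib import sha256
from pathlib import Path


class FakeHTTPClient:
    """Test double that records requested URLs instead of touching the network."""

    def __init__(self, body: str):
        self.body = body
        self.requested: list[str] = []

    def get(self, url: str, **kwargs) -> str:
        self.requested.append(url)
        return self.body


class FilesystemCache:
    """Simplified sketch: caches response bodies as files in cache_dir."""

    def __init__(self, http_client, cache_dir: Path):
        self.http_client = http_client
        self.cache_dir = cache_dir

    def get_with_cache(self, url: str) -> str:
        path = self.cache_dir / sha256(url.encode()).hexdigest()
        if path.exists():  # cache hit: no request is made
            return path.read_text()
        body = self.http_client.get(url)  # cache miss: use the injected client
        path.write_text(body)
        return body


def test_cache_miss_then_hit():
    with tempfile.TemporaryDirectory() as tmp:
        client = FakeHTTPClient("hello")
        cache = FilesystemCache(client, Path(tmp))

        assert cache.get_with_cache("https://example.com") == "hello"  # miss
        assert cache.get_with_cache("https://example.com") == "hello"  # hit
        # Only the first call reached the client.
        assert client.requested == ["https://example.com"]
```

No patching is needed here: the fake is handed to the constructor like any other argument.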

Fixtures / Parametrize

Often a single behavior is best tested with a handful of similar assertions:

def test_lang_add():
    assert parse_and_execute("3 + 4") == 7
    assert parse_and_execute("3 + -4") == -1
    assert parse_and_execute("0 + 0") == 0

This can be replaced with a parametrized test:

import pytest

@pytest.mark.parametrize("expr, expected", [
    ("3 + 4",   7),
    ("3 + -4", -1),
    ("0 + 0",   0),
])
def test_lang_add(expr, expected):
    assert parse_and_execute(expr) == expected

When the test inputs require some setup, fixtures are helpful:

@pytest.fixture
def db():
    conn = connect("sqlite:///:memory:")
    conn.execute("CREATE TABLE users ...")
    yield conn
    conn.close()

def test_insert(db):
    db.execute("INSERT INTO users ...")
    assert db.execute("SELECT count(*) FROM users").scalar() == 1

def test_empty(db):
    assert db.execute("SELECT count(*) FROM users").scalar() == 0

When to use parametrize

Lots of similar inputs with slight variations, a very short (often one-line) actual test.

When not to use fixtures

When it requires a convoluted test harness that makes the test harder, not easier, to read.

Mocking

Dependency injection is not always feasible or desired.

Let’s imagine we aren’t in a place to refactor this function:

def send_message(from_user: str, to_user: str, message: str):
    # ...
    from_number = get_user_number(from_user)
    to_number = get_user_number(to_user)
    if message_is_valid(message):
        send_sms(from_number, to_number, message)
    # ...

How would you test it without sending actual text messages?

from unittest.mock import patch

def test_send_message():
    with patch("mymodule.send_sms") as mock_sms:
        send_message("alice", "bob", "hello")
        mock_sms.assert_called_once()

This uses a context manager (with) to replace the function mymodule.send_sms with a Mock for the duration of the block.

This is an object that can be treated as a function (Callable) or as an object. It uses metaprogramming techniques (our next topic) to accept any arguments or method calls you make on it. Unless otherwise specified, these calls do nothing and return another Mock.
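
A short demonstration of that default behavior, using `Mock` directly rather than through `patch`:

```python
from unittest.mock import Mock

mock = Mock()

# Calling a Mock with any arguments works; by default it returns another Mock.
result = mock("anything", key="value")
assert isinstance(result, Mock)

# Attribute access also works, creating child mocks on demand.
assert isinstance(mock.some_method(), Mock)

# A specific return value can be configured when the test needs one.
mock.return_value = 42
assert mock() == 42
```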

The final line of the test, mock_sms.assert_called_once(), is a method on the Mock that asserts that the function was called exactly once.

Some additional methods of Mock:

assert_called()
assert_called_once()
assert_called_with(*args, **kwargs) - last call only
assert_called_once_with(*args, **kwargs)
assert_any_call(*args, **kwargs) - matches any call, not only the last
assert_not_called()
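
The difference between these is easiest to see on a mock that has been called more than once. In this sketch, `send` is just a bare `Mock` standing in for a function:

```python
from unittest.mock import Mock

send = Mock()
send("alice", "hi")
send("bob", "hello")

send.assert_called()                     # called at least once
send.assert_called_with("bob", "hello")  # matches the most recent call only
send.assert_any_call("alice", "hi")      # matches any call, not just the last

# assert_called_once() raises AssertionError because there were two calls.
try:
    send.assert_called_once()
except AssertionError:
    called_once = False
else:
    called_once = True
assert called_once is False

assert send.call_count == 2
```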

Providing fake data

In practice, we often want a method to return certain values, such as when we’re trying to avoid a database call:

from unittest.mock import patch

def test_send_message():
    with (patch("mymodule.get_user_number", side_effect=["555-1111", "555-2222"]),
          patch("mymodule.send_sms") as mock_sms):

        send_message("alice", "bob", "hello")
        mock_sms.assert_called_once_with("555-1111", "555-2222", "hello")
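
`side_effect` accepts either a sequence of values to return in order, as above, or an exception to raise, which is handy for simulating failures. A sketch using a hypothetical `notify` wrapper and `SMSError` exception; the mock is passed in directly here so the example does not depend on a module path.

```python
from unittest.mock import Mock


class SMSError(Exception):
    """Hypothetical error raised by the SMS provider."""


def notify(send_sms, number: str, message: str) -> bool:
    """Hypothetical wrapper: returns False instead of propagating SMS failures."""
    try:
        send_sms(number, message)
    except SMSError:
        return False
    return True


def test_notify_reports_failure():
    # An exception instance as side_effect is raised when the mock is called.
    failing_sms = Mock(side_effect=SMSError("provider down"))
    assert notify(failing_sms, "555-1111", "hello") is False
    failing_sms.assert_called_once_with("555-1111", "hello")


def test_side_effect_list_returns_values_in_order():
    lookup = Mock(side_effect=["555-1111", "555-2222"])
    assert lookup("alice") == "555-1111"
    assert lookup("bob") == "555-2222"
```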

from `unittest` to `pytest`

If you’ve learned Python in the past decade, you likely learned testing with pytest, so you may be surprised to learn that Python ships with a testing library built in.

`unittest`

unittest has been part of the standard library since Python 2.1, but has largely fallen out of favor. It is based on a pattern described in Kent Beck’s Simple Smalltalk Testing: With Patterns and popularized in Java.

Let’s look at the first example from the unittest docs:

import unittest

class TestStringMethods(unittest.TestCase):

    def test_upper(self):
        self.assertEqual('foo'.upper(), 'FOO')

    def test_isupper(self):
        self.assertTrue('FOO'.isupper())
        self.assertFalse('Foo'.isupper())

    def test_split(self):
        s = 'hello world'
        self.assertEqual(s.split(), ['hello', 'world'])
        # check that s.split fails when the separator is not a string
        with self.assertRaises(TypeError):
            s.split(2)

if __name__ == '__main__':
    unittest.main()

Which in pytest we’d typically write:

import pytest

def test_upper():
    assert 'foo'.upper() == 'FOO'

def test_isupper():
    assert 'FOO'.isupper()
    assert not 'Foo'.isupper()

def test_split():
    s = 'hello world'
    assert s.split() == ['hello', 'world']
    # check that s.split fails when the separator is not a string
    with pytest.raises(TypeError):
        s.split(2)

The latter is more Pythonic because in Python we avoid creating unnecessary classes where modules will do, and there’s nothing about most tests that requires a class.

Also, instead of custom assert methods we can mostly use the built-in assert, and pytest uses Python’s powerful introspection to give helpful error messages when an assertion fails. Of course, pytest does come with built-in helpers for things that are hard to do with a bare assert, such as pytest.approx for comparing floating-point numbers and pytest.raises for checking exceptions.
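
A brief sketch of two of those helpers, `pytest.approx` and `pytest.raises`; the test names are illustrative only.

```python
import pytest


def test_float_sum_is_close():
    # 0.1 + 0.2 is not exactly 0.3 in binary floating point...
    assert 0.1 + 0.2 != 0.3
    # ...so compare within a tolerance instead.
    assert 0.1 + 0.2 == pytest.approx(0.3)


def test_division_by_zero_raises():
    # match= is a regular expression searched for in the exception message.
    with pytest.raises(ZeroDivisionError, match="by zero"):
        1 / 0
```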

Even when some setup/common methods are required, we can put those in the module, or use other pytest features to do this in a more flexible way than classes allow.

With that in mind, we can think of pytest as one of our core tools as Python developers, and one of the theories of this course is that it is important to know your core tools well. What else does pytest have to offer beyond a test runner with improved test semantics?

Testing Antipatterns

Testing Other People’s Code

from PIL import Image

def make_thumbnail(path: str, size: tuple) -> Image.Image:
    img = Image.open(path)
    img.thumbnail(size)
    return img

# This is a bad test-- it is actually testing Pillow
def test_make_thumbnail():
    result = make_thumbnail("photo.jpg", (128, 128))
    assert result.size == (128, 128)

The example above is in fact simple enough that we may not need a test, but if we did need to test code that relies upon a library we can mock it:

from unittest.mock import patch, MagicMock

def test_make_thumbnail():
    mock_img = MagicMock()
    with patch("mymodule.Image.open", return_value=mock_img):
        result = make_thumbnail("photo.jpg", (128, 128))
        # testing that we called the method properly, not what it returns
        mock_img.thumbnail.assert_called_once_with((128, 128))
        assert result == mock_img

Tests That Can’t Fail

When fixing a bug, the typical advice is to write a failing test first, then fix the bug and confirm that the test now passes.

The step of seeing the test actually fail is important to ensure you are testing (and fixing) the right thing!
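
A common way to end up with a test that cannot fail is to mock the very thing being tested. In this sketch `apply_discount` is a hypothetical function under test, working in integer cents:

```python
from unittest.mock import Mock


def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test."""
    return price_cents * (100 - percent) // 100


def test_discount_cannot_fail():
    # Anti-pattern: the function under test is replaced by a mock, so the
    # assertion only checks the value the test itself configured.
    fake_apply = Mock(return_value=9000)
    assert fake_apply(10000, 10) == 9000


def test_discount_can_fail():
    # Exercises the real code; breaking apply_discount makes this fail.
    assert apply_discount(10000, 10) == 9000
```

Deliberately breaking `apply_discount` and rerunning both tests shows the difference: only the second one goes red.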

Focusing on `coverage` above anything else

coverage and its associated pytest plugin, pytest-cov, are incredibly useful tools.

coverage demo

At the same time, there are two common ways coverage is misused:

Thinking 100% coverage means no bugs.
Fixating on 100% coverage at the expense of everything else.
Coverage cannot tell the difference between pure unit tests & integration tests. If we rely on mocking we may not properly test units of behavior but still reach 100% coverage.
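
A small illustration of the first point, using a hypothetical `average` function. A single test executes every line, yet an obvious bug remains:

```python
def average(values: list[float]) -> float:
    return sum(values) / len(values)


def test_average():
    # This one test executes every line of average(): 100% line coverage.
    assert average([2, 4, 6]) == 4


def test_average_empty_is_unhandled():
    # The empty-list case was never considered and still crashes.
    try:
        average([])
    except ZeroDivisionError:
        crashed = True
    else:
        crashed = False
    assert crashed
```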

Example

For many years, I maintained a large open source project with hundreds of contributors. The project consisted of a data pipeline and 150 or so web scrapers. Our core code was well-tested, but the web scrapers had no tests. Why not?

What would a test of a web scraper look like? What principles would it have a hard time not violating?

Alternative Forms of Testing

Snapshot Testing

import vcr

@vcr.use_cassette("fixtures/get_user.yaml")
def test_get_user_email():
    result = get_user_email(1)
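    assert result == "alice@example.com"

On the first run, vcr records the real HTTP interaction to the cassette file; later runs replay the recording without touching the network. The recorded cassette looks like:

interactions: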
- request:
    body: null
    headers:
      Accept:
      - '*/*'
      User-Agent:
      - python-requests/2.28.0
    method: GET
    uri: https://api.example.com/users/1
  response:
    body:
      string: '{"id": 1, "email": "alice@example.com", "name": "Alice"}'
    headers:
      Content-Type:
      - application/json
    status:
      code: 200
      message: OK
version: 1

`hypothesis`

from hypothesis import given
from hypothesis import strategies as st

@given(st.text())
def test_encode_decode_roundtrip(s):
    assert decode(encode(s)) == s
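
To make the roundtrip property concrete, here is a sketch with a hypothetical run-length `encode`/`decode` pair. It deliberately uses only the standard library's `random` module to generate inputs, as a rough approximation of the idea; it is not hypothesis itself, which additionally searches for edge cases and shrinks failing inputs to minimal examples.

```python
import random
import string


def encode(s: str) -> list[tuple[str, int]]:
    """Run-length encode: 'aab' -> [('a', 2), ('b', 1)]."""
    runs: list[tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def decode(runs: list[tuple[str, int]]) -> str:
    return "".join(ch * count for ch, count in runs)


def test_encode_decode_roundtrip_random():
    rng = random.Random(0)  # fixed seed keeps the test repeatable
    for _ in range(200):
        length = rng.randrange(0, 20)
        s = "".join(rng.choice("ab" + string.digits) for _ in range(length))
        assert decode(encode(s)) == s
```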

Continuous Integration

It is generally accepted that tests should be run on every commit.

Most teams only allow code (at least on main) that is passing all tests.

Tests that aren’t run often aren’t serving their purpose & risk drifting out of sync with the codebase.

pre-commit
GitHub Actions (& equivalents)
