Live Life on the Edge: A Layered Strategy for Testing Data Models
Table of Contents
- 1. About
- 2. The Combinatorial Explosion Problem
- 3. Property-Based Testing with Polyfactory
- 4. Value-Level Testing with Hypothesis
- 5. The Testing Gap: When Models Aren't Enough
- 6. Design by Contract with icontract
- 7. Conclusion
- 8. TLDR
1. About
Figure 1: JPEG produced with DALL-E 4o
Data models live everywhere in a modern system: function signatures, bounded-context boundaries, database schemas, event payloads on the wire, and so on. And for good reason – the model isn't just a schema, it's an executable specification of how the system works. The criticality of data modelling has led to its ubiquity, and you will find a daunting proliferation of models throughout your software system, especially in an enterprise where a bias towards classic integration patterns urges developers to exhaustively model every payload and interface. I call this the 'Model Everywhere Problem'.
But if models are so great, why is their pervasiveness a 'problem'?
Consider that a modest data model – a dozen fields, a few enums and optionals – has thousands of structural states. This set of permissible instances is the model's 'state space', and large state spaces put demands on your testing suite: the number of test cases you need grows with the state space.
On every engineering team I've worked on, I've noticed that these state spaces are almost never tested exhaustively; instead, only a few hardcoded instances of the model are used for testing. Tragically, this causes avoidable issues once the production system begins processing real data with untested edge cases. This is a problematic testing gap, and it is why the state space of a model – not just the happy path of a golden test instance – is the right unit to reason about when you write tests.
[the state space] of a model – not just its happy path – is the right unit to reason about when you write tests.
Using Python as the example language, this post walks through a three-layer strategy that closes that gap, and – importantly – shows where each layer earns its keep and where it doesn't:
- Polyfactory – automates structural partition coverage (every enum value, every optional state) without exploding into a Cartesian product
- Hypothesis – probes value-level edge cases (boundary floats, unicode, NaN) with shrinking
- icontract – enforces cross-field invariants that types and serializable schemas can't express
None of these tools are new. Hypothesis has been around since 2013; the ideas behind icontract (Design by Contract) date to Eiffel in 1986; equivalence partitioning and pairwise testing have been standard since Myers' Art of Software Testing. The contribution in this blog post is the layering pattern: when to reach for which tool, applied to a real scientific data model.
Versions used in this post: Python 3.11, Pydantic 2.x, Polyfactory 2.x, Hypothesis 6.x, icontract 2.x.
2. The Combinatorial Explosion Problem
Even modest models have enormous structural state spaces. Consider a realistic Pydantic model for a spectroscopy reading:
2.1. Growth Analysis
class SpectroscopyReading(BaseModel):
    # Required fields
    reading_id: str
    instrument_id: str

    # Optional fields
    wavelength_nm: Optional[float] = None
    temperature_K: Optional[float] = None
    pressure_atm: Optional[float] = None
    notes: Optional[str] = None

    # Enums
    sample_type: SampleType = SampleType.EXPERIMENTAL           # 3 values
    status: ReadingStatus = ReadingStatus.PENDING               # 4 values: PENDING, VALIDATED, FLAGGED, REJECTED
    instrument_mode: InstrumentMode = InstrumentMode.STANDARD   # 5 values: STANDARD, HIGH_RES, FAST, CALIBRATION, DIAGNOSTIC

    # Booleans
    is_validated: bool = True
    requires_review: bool = False
    is_replicate: bool = False
Now let's compute the combinations:
| Field | States |
|---|---|
| reading_id | 1 |
| instrument_id | 1 |
| wavelength_nm | 2 |
| temperature_K | 2 |
| pressure_atm | 2 |
| notes | 2 |
| sample_type | 3 |
| status | 4 |
| instrument_mode | 5 |
| is_validated | 2 |
| requires_review | 2 |
| is_replicate | 2 |
\[ \text{Combinations} = 1 \times 1 \times 2^4 \times 3 \times 4 \times 5 \times 2^3 = 16 \times 60 \times 8 = 7,680 \]
A handful of realistic fields produces 7,680 combinations. And this is just structural – we haven't even considered value-level edge cases (negative wavelengths, sub-absolute-zero temperatures, NaN concentrations) yet.
2.2. A Caveat: Nobody Tests the Cartesian Product
Each tool answers a question the previous one couldn't.
Before going further, the obvious objection: 7,680 combinations is not 7,680 tests you need to write. Equivalence partitioning (Myers, 1979) and combinatorial / pairwise testing (NIST has decades of work on $t$-way coverage) tell us that most "combinations" share a code path, and that 1-way and 2-way coverage catch the overwhelming majority of interaction bugs. The 7,680 number is the state space a tester has to reason about, not the number of cases to enumerate.
The point of the tools below is not brute force. Polyfactory's coverage() is essentially automated 1-way (every value of every field is exercised once); Hypothesis adds value-level probing where partition boundaries actually matter; icontract adds the cross-field rules that no equivalence partition can express. Each tool answers a question the previous one couldn't.
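To make that distinction concrete, here is a quick back-of-the-envelope calculation (my own illustration, not any library's API): the full Cartesian product of the SpectroscopyReading fields is 7,680, while 1-way coverage needs only as many instances as the largest field domain.

import math

# Structural states per field (from the table above)
field_states = {
    "wavelength_nm": 2, "temperature_K": 2, "pressure_atm": 2, "notes": 2,  # optionals
    "sample_type": 3, "status": 4, "instrument_mode": 5,                    # enums
    "is_validated": 2, "requires_review": 2, "is_replicate": 2,             # booleans
}

full_product = math.prod(field_states.values())  # every combination
one_way = max(field_states.values())             # 1-way (lockstep-style) coverage
print(full_product, one_way)                     # 7680 5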
3. Property-Based Testing with Polyfactory
Polyfactory is a library that generates mock data from Pydantic models (and other schemas). Instead of hand-writing test fixtures, you define a factory and let polyfactory generate valid instances.
3.1. Basic Usage: The Build Method
The build() method creates a single instance with randomly generated values that satisfy your model's constraints:
from polyfactory.factories.pydantic_factory import ModelFactory
from pydantic import BaseModel
from typing import Optional
from enum import Enum

class SampleType(str, Enum):
    CONTROL = "control"
    EXPERIMENTAL = "experimental"
    CALIBRATION = "calibration"

class Sample(BaseModel):
    sample_id: str
    experiment_id: str
    concentration_mM: Optional[float] = None
    sample_type: SampleType = SampleType.EXPERIMENTAL
    is_validated: bool = True

class SampleFactory(ModelFactory):
    __model__ = Sample

# Generate a random valid sample
sample = SampleFactory.build()
print(sample)

# Override specific fields
control_sample = SampleFactory.build(sample_type=SampleType.CONTROL, is_validated=True)
sample_id='aqgvWnszGSMJWogQRmWa' experiment_id='zPxKVxoDzwanDbZmhDNL' concentration_mM=None sample_type=<SampleType.CALIBRATION: 'calibration'> is_validated=False
Every call to build() gives you a valid instance. This is already powerful for unit tests where you need realistic test data without hand-crafting it.
3.2. Systematic Coverage: The Coverage Method
Here's where polyfactory really shines. The coverage() method generates multiple instances designed to cover all the structural variations of your model:
# Generate instances covering all structural variations
for sample in SampleFactory.coverage():
    print(f"type={sample.sample_type}, conc={'set' if sample.concentration_mM is not None else 'None'}, valid={sample.is_validated}")
type=SampleType.CONTROL, conc=set, valid=True
type=SampleType.EXPERIMENTAL, conc=None, valid=True
type=SampleType.CALIBRATION, conc=set, valid=False
The coverage() method systematically generates instances, but notice something important: we only got 3 instances, not the 12 (2 optional states × 3 enum values × 2 boolean states) that Sample's structural state space contains. This is by design.
3.2.1. How coverage() Actually Works
Polyfactory's coverage() uses a lockstep algorithm rather than a full Cartesian product. Here's how it works:
- Each field gets a CoverageContainer that holds all possible values for that field (enum members, True/False for booleans, value/None for optionals)
- All containers advance in parallel on each iteration – the first instance picks the first value from every container, the second instance picks the second from every container, and so on. Containers shorter than the longest one wrap around
- Iteration stops when the longest container has been fully consumed – meaning every individual value of every field has been seen at least once
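To make the lockstep idea concrete, here is a minimal sketch of the iteration scheme (my own illustration of the idea, not Polyfactory's implementation – the exact value pairings Polyfactory produces may differ):

from itertools import cycle

def lockstep(field_values: dict[str, list]) -> list[dict]:
    """Advance every field's value list in parallel, wrapping the shorter
    lists, until the longest list has been consumed exactly once."""
    longest = max(len(values) for values in field_values.values())
    iterators = {name: cycle(values) for name, values in field_values.items()}
    return [{name: next(it) for name, it in iterators.items()} for _ in range(longest)]

# Three instances: one per value of the longest container (sample_type)
for instance in lockstep({
    "sample_type": ["control", "experimental", "calibration"],
    "concentration_mM": [1.0, None],
    "is_validated": [True, False],
}):
    print(instance)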
This produces a representative sample that guarantees:
- Every enum value appears at least once
- Both True and False appear for boolean fields
- Both present and None states appear for optional fields
But it does not guarantee every combination is tested. In our 3 instances:
| sample_type | concentration_mM | is_validated |
|---|---|---|
| CONTROL | set | True |
| EXPERIMENTAL | None | True |
| CALIBRATION | set | False |
All enum values are covered. Both optional states (set/None) appear. Both boolean states appear. But we didn't test CONTROL with is_validated=False, for example. This is the deliberate trade-off: coverage() guarantees every individual value of every field is exercised, but misses interaction bugs that only manifest with specific combinations. For validation logic where fields are processed independently, that's enough. For complex interactions, supplement with targeted cases (see the sketch below) or reach for Hypothesis.
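If a specific interaction matters, don't rely on coverage() to stumble onto it – spell it out. A minimal sketch of a targeted 2-way check over sample_type × is_validated, assuming the Sample and SampleFactory definitions from above:

import itertools
import pytest

@pytest.mark.parametrize(
    "sample_type,is_validated",
    list(itertools.product(SampleType, [True, False])),  # all 3 × 2 = 6 pairs
)
def test_sample_type_validation_pairs(sample_type, is_validated):
    sample = SampleFactory.build(sample_type=sample_type, is_validated=is_validated)
    # Put your actual interaction rule here; as a placeholder, the pair round-trips intact
    assert (sample.sample_type, sample.is_validated) == (sample_type, is_validated)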
3.3. A Practical Example
Let's say we have a function that determines analysis priority based on sample attributes:
def determine_priority(sample: Sample) -> str:
    """Determine analysis priority based on sample type and validation status."""
    # Calibration samples are always high priority
    if sample.sample_type == SampleType.CALIBRATION:
        return "high"

    # Unvalidated samples need review first
    if not sample.is_validated:
        raise ValueError("Sample must be validated before analysis")

    # Control samples with known concentration get medium priority
    if sample.sample_type == SampleType.CONTROL and sample.concentration_mM is not None:
        return "medium"

    return "normal"
We can test this systematically using coverage():
import pytest

def test_priority_all_sample_variations():
    """Test priority determination across all sample variations."""
    results = []
    for sample in SampleFactory.coverage():
        # Calibration samples are high priority regardless of validation status
        if sample.sample_type == SampleType.CALIBRATION:
            priority = determine_priority(sample)
            if priority == "high":
                results.append(f"PASS: {sample.sample_type.value}, validated={sample.is_validated} -> {priority}")
            else:
                results.append(f"FAIL: {sample.sample_type.value} expected 'high', got '{priority}'")
        elif not sample.is_validated:
            try:
                determine_priority(sample)
                results.append(f"FAIL: {sample.sample_type.value}, validated={sample.is_validated} - expected ValueError")
            except ValueError:
                results.append(f"PASS: {sample.sample_type.value}, validated={sample.is_validated} - correctly raised ValueError")
        else:
            priority = determine_priority(sample)
            if priority in ["high", "medium", "normal"]:
                results.append(f"PASS: {sample.sample_type.value}, validated={sample.is_validated} -> {priority}")
            else:
                results.append(f"FAIL: {sample.sample_type.value} got invalid priority '{priority}'")
    return results

# Run the test and display results
print("Testing priority determination across all sample variations:")
print("-" * 60)
for result in test_priority_all_sample_variations():
    print(result)
print("-" * 60)
print("All variations tested!")
With a simple call to SampleFactory.coverage(), this single test exercises every individual field value of our model. If we add new enum values or optional fields later, the test automatically expands to cover them.
3.4. The Reusable Fixture Pattern
Here's where things get powerful. We can create a reusable pytest fixture that applies this coverage-based testing pattern to any Pydantic model:
import pytest
from typing import Type, Iterator, TypeVar
from pydantic import BaseModel
from polyfactory.factories.pydantic_factory import ModelFactory

T = TypeVar("T", bound=BaseModel)

def create_factory(model: Type[T]) -> Type[ModelFactory[T]]:
    """Dynamically create a factory for any Pydantic model."""
    return type(f"{model.__name__}Factory", (ModelFactory,), {"__model__": model})

@pytest.fixture
def model_coverage(request) -> Iterator[BaseModel]:
    """
    Reusable fixture that yields all structural variations of a model.

    Usage:
        @pytest.mark.parametrize("model_class", [Sample, Measurement, Experiment])
        def test_serialization(model_coverage, model_class):
            for instance in model_coverage:
                assert instance.model_dump_json()
    """
    model_class = request.param
    factory = create_factory(model_class)
    yield from factory.coverage()

# Now testing ANY model is trivial:
@pytest.mark.parametrize("model_class", [Sample, SpectroscopyReading, Experiment])
def test_all_models_serialize(model_class):
    """Every model variation must serialize to JSON."""
    factory = create_factory(model_class)
    for instance in factory.coverage():
        json_str = instance.model_dump_json()
        restored = model_class.model_validate_json(json_str)
        assert restored == instance
This pattern is massively scalable. Add a new model to your codebase? Just add it to the parametrize list and you instantly get full structural coverage. The investment in the pattern pays dividends as your codebase grows.
4. Value-Level Testing with Hypothesis
Polyfactory handles structural variations, but what about value-level edge cases? What happens when wavelength_nm is 0, or negative, or larger than the observable universe? This is where Hypothesis comes in.
Hypothesis is a property-based testing library. Instead of specifying exact test cases, you describe properties that should hold for any valid input, and Hypothesis generates hundreds of random inputs to try to break your code.
4.1. The @given Decorator
The @given decorator tells Hypothesis what kind of data to generate:
from hypothesis import given, strategies as st, settings

@given(st.integers())
@settings(max_examples=10)  # Limit for demo
def test_absolute_value_is_non_negative(n):
    """Property: absolute value is always >= 0"""
    assert abs(n) >= 0

@given(st.text())
@settings(max_examples=10)  # Limit for demo
def test_string_reversal_is_reversible(s):
    """Property: reversing twice gives original"""
    assert s[::-1][::-1] == s

# Run the tests and show output
print("Running Hypothesis tests:")
print("-" * 60)

try:
    test_absolute_value_is_non_negative()
    print("PASS: test_absolute_value_is_non_negative - all generated integers passed")
except AssertionError as e:
    print(f"FAIL: test_absolute_value_is_non_negative - {e}")

try:
    test_string_reversal_is_reversible()
    print("PASS: test_string_reversal_is_reversible - all generated strings passed")
except AssertionError as e:
    print(f"FAIL: test_string_reversal_is_reversible - {e}")

print("-" * 60)
By default Hypothesis generates ~100 integers/strings per test run (capped at 10 above for the demo), including edge cases like 0, negative numbers, empty strings, unicode, etc.
4.2. The Chaos Hypothesis Unleashes
Hypothesis doesn't just generate "normal" test data – it actively tries to break your code with the most cursed inputs imaginable. Real-world string inputs are reliably worse than you imagine.
@given(st.text())
def test_sample_notes_field(notes: str):
    """What could go wrong with a simple notes field?"""
    # Assumes Sample has an Optional[str] notes field (not shown in the earlier definition)
    # and process_sample is whatever handler consumes it.
    sample = Sample(
        sample_id="test-001",
        experiment_id="exp-001",
        notes=notes  # Oh no.
    )
    process_sample(sample)
Hypothesis will helpfully try:
- notes="" – The empty string. Classic.
- notes="\x00\x00\x00" – Null bytes. Because why not?
- notes="🧪🔬🧬💉" – Your sample notes are now emoji. The lab notebook of the future.
- notes="a" * 10_000_000 – Ten million 'a's. Hope you're not logging this.
- notes="\n\n\n\n\n" – Just vibes (and newlines).
- notes="ñoño" – Unicode normalization enters the chat.
- notes="🏳️‍🌈" – A single "character" that's actually 4 code points (white flag + variation selector + ZWJ + rainbow). Grapheme clusters: surprise!
Your function either handles these gracefully or you discover bugs you never knew you had. Usually the latter.
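Once Hypothesis surfaces one of these, pin it so every future run re-checks it. A minimal sketch using Hypothesis's @example decorator (again assuming Sample has a notes field and process_sample is your handler):

from hypothesis import example, given, strategies as st

@given(st.text())
@example("")              # the empty string, forever
@example("\x00\x00")      # the null bytes that bit us once
@example("🏳️‍🌈")            # the multi-code-point grapheme cluster
def test_sample_notes_field_regressions(notes: str):
    sample = Sample(sample_id="test-001", experiment_id="exp-001", notes=notes)
    process_sample(sample)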
4.3. Combining Hypothesis with Pydantic
The real power comes from combining Hypothesis with our data models. Hypothesis has a from_type() strategy that can generate instances of Pydantic models:
from hypothesis import given, strategies as st
from hypothesis import settings

@given(st.from_type(Sample))
@settings(max_examples=20)  # Reduced for demo output
def test_sample_serialization_roundtrip(sample: Sample):
    """Property: serializing and deserializing preserves data"""
    json_str = sample.model_dump_json()
    restored = Sample.model_validate_json(json_str)
    assert restored == sample

# Run and show output
print("Testing Sample serialization roundtrip with Hypothesis:")
print("-" * 60)
try:
    test_sample_serialization_roundtrip()
    print("PASS: All 20 generated Sample instances serialized correctly")
except AssertionError as e:
    print(f"FAIL: {e}")
except Exception as e:
    print(f"ERROR: {type(e).__name__}: {e}")
print("-" * 60)
This test generates random valid Sample instances and verifies that JSON serialization works correctly for all of them.
4.4. Custom Strategies for Domain Constraints
Sometimes we need more control over generated values. In scientific domains, this is critical – our data has physical meaning, and randomly generated values often violate physical laws.
Let me show you what I mean with spectroscopy data:
from hypothesis import given, strategies as st, assume

# Strategy for wavelengths (must be positive, typically 200-1100nm for UV-Vis)
valid_wavelength = st.floats(min_value=200.0, max_value=1100.0, allow_nan=False)

# Strategy for temperature (above absolute zero, below plasma)
valid_temperature = st.floats(min_value=0.001, max_value=10000.0, allow_nan=False)

# Strategy for concentration (non-negative, physically reasonable)
valid_concentration = st.one_of(
    st.none(),
    st.floats(min_value=0.0, max_value=1000.0, allow_nan=False)  # millimolar
)

# Strategy for pressure (vacuum to high pressure, in atmospheres)
valid_pressure = st.floats(min_value=0.0, max_value=1000.0, allow_nan=False)

# Composite strategy with inter-field constraints
@st.composite
def spectroscopy_reading_strategy(draw):
    """Generate physically plausible spectroscopy readings."""
    wavelength = draw(valid_wavelength)
    pressure = draw(valid_pressure)

    # Domain constraint: at very low pressure, temperature readings are unreliable
    # (this is a real thing in vacuum spectroscopy!)
    # Draw temperature *conditionally* on pressure rather than drawing-then-filtering,
    # so we don't burn Hypothesis's example budget on rejected draws.
    if pressure < 0.01:
        temperature = draw(st.floats(min_value=100.0, max_value=10000.0, allow_nan=False))
    else:
        temperature = draw(valid_temperature)

    return SpectroscopyReading(
        reading_id=draw(st.text(min_size=1, max_size=50).filter(str.strip)),
        instrument_id=draw(st.sampled_from(["UV-1800", "FTIR-4600", "Raman-532"])),
        wavelength_nm=wavelength,
        temperature_K=temperature,
        pressure_atm=pressure,
        sample_type=draw(st.sampled_from(SampleType)),
        is_validated=draw(st.booleans())
    )

@given(spectroscopy_reading_strategy())
def test_reading_within_physical_bounds(reading: SpectroscopyReading):
    """Property: all readings must be physically plausible"""
    if reading.wavelength_nm is not None:
        assert reading.wavelength_nm > 0, "Negative wavelength is not a thing"
    if reading.temperature_K is not None:
        assert reading.temperature_K > 0, "Below absolute zero? Bold claim."
The key insight here is that scientific data has semantic constraints that go beyond type checking. A float can hold any value, but a wavelength of -500nm or a temperature of -273K is physically impossible. Custom strategies let us encode this domain knowledge.
4.5. Shrinking: Finding Minimal Failing Cases
One of Hypothesis's killer features is shrinking. When it finds a failing test case, it automatically simplifies it to find the minimal example that still fails. Instead of a failing case like:
SpectroscopyReading(reading_id='xK8jP2mQrS...', wavelength_nm=847293.7, temperature_K=9999.9, ...)
Hypothesis will shrink it to something like:
SpectroscopyReading(reading_id='a', wavelength_nm=1101.0, temperature_K=0.0, ...)
This makes debugging much easier – you immediately see that wavelength_nm=1101.0 (just outside our UV-Vis range) is the problem, not the giant random string.
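You can watch shrinking happen with a deliberately broken property – a minimal sketch (the bound is arbitrary, chosen only to make the test fail):

from hypothesis import given, strategies as st

@given(st.floats(min_value=0.0, max_value=2000.0, allow_nan=False))
def test_wavelength_fits_uv_vis(wavelength_nm: float):
    # Deliberately wrong: UV-Vis only covers up to ~1100 nm, but the strategy goes to 2000
    assert wavelength_nm <= 1100.0

# Under pytest, Hypothesis prints a "Falsifying example" and shrinks it towards the
# boundary, so the report shows a value just above 1100.0 rather than an arbitrary
# large float it happened to draw first.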
4.6. What Hypothesis Actually Probes
The reason Hypothesis finds bugs naive random sampling misses isn't volume – it's targeting. Under the hood, st.floats() biases its draws toward values that historically break code: boundary values (min, max, just inside, just outside), zero, negative zero, the smallest positive subnormal, inf, -inf, and NaN when allowed. st.text() biases toward the empty string, single characters, surrogate pairs, and combining marks. Each strategy carries a "this is what bugs look like in this domain" prior.
Combine that with shrinking and you get the property-based testing loop: explore aggressively, fail fast, then minimise the failure to something a human can read in one line. Naive uniform random gives you neither.
4.7. When Hypothesis Isn't the Right Tool
Hypothesis is not 'free'. Specific cases where it is less likely to earn its keep:
- Stateful workflows with expensive setup. If each example needs a fresh database, an HTTP fixture, or a multi-second container, 100 examples per test blows your CI budget. Use targeted cases or @reproduce_failure for regression pinning instead.
- Impure code with hidden state. Shrinking assumes failures are reproducible from the shrunk input alone. If the failure depends on global mutable state, you get confusing minimised cases that don't actually fail in isolation.
- Properties you can't actually state. "The function should return the right answer" is not a property. If the only oracle you have is another implementation, you have differential testing, not property testing – and that's a different (still useful) technique.
- When a typed total function would do. A pure function with a tight signature and equivalence partitioning may not need a property at all. Don't reach for Hypothesis as a status symbol.
5. The Testing Gap: When Models Aren't Enough
We've covered structural combinations with polyfactory and value-level edge cases with Hypothesis. This is powerful, but there's still a gap: runtime invariants that can't be expressed in the type system.
Consider this example from analytical chemistry:
class CalibrationCurve(BaseModel):
    readings: list[CalibrationPoint]
    r_squared: float
    slope: float
    intercept: float

    @field_validator('r_squared')
    @classmethod
    def validate_r_squared(cls, v):
        if not 0 <= v <= 1:
            raise ValueError('R² must be between 0 and 1')
        return v
Pydantic validates that r_squared is between 0 and 1. But what about this invariant?
The r_squared must be calculated from the actual readings using the slope and intercept.
This is a cross-field constraint – it depends on the relationship between multiple fields. And it's not just about validation at construction time. What if r_squared gets calculated incorrectly in our curve-fitting logic?
5.1. Scientific Logic Errors
Consider this function:
def recalculate_curve(curve: CalibrationCurve, new_reading: CalibrationPoint) -> CalibrationCurve:
    """Add a new calibration point and recalculate the curve."""
    all_readings = curve.readings + [new_reading]
    slope, intercept, r_squared = fit_linear_regression(all_readings)

    # BUG: accidentally swapped slope and intercept
    return CalibrationCurve(
        readings=all_readings,
        r_squared=r_squared,
        slope=intercept,    # BUG: wrong assignment!
        intercept=slope     # BUG: wrong assignment!
    )
This code has a subtle bug: the slope and intercept are swapped. Each field individually is a valid float, so Pydantic validation passes. But any concentration calculated from this curve will be wildly wrong.
Our Pydantic validation passes because each field is individually valid. Our Hypothesis tests might not catch this because they test properties at the data structure level, not scientific invariants.
5.1.1. Why Not Pydantic Validators?
You might be thinking: "Can't we add a @model_validator to Pydantic that checks if r_squared matches the fit?" Technically, yes:
class CalibrationCurve(BaseModel):
    # ... fields ...

    @model_validator(mode='after')
    def validate_r_squared_consistency(self) -> Self:
        # Check that r_squared matches the actual fit
        calculated_r2 = compute_r_squared(self.readings, self.slope, self.intercept)
        if abs(self.r_squared - calculated_r2) > 0.001:
            raise ValueError("R² doesn't match the fit")
        return self
But this approach has a significant drawback: custom validators don't serialize to standard schema formats¹.
In data engineering, your Pydantic models often need to export schemas for:
- Avro (schema registries for Kafka)
- JSON Schema (API documentation, OpenAPI specs)
- Protobuf (gRPC services)
- Database DDL (SQLAlchemy models, migrations)
These formats support type constraints and basic validation (nullable, enums, numeric ranges), but they have no way to represent arbitrary Python code like "R² must be computed from readings using least-squares regression."
Embedding complex validation logic in your model validators means:
- The schema your consumers see is incomplete – it shows the fields but not the invariants
- Other systems can't validate data independently – they must call your Python code
- Schema evolution becomes fragile – changes to validation logic don't appear in schema diffs
By keeping Pydantic models "schema-clean" (only expressing constraints that can be serialized) and putting cross-field rules into runtime contracts (see below), you at least keep your serialized schema honest about what it does and doesn't enforce.
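You can see that opacity directly by exporting the schema. A minimal sketch using Pydantic's model_json_schema() on the CalibrationCurve above – the cross-field R² rule leaves no trace in the export:

import json

schema = CalibrationCurve.model_json_schema()
print(json.dumps(schema, indent=2))

# Field names and types are there; numeric ranges appear only if declared via
# Field(ge=..., le=...). The model_validator's "R² must match the fit" rule is
# nowhere in the output.
assert "validate_r_squared_consistency" not in json.dumps(schema)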
This is where Design by Contract comes in.
6. Design by Contract with icontract
icontract brings Design by Contract (DbC) to Python. DbC is a methodology where you specify:
- Preconditions: What must be true before a function runs
- Postconditions: What must be true after a function runs
- Invariants: What must always be true about a class
If any condition is violated at runtime, you get an immediate, informative error.
6.1. Preconditions with @require
Preconditions specify what callers must guarantee:
import icontract

@icontract.require(lambda curve: len(curve.readings) >= 2,
                   "Need at least 2 points to fit a curve")
@icontract.require(lambda new_reading: new_reading.concentration >= 0,
                   "Concentration must be non-negative")
def recalculate_curve(curve: CalibrationCurve, new_reading: CalibrationPoint) -> CalibrationCurve:
    """Add a new calibration point and recalculate the curve."""
    all_readings = curve.readings + [new_reading]
    slope, intercept, r_squared = fit_linear_regression(all_readings)
    return CalibrationCurve(
        readings=all_readings,
        r_squared=r_squared,
        slope=slope,
        intercept=intercept
    )
If someone calls recalculate_curve with only one reading, they get an immediate ViolationError explaining which precondition failed.
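What that looks like at the call site – a minimal sketch, assuming the plain CalibrationCurve and CalibrationPoint models from earlier:

import icontract

one_point_curve = CalibrationCurve(
    readings=[CalibrationPoint(concentration=1.0, response=2.5)],
    r_squared=1.0, slope=2.5, intercept=0.0,
)

try:
    recalculate_curve(one_point_curve, CalibrationPoint(concentration=2.0, response=5.0))
except icontract.ViolationError as err:
    print(err)  # message names the failed precondition: "Need at least 2 points to fit a curve"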
6.2. Postconditions with @ensure
Postconditions specify what the function guarantees to return:
import icontract

@icontract.ensure(lambda result: 0 <= result.r_squared <= 1,
                  "R² must be between 0 and 1")
@icontract.ensure(
    lambda curve, result: len(result.readings) == len(curve.readings) + 1,
    "Result must have exactly one more reading"
)
@icontract.ensure(
    lambda result: result.slope != 0 or all(abs(r.response - result.intercept) < 1e-9 for r in result.readings),
    "Zero slope only valid if all responses equal intercept"
)
def recalculate_curve(curve: CalibrationCurve, new_reading: CalibrationPoint) -> CalibrationCurve:
    """Add a new calibration point and recalculate the curve."""
    all_readings = curve.readings + [new_reading]
    slope, intercept, r_squared = fit_linear_regression(all_readings)
    return CalibrationCurve(
        readings=all_readings,
        r_squared=r_squared,
        slope=slope,
        intercept=intercept
    )
Now if our function produces an invalid result – even if it passes Pydantic validation – we catch it immediately.
6.3. Class Invariants with @invariant
For data models, class invariants are particularly powerful. They specify properties that must always hold:
import icontract
from pydantic import BaseModel
import numpy as np

def r_squared_matches_fit(self) -> bool:
    """Invariant: R² must be consistent with actual readings and coefficients."""
    if len(self.readings) < 2:
        return True  # Can't verify with insufficient data

    concentrations = np.array([r.concentration for r in self.readings])
    responses = np.array([r.response for r in self.readings])
    predicted = self.slope * concentrations + self.intercept

    ss_res = np.sum((responses - predicted) ** 2)
    ss_tot = np.sum((responses - np.mean(responses)) ** 2)
    calculated_r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 1.0

    return abs(self.r_squared - calculated_r2) < 0.001  # Allow for floating point

@icontract.invariant(lambda self: len(self.readings) >= 2,
                     "Calibration needs at least 2 points")
@icontract.invariant(r_squared_matches_fit,
                     "R² must match actual fit quality")
class CalibrationCurve(BaseModel):
    readings: list[CalibrationPoint]
    r_squared: float
    slope: float
    intercept: float

    class Config:
        arbitrary_types_allowed = True
Now any CalibrationCurve instance that violates our scientific invariant will raise an error immediately – whether it's created directly, returned from a function, or modified anywhere in the system.
6.4. A Complete Example
Let's put it all together with a realistic example from a quality control workflow:
import icontract
from pydantic import BaseModel, field_validator
from typing import Optional
from enum import Enum
from datetime import datetime
import numpy as np

class QCStatus(str, Enum):
    PENDING = "pending"
    VALIDATED = "validated"
    FLAGGED = "flagged"
    REJECTED = "rejected"
    APPROVED = "approved"

class CalibrationPoint(BaseModel):
    concentration: float
    response: float
    replicate: int = 1

    @field_validator('concentration')
    @classmethod
    def validate_concentration(cls, v):
        if v < 0:
            raise ValueError('Concentration must be non-negative')
        return v

def r_squared_is_consistent(self) -> bool:
    """Invariant: R² must match the actual fit."""
    if len(self.readings) < 2:
        return True
    conc = np.array([r.concentration for r in self.readings])
    resp = np.array([r.response for r in self.readings])
    pred = self.slope * conc + self.intercept
    ss_res = np.sum((resp - pred) ** 2)
    ss_tot = np.sum((resp - np.mean(resp)) ** 2)
    calc_r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 1.0
    return abs(self.r_squared - calc_r2) < 0.001

def approved_has_good_r_squared(self) -> bool:
    """Invariant: approved curves must have R² >= 0.99."""
    if self.status == QCStatus.APPROVED:
        return self.r_squared >= 0.99
    return True

@icontract.invariant(lambda self: len(self.readings) >= 2, "Need at least 2 calibration points")
@icontract.invariant(r_squared_is_consistent, "R² must match actual fit quality")
@icontract.invariant(approved_has_good_r_squared, "Approved curves need R² >= 0.99")
class CalibrationCurve(BaseModel):
    curve_id: str
    analyst_id: str
    readings: list[CalibrationPoint]
    slope: float
    intercept: float
    r_squared: float
    status: QCStatus = QCStatus.PENDING
    reviewer_notes: Optional[str] = None
    created_at: datetime

    class Config:
        arbitrary_types_allowed = True

# Function with contracts
@icontract.require(lambda curve: curve.status == QCStatus.PENDING,
                   "Can only validate pending curves")
@icontract.ensure(lambda result: result.status in [QCStatus.VALIDATED, QCStatus.FLAGGED],
                  "Validation must result in validated or flagged status")
def validate_curve(curve: CalibrationCurve) -> CalibrationCurve:
    """Validate a calibration curve based on R² threshold."""
    new_status = QCStatus.VALIDATED if curve.r_squared >= 0.99 else QCStatus.FLAGGED
    return CalibrationCurve(
        curve_id=curve.curve_id,
        analyst_id=curve.analyst_id,
        readings=curve.readings,
        slope=curve.slope,
        intercept=curve.intercept,
        r_squared=curve.r_squared,
        status=new_status,
        reviewer_notes=curve.reviewer_notes,
        created_at=curve.created_at
    )

@icontract.require(lambda curve: curve.status == QCStatus.VALIDATED,
                   "Can only approve validated curves")
@icontract.require(lambda reviewer_notes: reviewer_notes and reviewer_notes.strip(),
                   "Reviewer notes required for approval")
@icontract.ensure(lambda result: result.status == QCStatus.APPROVED,
                  "Curve must be approved after approval")
@icontract.ensure(lambda result: result.reviewer_notes is not None,
                  "Reviewer notes must be set")
def approve_curve(curve: CalibrationCurve, reviewer_notes: str) -> CalibrationCurve:
    """Approve a validated calibration curve."""
    return CalibrationCurve(
        curve_id=curve.curve_id,
        analyst_id=curve.analyst_id,
        readings=curve.readings,
        slope=curve.slope,
        intercept=curve.intercept,
        r_squared=curve.r_squared,
        status=QCStatus.APPROVED,
        reviewer_notes=reviewer_notes,
        created_at=curve.created_at
    )
With this setup:
- You cannot create a CalibrationCurve that violates any invariant
- You cannot call validate_curve on a non-pending curve
- You cannot call approve_curve without reviewer notes
- If any function returns an invalid CalibrationCurve, you get an immediate error
6.5. Combining Everything
The real power comes from combining all three approaches. Here's a complete test file that demonstrates all three techniques working together:
""" Integration tests demonstrating Polyfactory, Hypothesis, and icontract together. This file is tangled from post-data-model-testing.org and can be run with: pytest test_data_model_integration.py -v """ from typing import Optional from enum import Enum from datetime import datetime import numpy as np import icontract import pytest from pydantic import BaseModel, field_validator from polyfactory.factories.pydantic_factory import ModelFactory from polyfactory import Use from hypothesis import given, strategies as st, settings # ============================================================================= # DOMAIN MODELS (with icontract invariants) # ============================================================================= class QCStatus(str, Enum): PENDING = "pending" VALIDATED = "validated" FLAGGED = "flagged" REJECTED = "rejected" APPROVED = "approved" class CalibrationPoint(BaseModel): concentration: float response: float replicate: int = 1 @field_validator('concentration') @classmethod def validate_concentration(cls, v): if v < 0: raise ValueError('Concentration must be non-negative') return v def r_squared_is_consistent(self) -> bool: """Invariant: R² must match the actual fit.""" if len(self.readings) < 2: return True conc = np.array([r.concentration for r in self.readings]) resp = np.array([r.response for r in self.readings]) pred = self.slope * conc + self.intercept ss_res = np.sum((resp - pred) ** 2) ss_tot = np.sum((resp - np.mean(resp)) ** 2) calc_r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 1.0 return abs(self.r_squared - calc_r2) < 0.001 def approved_has_good_r_squared(self) -> bool: """Invariant: approved curves must have R² >= 0.99.""" if self.status == QCStatus.APPROVED: return self.r_squared >= 0.99 return True @icontract.invariant(lambda self: len(self.readings) >= 2, "Need at least 2 calibration points") @icontract.invariant(r_squared_is_consistent, "R² must match actual fit quality") @icontract.invariant(approved_has_good_r_squared, "Approved curves need R² >= 0.99") class CalibrationCurve(BaseModel): curve_id: str analyst_id: str readings: list[CalibrationPoint] slope: float intercept: float r_squared: float status: QCStatus = QCStatus.PENDING reviewer_notes: Optional[str] = None created_at: datetime class Config: arbitrary_types_allowed = True # ============================================================================= # DOMAIN FUNCTIONS (with icontract pre/post conditions) # ============================================================================= @icontract.require(lambda curve: curve.status == QCStatus.PENDING, "Can only validate pending curves") @icontract.ensure(lambda result: result.status in [QCStatus.VALIDATED, QCStatus.FLAGGED], "Validation must result in validated or flagged status") def validate_curve(curve: CalibrationCurve) -> CalibrationCurve: """Validate a calibration curve based on R² threshold.""" new_status = QCStatus.VALIDATED if curve.r_squared >= 0.99 else QCStatus.FLAGGED return CalibrationCurve( curve_id=curve.curve_id, analyst_id=curve.analyst_id, readings=curve.readings, slope=curve.slope, intercept=curve.intercept, r_squared=curve.r_squared, status=new_status, reviewer_notes=curve.reviewer_notes, created_at=curve.created_at ) @icontract.require(lambda curve: curve.status == QCStatus.VALIDATED, "Can only approve validated curves") @icontract.require(lambda reviewer_notes: reviewer_notes and reviewer_notes.strip(), "Reviewer notes required for approval") @icontract.ensure(lambda result: result.status == QCStatus.APPROVED, "Curve 
must be approved after approval") def approve_curve(curve: CalibrationCurve, reviewer_notes: str) -> CalibrationCurve: """Approve a validated calibration curve.""" return CalibrationCurve( curve_id=curve.curve_id, analyst_id=curve.analyst_id, readings=curve.readings, slope=curve.slope, intercept=curve.intercept, r_squared=curve.r_squared, status=QCStatus.APPROVED, reviewer_notes=reviewer_notes, created_at=curve.created_at ) # ============================================================================= # POLYFACTORY FACTORIES # ============================================================================= class CalibrationPointFactory(ModelFactory): __model__ = CalibrationPoint # Constrain concentration to non-negative values (matching Pydantic validator) concentration = Use(lambda: ModelFactory.__random__.uniform(0, 1000)) class CalibrationCurveFactory(ModelFactory): __model__ = CalibrationCurve @classmethod def build(cls, **kwargs): # Generate readings that produce a valid fit readings = kwargs.get('readings') or [ CalibrationPointFactory.build(concentration=float(i), response=float(i * 2.5 + 1.0)) for i in range(5) ] # Calculate actual fit parameters conc = np.array([r.concentration for r in readings]) resp = np.array([r.response for r in readings]) slope, intercept = np.polyfit(conc, resp, 1) pred = slope * conc + intercept ss_res = np.sum((resp - pred) ** 2) ss_tot = np.sum((resp - np.mean(resp)) ** 2) r_squared = 1 - (ss_res / ss_tot) if ss_tot > 0 else 1.0 return super().build( readings=readings, slope=slope, intercept=intercept, r_squared=r_squared, **{k: v for k, v in kwargs.items() if k not in ['readings', 'slope', 'intercept', 'r_squared']} ) # ============================================================================= # TESTS # ============================================================================= class TestPolyfactoryCoverage: """Tests using Polyfactory's systematic coverage.""" def test_qc_workflow_all_combinations(self): """Test QC workflow with polyfactory coverage - structural edge cases. Note: We iterate over QCStatus values manually because CalibrationCurve has complex invariants (R² consistency, minimum readings) that coverage() can't satisfy automatically. This demonstrates intentional structural coverage of the state machine. 
""" tested_statuses = [] for status in QCStatus: # Build a valid curve with this status curve = CalibrationCurveFactory.build(status=status) tested_statuses.append(status) if curve.status == QCStatus.PENDING: validated = validate_curve(curve) assert validated.status in [QCStatus.VALIDATED, QCStatus.FLAGGED] if validated.status == QCStatus.VALIDATED: approved = approve_curve(validated, "Meets all QC criteria") assert approved.status == QCStatus.APPROVED assert approved.reviewer_notes is not None # Verify we tested all status values assert set(tested_statuses) == set(QCStatus) class TestHypothesisProperties: """Property-based tests using Hypothesis.""" @given(st.builds( CalibrationPoint, concentration=st.floats(min_value=0, max_value=1000, allow_nan=False), response=st.floats(min_value=0, max_value=10000, allow_nan=False), replicate=st.integers(min_value=1, max_value=10) )) @settings(max_examples=50) def test_calibration_point_concentration_non_negative(self, point: CalibrationPoint): """Hypothesis: concentration must be non-negative.""" assert point.concentration >= 0 @given(st.builds( CalibrationPoint, concentration=st.floats(min_value=0, max_value=1000, allow_nan=False), response=st.floats(min_value=0, max_value=10000, allow_nan=False), replicate=st.integers(min_value=1, max_value=10) )) @settings(max_examples=50) def test_calibration_point_response_is_finite(self, point: CalibrationPoint): """Hypothesis: response values are finite numbers.""" assert np.isfinite(point.response) class TestIcontractInvariants: """Tests verifying icontract catches invalid states.""" def test_contracts_catch_invalid_r_squared(self): """Verify contracts catch scientifically invalid R² values.""" readings = [ CalibrationPoint(concentration=1.0, response=2.5), CalibrationPoint(concentration=2.0, response=5.0), ] # Try to create a curve with fake R² that doesn't match the data with pytest.raises(icontract.ViolationError) as exc_info: CalibrationCurve( curve_id="cal-001", analyst_id="analyst-1", readings=readings, slope=2.5, intercept=0.0, r_squared=0.5, # Wrong! Actual R² is ~1.0 created_at=datetime.now() ) assert "R² must match actual fit quality" in str(exc_info.value) def test_contracts_require_minimum_readings(self): """Verify contracts require at least 2 calibration points.""" with pytest.raises(icontract.ViolationError) as exc_info: CalibrationCurve( curve_id="cal-002", analyst_id="analyst-1", readings=[CalibrationPoint(concentration=1.0, response=2.5)], # Only 1! slope=2.5, intercept=0.0, r_squared=1.0, created_at=datetime.now() ) assert "Need at least 2 calibration points" in str(exc_info.value) def test_validate_requires_pending_status(self): """Verify validate_curve requires pending status.""" readings = [ CalibrationPointFactory.build(concentration=float(i), response=float(i * 2.5 + 1.0)) for i in range(5) ] conc = np.array([r.concentration for r in readings]) resp = np.array([r.response for r in readings]) slope, intercept = np.polyfit(conc, resp, 1) curve = CalibrationCurve( curve_id="cal-003", analyst_id="analyst-1", readings=readings, slope=slope, intercept=intercept, r_squared=1.0, status=QCStatus.VALIDATED, # Not pending! created_at=datetime.now() ) with pytest.raises(icontract.ViolationError) as exc_info: validate_curve(curve) assert "Can only validate pending curves" in str(exc_info.value)
Now we run the tests with pytest:
cd ~/projects/lab-data && poetry run pytest test_data_model_integration.py -vvvv -q --disable-warnings --tb=short 2>&1
============================= test session starts ==============================
platform darwin -- Python 3.11.6, pytest-9.0.2, pluggy-1.6.0 -- ~/projects/lab-data/.venv/bin/python
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: ~/projects/lab-data
configfile: pyproject.toml
plugins: Faker-37.11.0, hypothesis-6.142.3
collecting ... collected 6 items

test_data_model_integration.py::TestPolyfactoryCoverage::test_qc_workflow_all_combinations PASSED [ 16%]
test_data_model_integration.py::TestHypothesisProperties::test_calibration_point_concentration_non_negative PASSED [ 33%]
test_data_model_integration.py::TestHypothesisProperties::test_calibration_point_response_is_finite PASSED [ 50%]
test_data_model_integration.py::TestIcontractInvariants::test_contracts_catch_invalid_r_squared PASSED [ 66%]
test_data_model_integration.py::TestIcontractInvariants::test_contracts_require_minimum_readings PASSED [ 83%]
test_data_model_integration.py::TestIcontractInvariants::test_validate_requires_pending_status PASSED [100%]

======================== 6 passed, 3 warnings in 1.40s =========================
6.6. Caveats with icontract
An honest caveat: icontract has the same opacity problem as @model_validator. An @invariant is arbitrary Python – it does not serialize to Avro, JSON Schema, or Protobuf either. Moving the logic out of Pydantic doesn't make it visible to downstream consumers; it just stops it from polluting the schema export. If a downstream service in another language needs to enforce "R² matches the fit", you still have to re-implement it there (or push the check into a shared validation service). This is a real limit of any in-process invariant approach – the alternative is consumer-driven contract testing (Pact and friends), which is a different toolchain entirely and out of scope here.
And a runtime cost. @invariant checks run on every method call on the instance, not just construction. The r_squared_is_consistent check above recomputes R² with numpy on every invocation; on a hot path (a Kafka consumer processing thousands of messages per second) this is a real cost. Note that icontract does not piggyback on assert – it raises ViolationError – but whether python -O strips the checks depends on each contract's enabled argument, which defaults to __debug__. The explicit knobs are that enabled argument, icontract.SLOW (driven by the ICONTRACT_SLOW environment variable) for gating expensive checks, or building two configurations of your application. Either way: if you switch contracts off in prod, your "runtime safety net" is only a test-time safety net, and you need to be honest with yourself about that.
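A minimal sketch of gating the expensive invariant behind icontract.SLOW while keeping the cheap structural check on everywhere (assuming the r_squared_is_consistent helper from above; fields as in the earlier definition):

import icontract
from pydantic import BaseModel

@icontract.invariant(lambda self: len(self.readings) >= 2,
                     "Need at least 2 calibration points")    # cheap: always on
@icontract.invariant(r_squared_is_consistent,
                     "R² must match actual fit quality",
                     enabled=icontract.SLOW)                   # expensive: only when ICONTRACT_SLOW is set
class CalibrationCurve(BaseModel):
    # ... fields as in the earlier definition ...
    ...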
6.7. Alternatives to runtime contract enforcement
Before reaching for runtime contracts, three other approaches deserve a serious look:
- Schema-first with codegen. Treat Avro/Protobuf/JSON Schema as the source of truth and generate Pydantic (or your language's equivalent) from it. Cross-system drift becomes structurally impossible because there's only one definition. This is the right answer when your data crosses many language boundaries. Its weakness is exactly the one this post is about: schema languages can't express cross-field invariants either, so you still need something for "R² must match the fit."
- Consumer-driven contract testing (Pact and friends). Push the enforcement to the boundary between services rather than inside any one of them. This is the right answer when the question is "do producer and consumer agree?" It's the wrong answer when the question is "is this single object internally coherent?", which is what we have here.
- In-process invariants (icontract, @model_validator, plain assertions in __init__). Cheapest to add, lives next to the data, and – as discussed above – invisible to anything outside your Python process.
These are not mutually exclusive. A mature system typically has schema-first definitions at the wire, CDC tests at the service boundaries, and in-process invariants for the rules that live entirely inside one bounded context. This post is about that last layer.
7. Conclusion
The example here is a calibration curve, but the pattern is not domain-specific. Anywhere a data model carries a rule the type system can't express, the same three layers apply:
- An Order where discount <= subtotal, tax is a function of subtotal - discount, and total is the sum. Pydantic accepts any three floats; only a cross-field check rejects the inconsistent invoice.
- An OAuthToken where the granted scopes must be a subset of the client's allowed_scopes, and expires_at > issued_at. Each field is structurally fine in isolation.
- An inventory StockMovement where on_hand_after = on_hand_before + delta. Off-by-one in a service layer produces a "valid" object that silently corrupts every downstream report.
In each case the bug looks the same as the calibration-curve bug: every field passes its own validator, the JSON serialises cleanly, and the wrongness only shows up when you ask whether the fields agree with each other. That is the bug class this stack is for.
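To show the pattern travelling outside the lab, here is a minimal sketch of the Order case with icontract (the field names are the hypothetical ones from the list above):

import icontract
from pydantic import BaseModel

def totals_are_consistent(self) -> bool:
    """total must equal (subtotal - discount) + tax, within a float tolerance."""
    return abs(self.total - ((self.subtotal - self.discount) + self.tax)) < 0.01

@icontract.invariant(lambda self: 0 <= self.discount <= self.subtotal,
                     "Discount cannot exceed subtotal")
@icontract.invariant(totals_are_consistent,
                     "Total must agree with subtotal, discount and tax")
class Order(BaseModel):
    subtotal: float
    discount: float
    tax: float
    total: float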
The three layers, each catching a different class:
| Technique | Catches | When |
|---|---|---|
| Polyfactory | Structural combinations | Test generation |
| Hypothesis | Value-level edge cases | Test execution |
| icontract | Cross-field invariants | Runtime |
Three independent failure modes, three independent tools. Start with polyfactory's coverage() for structural completeness. Add Hypothesis for value-level probing. Use icontract for invariants that can't be expressed in types – the swapped slope and intercept, the discount > subtotal, the StockMovement that doesn't add up.
8. TLDR
TLDR: A modest Pydantic model – a dozen fields with a few enums and optionals – has 7,680 valid structural shapes. Your tests probably cover four of them. This post is a three-layer pattern for closing that gap – and an honest accounting of where each layer does and doesn't earn its keep.
Polyfactory's coverage() automates 1-way structural partition coverage so you stop hand-writing fixtures for "every enum value × every nullable state". Hypothesis adds value-level probing – boundary floats, NaN, unicode – and shrinks failing cases to minimal examples. icontract enforces cross-field invariants (like "R² must match the actual fit") that no type system or serializable schema can express.
The worked example is a scientific calibration curve, with all three tools running in a real pytest file you can copy. The post is also explicit about the limits: equivalence partitioning has been standard since the 1970s, @invariant has the same schema-opacity problem as @model_validator, and runtime contracts cost real CPU on hot paths.
Footnotes:
¹ Data standards are the connective tissue of cross-system integration, and in my experience the right default is to use them all the way down – as the source of truth, not as a downstream artefact. You don't necessarily write the schema files by hand; they can be generated from Python (e.g. from Pydantic models). But if you go that route, it is essential that every constraint your in-memory model enforces also appears in the exported schema. Anything that doesn't survive serialisation is a constraint your downstream consumers cannot see.