Live Life on the Edge: A Layered Strategy for Testing Data Models
Table of Contents
- 1. About
- 2. The Combinatorial Explosion Problem
- 3. Property-Based Testing with Polyfactory
- 4. Value-Level Testing with Hypothesis
- 5. The Testing Gap: When Models Aren't Enough
- 6. Design by Contract with icontract
- 7. Conclusion
- 8. TLDR
1. About
Figure 1: JPEG produced with DALL-E 4o
Data models live everywhere in a modern system: function signatures, bounded-context boundaries, database schemas, event payloads on the wire, and so on. And for good reason – the model isn't just a schema, it's an executable specification of how the system works. The criticality of data modelling has led to its ubiquity, and you will find a daunting proliferation of models throughout your software system, especially in an enterprise where a bias towards classic integration patterns urges developers to exhaustively model every payload and interface. I call this the 'Model Everywhere Problem'.
But if models are so great, why is their pervasiveness a 'problem'?
Consider that a modest data model – a dozen fields, a few enums and optionals – has thousands of structural states. This set of permissible instances is the model's 'state space', and large state spaces put demands on your testing suite: the number of test cases you need grows with the state space.
On every engineering team I've worked on, I've noticed that these state spaces are almost never tested exhaustively; instead, only a few hardcoded instances of the model are used for testing. Tragically, this causes avoidable issues once the production system begins processing real data with untested edge cases. This is a problematic testing gap, and it is why the state space of a model – not just the happy path of a golden test instance – is the right unit to reason about when you write tests.
[the state space] of a model – not just its happy path – is the right unit to reason about when you write tests.
Using Python as the example language, this post walks through a three-layer strategy that closes that gap, and – importantly – shows where each layer earns its keep and where it doesn't:
- Polyfactory – automates structural partition coverage (every enum value, every optional state) without exploding into a Cartesian product
- Hypothesis – probes value-level edge cases (boundary floats, unicode, NaN) with shrinking
- icontract – enforces cross-field invariants that types and serializable schemas can't express
None of these tools are new. Hypothesis has been around since 2013; the ideas behind icontract (Design by Contract) date to Eiffel in 1986; equivalence partitioning and pairwise testing have been standard since Myers' Art of Software Testing. The contribution in this blog post is the layering pattern: when to reach for which tool, applied to a real scientific data model.
Versions used in this post: Python 3.11, Pydantic 2.x, Polyfactory 2.x, Hypothesis 6.x, icontract 2.x.
2. The Combinatorial Explosion Problem
Even modest models have enormous structural state spaces. Consider a realistic Pydantic model for a spectroscopy reading:
2.1. Growth Analysis
class SpectroscopyReading(BaseModel):
    # Required fields
    reading_id: str
    instrument_id: str

    # Optional fields
    wavelength_nm: Optional[float] = None
    temperature_K: Optional[float] = None
    pressure_atm: Optional[float] = None
    notes: Optional[str] = None

    # Enums
    sample_type: SampleType = SampleType.EXPERIMENTAL           # 3 values
    status: ReadingStatus = ReadingStatus.PENDING               # 4 values: PENDING, VALIDATED, FLAGGED, REJECTED
    instrument_mode: InstrumentMode = InstrumentMode.STANDARD   # 5 values: STANDARD, HIGH_RES, FAST, CALIBRATION, DIAGNOSTIC

    # Booleans
    is_validated: bool = True
    requires_review: bool = False
    is_replicate: bool = False
Now let's compute the combinations:
| Field | States |
|---|---|
| reading_id | 1 |
| instrument_id | 1 |
| wavelength_nm | 2 |
| temperature_K | 2 |
| pressure_atm | 2 |
| notes | 2 |
| sample_type | 3 |
| status | 4 |
| instrument_mode | 5 |
| is_validated | 2 |
| requires_review | 2 |
| is_replicate | 2 |
\[ \text{Combinations} = 1 \times 1 \times 2^4 \times 3 \times 4 \times 5 \times 2^3 = 16 \times 60 \times 8 = 7,680 \]
A handful of realistic fields produces 7,680 combinations. And this is just structural – we haven't even considered value-level edge cases (negative wavelengths, sub-absolute-zero temperatures, NaN concentrations) yet.
2.2. A Caveat: Nobody Tests the Cartesian Product
Each tool answers a question the previous one couldn't.
Before going further, the obvious objection: 7,680 combinations is not 7,680 tests you need to write. Equivalence partitioning (Myers, 1979) and combinatorial / pairwise testing (NIST has decades of work on $t$-way coverage) tell us that most "combinations" share a code path, and that 1-way and 2-way coverage catch the overwhelming majority of interaction bugs. The 7,680 number is the state space a tester has to reason about, not the number of cases to enumerate.
The point of the tools below is not brute force. Polyfactory's coverage() is essentially automated 1-way (every value of every field is exercised once); Hypothesis adds value-level probing where partition boundaries actually matter; icontract adds the cross-field rules that no equivalence partition can express. Each tool answers a question the previous one couldn't.
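To make that distinction concrete, here is a quick back-of-the-envelope calculation (my own illustration, not any library's API): the full Cartesian product of the SpectroscopyReading fields is 7,680, while 1-way coverage needs only as many instances as the largest field domain.

import math

# Structural states per field (from the table above)
field_states = {
    "wavelength_nm": 2, "temperature_K": 2, "pressure_atm": 2, "notes": 2,  # optionals
    "sample_type": 3, "status": 4, "instrument_mode": 5,                    # enums
    "is_validated": 2, "requires_review": 2, "is_replicate": 2,             # booleans
}

full_product = math.prod(field_states.values())  # every combination
one_way = max(field_states.values())             # 1-way (lockstep-style) coverage
print(full_product, one_way)                     # 7680 5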
3. Property-Based Testing with Polyfactory
Polyfactory is a library that generates mock data from Pydantic models (and other schemas). Instead of hand-writing test fixtures, you define a factory and let polyfactory generate valid instances.
3.1. Basic Usage: The Build Method
The build() method creates a single instance with randomly generated values that satisfy your model's constraints:
from polyfactory.factories.pydantic_factory import ModelFactory
from pydantic import BaseModel
from typing import Optional
from enum import Enum

class SampleType(str, Enum):
    CONTROL = "control"
    EXPERIMENTAL = "experimental"
    CALIBRATION = "calibration"

class Sample(BaseModel):
    sample_id: str
    experiment_id: str
    concentration_mM: Optional[float] = None
    sample_type: SampleType = SampleType.EXPERIMENTAL
    is_validated: bool = True

class SampleFactory(ModelFactory):
    __model__ = Sample

# Generate a random valid sample
sample = SampleFactory.build()
print(sample)

# Override specific fields
control_sample = SampleFactory.build(sample_type=SampleType.CONTROL, is_validated=True)
sample_id='aqgvWnszGSMJWogQRmWa' experiment_id='zPxKVxoDzwanDbZmhDNL' concentration_mM=None sample_type=<SampleType.CALIBRATION: 'calibration'> is_validated=False
Every call to build() gives you a valid instance. This is already powerful for unit tests where you need realistic test data without hand-crafting it.
3.2. Systematic Coverage: The Coverage Method
Here's where polyfactory really shines. The coverage() method generates multiple instances designed to cover all the structural variations of your model:
# Generate instances covering all structural variations
for sample in SampleFactory.coverage():
    print(f"type={sample.sample_type}, conc={'set' if sample.concentration_mM is not None else 'None'}, valid={sample.is_validated}")
type=SampleType.CONTROL, conc=set, valid=True
type=SampleType.EXPERIMENTAL, conc=None, valid=True
type=SampleType.CALIBRATION, conc=set, valid=False
The coverage() method systematically generates instances, but notice something important: we only got 3 instances, not the 12 (2 optional states × 3 enum values × 2 boolean states) that Sample's structural state space contains. This is by design.
3.2.1. How coverage() Actually Works
Polyfactory's coverage() uses a lockstep algorithm rather than a full Cartesian product. Here's how it works:
- Each field gets a CoverageContainer that holds all possible values for that field (enum members, True/False for booleans, value/None for optionals)
- All containers advance in parallel on each iteration – the first instance picks the first value from every container, the second instance picks the second from every container, and so on. Containers shorter than the longest one wrap around
- Iteration stops when the longest container has been fully consumed – meaning every individual value of every field has been seen at least once
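To make the lockstep idea concrete, here is a minimal sketch of the iteration scheme (my own illustration of the idea, not Polyfactory's implementation – the exact value pairings Polyfactory produces may differ):

from itertools import cycle

def lockstep(field_values: dict[str, list]) -> list[dict]:
    """Advance every field's value list in parallel, wrapping the shorter
    lists, until the longest list has been consumed exactly once."""
    longest = max(len(values) for values in field_values.values())
    iterators = {name: cycle(values) for name, values in field_values.items()}
    return [{name: next(it) for name, it in iterators.items()} for _ in range(longest)]

# Three instances: one per value of the longest container (sample_type)
for instance in lockstep({
    "sample_type": ["control", "experimental", "calibration"],
    "concentration_mM": [1.0, None],
    "is_validated": [True, False],
}):
    print(instance)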
This produces a representative sample that guarantees:
- Every enum value appears at least once
- Both True and False appear for boolean fields
- Both present and None states appear for optional fields
But it does not guarantee every combination is tested. In our 3 instances:
| sample_type | concentration_mM | is_validated |
|---|---|---|
| CONTROL | set | True |
| EXPERIMENTAL | None | True |
| CALIBRATION | set | False |
All enum values are covered. Both optional states (set/None) appear. Both boolean states appear. But we didn't test CONTROL with is_validated=False, for example. This is the deliberate trade-off: coverage() guarantees every individual value of every field is exercised, but misses interaction bugs that only manifest with specific combinations. For validation logic where fields are processed independently, that's enough. For complex interactions, supplement with targeted cases (see the sketch below) or reach for Hypothesis.
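If a specific interaction matters, don't rely on coverage() to stumble onto it – spell it out. A minimal sketch of a targeted 2-way check over sample_type × is_validated, assuming the Sample and SampleFactory definitions from above:

import itertools
import pytest

@pytest.mark.parametrize(
    "sample_type,is_validated",
    list(itertools.product(SampleType, [True, False])),  # all 3 × 2 = 6 pairs
)
def test_sample_type_validation_pairs(sample_type, is_validated):
    sample = SampleFactory.build(sample_type=sample_type, is_validated=is_validated)
    # Put your actual interaction rule here; as a placeholder, the pair round-trips intact
    assert (sample.sample_type, sample.is_validated) == (sample_type, is_validated)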
3.3. A Practical Example
Let's say we have a function that determines analysis priority based on sample attributes:
def determine_priority(sample: Sample) -> str:
    """Determine analysis priority based on sample type and validation status."""
    # Calibration samples are always high priority
    if sample.sample_type == SampleType.CALIBRATION:
        return "high"

    # Unvalidated samples need review first
    if not sample.is_validated:
        raise ValueError("Sample must be validated before analysis")

    # Control samples with known concentration get medium priority
    if sample.sample_type == SampleType.CONTROL and sample.concentration_mM is not None:
        return "medium"

    return "normal"
We can test this systematically using coverage():
import pytest

def test_priority_all_sample_variations():
    """Test priority determination across all sample variations."""
    results = []
    for sample in SampleFactory.coverage():
        # Calibration samples are high priority regardless of validation status
        if sample.sample_type == SampleType.CALIBRATION:
            priority = determine_priority(sample)
            if priority == "high":
                results.append(f"PASS: {sample.sample_type.value}, validated={sample.is_validated} -> {priority}")
            else:
                results.append(f"FAIL: {sample.sample_type.value} expected 'high', got '{priority}'")
        elif not sample.is_validated:
            try:
                determine_priority(sample)
                results.append(f"FAIL: {sample.sample_type.value}, validated={sample.is_validated} - expected ValueError")
            except ValueError:
                results.append(f"PASS: {sample.sample_type.value}, validated={sample.is_validated} - correctly raised ValueError")
        else:
            priority = determine_priority(sample)
            if priority in ["high", "medium", "normal"]:
                results.append(f"PASS: {sample.sample_type.value}, validated={sample.is_validated} -> {priority}")
            else:
                results.append(f"FAIL: {sample.sample_type.value} got invalid priority '{priority}'")
    return results

# Run the test and display results
print("Testing priority determination across all sample variations:")
print("-" * 60)
for result in test_priority_all_sample_variations():
    print(result)
print("-" * 60)
print("All variations tested!")
With a simple call to SampleFactory.coverage(), this single test exercises every individual field value of our model. If we add new enum values or optional fields later, the test automatically expands to cover them.
3.4. The Reusable Fixture Pattern
Here's where things get powerful. We can create a reusable pytest fixture that applies this coverage-based testing pattern to any Pydantic model:
import pytest
from typing import Type, Iterator, TypeVar
from pydantic import BaseModel
from polyfactory.factories.pydantic_factory import ModelFactory

T = TypeVar("T", bound=BaseModel)

def create_factory(model: Type[T]) -> Type[ModelFactory[T]]:
    """Dynamically create a factory for any Pydantic model."""
    return type(f"{model.__name__}Factory", (ModelFactory,), {"__model__": model})

@pytest.fixture
def model_coverage(request) -> Iterator[BaseModel]:
    """
    Reusable fixture that yields all structural variations of a model.

    Usage:
        @pytest.mark.parametrize("model_class", [Sample, Measurement, Experiment])
        def test_serialization(model_coverage, model_class):
            for instance in model_coverage:
                assert instance.model_dump_json()
    """
    model_class = request.param
    factory = create_factory(model_class)
    yield from factory.coverage()

# Now testing ANY model is trivial:
@pytest.mark.parametrize("model_class", [Sample, SpectroscopyReading, Experiment])
def test_all_models_serialize(model_class):
    """Every model variation must serialize to JSON."""
    factory = create_factory(model_class)
    for instance in factory.coverage():
        json_str = instance.model_dump_json()
        restored = model_class.model_validate_json(json_str)
        assert restored == instance
This pattern is massively scalable. Add a new model to your codebase? Just add it to the parametrize list and you instantly get full structural coverage. The investment in the pattern pays dividends as your codebase grows.
4. Value-Level Testing with Hypothesis
Polyfactory handles structural variations, but what about value-level edge cases? What happens when wavelength_nm is 0, or negative, or larger than the observable universe? This is where Hypothesis comes in.
Hypothesis is a property-based testing library. Instead of specifying exact test cases, you describe properties that should hold for any valid input, and Hypothesis generates hundreds of random inputs to try to break your code.
4.1. The @given Decorator
The @given decorator tells Hypothesis what kind of data to generate:
from hypothesis import given, strategies as st, settings

@given(st.integers())
@settings(max_examples=10)  # Limit for demo
def test_absolute_value_is_non_negative(n):
    """Property: absolute value is always >= 0"""
    assert abs(n) >= 0

@given(st.text())
@settings(max_examples=10)  # Limit for demo
def test_string_reversal_is_reversible(s):
    """Property: reversing twice gives original"""
    assert s[::-1][::-1] == s

# Run the tests and show output
print("Running Hypothesis tests:")
print("-" * 60)

try:
    test_absolute_value_is_non_negative()
    print("PASS: test_absolute_value_is_non_negative - all generated integers passed")
except AssertionError as e:
    print(f"FAIL: test_absolute_value_is_non_negative - {e}")

try:
    test_string_reversal_is_reversible()
    print("PASS: test_string_reversal_is_reversible - all generated strings passed")
except AssertionError as e:
    print(f"FAIL: test_string_reversal_is_reversible - {e}")

print("-" * 60)
By default Hypothesis generates ~100 integers/strings per test run (capped at 10 above for the demo), including edge cases like 0, negative numbers, empty strings, unicode, etc.
4.2. The Chaos Hypothesis Unleashes
Hypothesis doesn't just generate "normal" test data – it actively tries to break your code with the most cursed inputs imaginable. Real-world string inputs are reliably worse than you imagine.
@given(st.text())
def test_sample_notes_field(notes: str):
    """What could go wrong with a simple notes field?"""
    # Assumes Sample has an Optional[str] notes field (not shown in the earlier definition)
    # and process_sample is whatever handler consumes it.
    sample = Sample(
        sample_id="test-001",
        experiment_id="exp-001",
        notes=notes  # Oh no.
    )
    process_sample(sample)
Hypothesis will helpfully try:
- notes="" – The empty string. Classic.
- notes="\x00\x00\x00" – Null bytes. Because why not?
- notes="🧪🔬🧬💉" – Your sample notes are now emoji. The lab notebook of the future.
- notes="a" * 10_000_000 – Ten million 'a's. Hope you're not logging this.
- notes="\n\n\n\n\n" – Just vibes (and newlines).
- notes="ñoño" – Unicode normalization enters the chat.
- notes="🏳️‍🌈" – A single "character" that's actually 4 code points (white flag + variation selector + ZWJ + rainbow). Grapheme clusters: surprise!
Your function either handles these gracefully or you discover bugs you never knew you had. Usually the latter.
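Once Hypothesis surfaces one of these, pin it so every future run re-checks it. A minimal sketch using Hypothesis's @example decorator (again assuming Sample has a notes field and process_sample is your handler):

from hypothesis import example, given, strategies as st

@given(st.text())
@example("")              # the empty string, forever
@example("\x00\x00")      # the null bytes that bit us once
@example("🏳️‍🌈")            # the multi-code-point grapheme cluster
def test_sample_notes_field_regressions(notes: str):
    sample = Sample(sample_id="test-001", experiment_id="exp-001", notes=notes)
    process_sample(sample)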
4.3. Combining Hypothesis with Pydantic
The real power comes from combining Hypothesis with our data models. Hypothesis has a from_type() strategy that can generate instances of Pydantic models:
from hypothesis import given, strategies as st
from hypothesis import settings

@given(st.from_type(Sample))
@settings(max_examples=20)  # Reduced for demo output
def test_sample_serialization_roundtrip(sample: Sample):
    """Property: serializing and deserializing preserves data"""
    json_str = sample.model_dump_json()
    restored = Sample.model_validate_json(json_str)
    assert restored == sample

# Run and show output
print("Testing Sample serialization roundtrip with Hypothesis:")
print("-" * 60)
try:
    test_sample_serialization_roundtrip()
    print("PASS: All 20 generated Sample instances serialized correctly")
except AssertionError as e:
    print(f"FAIL: {e}")
except Exception as e:
    print(f"ERROR: {type(e).__name__}: {e}")
print("-" * 60)
This test generates random valid Sample instances and verifies that JSON serialization works correctly for all of them.
4.4. Custom Strategies for Domain Constraints
Sometimes we need more control over generated values. In scientific domains, this is critical – our data has physical meaning, and randomly generated values often violate physical laws.
Let me show you what I mean with spectroscopy data:
from hypothesis import given, strategies as st, assume

# Strategy for wavelengths (must be positive, typically 200-1100nm for UV-Vis)
valid_wavelength = st.floats(min_value=200.0, max_value=1100.0, allow_nan=False)

# Strategy for temperature (above absolute zero, below plasma)
valid_temperature = st.floats(min_value=0.001, max_value=10000.0, allow_nan=False)

# Strategy for concentration (non-negative, physically reasonable)
valid_concentration = st.one_of(
    st.none(),
    st.floats(min_value=0.0, max_value=1000.0, allow_nan=False)  # millimolar
)

# Strategy for pressure (vacuum to high pressure, in atmospheres)
valid_pressure = st.floats(min_value=0.0, max_value=1000.0, allow_nan=False)

# Composite strategy with inter-field constraints
@st.composite
def spectroscopy_reading_strategy(draw):
    """Generate physically plausible spectroscopy readings."""
    wavelength = draw(valid_wavelength)
    pressure = draw(valid_pressure)

    # Domain constraint: at very low pressure, temperature readings are unreliable
    # (this is a real thing in vacuum spectroscopy!)
    # Draw temperature *conditionally* on pressure rather than drawing-then-filtering,
    # so we don't burn Hypothesis's example budget on rejected draws.
    if pressure < 0.01:
        temperature = draw(st.floats(min_value=100.0, max_value=10000.0, allow_nan=False))
    else:
        temperature = draw(valid_temperature)

    return SpectroscopyReading(
        reading_id=draw(st.text(min_size=1, max_size=50).filter(str.strip)),
        instrument_id=draw(st.sampled_from(["UV-1800", "FTIR-4600", "Raman-532"])),
        wavelength_nm=wavelength,
        temperature_K=temperature,
        pressure_atm=pressure,
        sample_type=draw(st.sampled_from(SampleType)),
        is_validated=draw(st.booleans())
    )

@given(spectroscopy_reading_strategy())
def test_reading_within_physical_bounds(reading: SpectroscopyReading):
    """Property: all readings must be physically plausible"""
    if reading.wavelength_nm is not None:
        assert reading.wavelength_nm > 0, "Negative wavelength is not a thing"
    if reading.temperature_K is not None:
        assert reading.temperature_K > 0, "Below absolute zero? Bold claim."
The key insight here is that scientific data has semantic constraints that go beyond type checking. A float can hold any value, but a wavelength of -500nm or a temperature of -273K is physically impossible. Custom strategies let us encode this domain knowledge.
4.5. Shrinking: Finding Minimal Failing Cases
One of Hypothesis's killer features is shrinking. When it finds a failing test case, it automatically simplifies it to find the minimal example that still fails. Instead of a failing case like:
SpectroscopyReading(reading_id='xK8jP2mQrS...', wavelength_nm=847293.7, temperature_K=9999.9, ...)
Hypothesis will shrink it to something like:
SpectroscopyReading(reading_id='a', wavelength_nm=1101.0, temperature_K=0.0, ...)
This makes debugging much easier – you immediately see that wavelength_nm=1101.0 (just outside our UV-Vis range) is the problem, not the giant random string.
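You can watch shrinking happen with a deliberately broken property – a minimal sketch (the bound is arbitrary, chosen only to make the test fail):

from hypothesis import given, strategies as st

@given(st.floats(min_value=0.0, max_value=2000.0, allow_nan=False))
def test_wavelength_fits_uv_vis(wavelength_nm: float):
    # Deliberately wrong: UV-Vis only covers up to ~1100 nm, but the strategy goes to 2000
    assert wavelength_nm <= 1100.0

# Under pytest, Hypothesis prints a "Falsifying example" and shrinks it towards the
# boundary, so the report shows a value just above 1100.0 rather than an arbitrary
# large float it happened to draw first.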
4.6. What Hypothesis Actually Probes
The reason Hypothesis finds bugs naive random sampling misses isn't volume – it's targeting. Under the hood, st.floats() biases its draws toward values that historically break code: boundary values (min, max, just inside, just outside), zero, negative zero, the smallest positive subnormal, inf, -inf, and NaN when allowed. st.text() biases toward the empty string, single characters, surrogate pairs, and combining marks. Each strategy carries a "this is what bugs look like in this domain" prior.
Combine that with shrinking and you get the property-based testing loop: explore aggressively, fail fast, then minimise the failure to something a human can read in one line. Naive uniform random gives you neither.
4.7. When Hypothesis Isn't the Right Tool
Hypothesis is not 'free'. Specific cases where it is less likely to earn its keep:
- Stateful workflows with expensive setup. If each example needs a fresh database, an HTTP fixture, or a multi-second container, 100 examples per test blows your CI budget. Use targeted cases or @reproduce_failure for regression pinning instead.
- Impure code with hidden state. Shrinking assumes failures are reproducible from the shrunk input alone. If the failure depends on global mutable state, you get confusing minimised cases that don't actually fail in isolation.
- Properties you can't actually state. "The function should return the right answer" is not a property. If the only oracle you have is another implementation, you have differential testing, not property testing – and that's a different (still useful) technique.
- When a typed total function would do. A pure function with a tight signature and equivalence partitioning may not need a property at all. Don't reach for Hypothesis as a status symbol.
5. The Testing Gap: When Models Aren't Enough
We've covered structural combinations with polyfactory and value-level edge cases with Hypothesis. This is powerful, but there's still a gap: runtime invariants that can't be expressed in the type system.
Consider this example from analytical chemistry:
class CalibrationCurve(BaseModel):
    readings: list[CalibrationPoint]
    r_squared: float
    slope: float
    intercept: float

    @field_validator('r_squared')
    @classmethod
    def validate_r_squared(cls, v):
        if not 0 <= v <= 1:
            raise ValueError('R² must be between 0 and 1')
        return v
Pydantic validates that r_squared is between 0 and 1. But what about this invariant?
The r_squared must be calculated from the actual readings using the slope and intercept.
This is a cross-field constraint – it depends on the relationship between multiple fields. And it's not just about validation at construction time. What if r_squared gets calculated incorrectly in our curve-fitting logic?
5.1. Scientific Logic Errors
Consider this function:
def recalculate_curve(curve: CalibrationCurve, new_reading: CalibrationPoint) -> CalibrationCurve:
    """Add a new calibration point and recalculate the curve."""
    all_readings = curve.readings + [new_reading]
    slope, intercept, r_squared = fit_linear_regression(all_readings)

    # BUG: accidentally swapped slope and intercept
    return CalibrationCurve(
        readings=all_readings,
        r_squared=r_squared,
        slope=intercept,    # BUG: wrong assignment!
        intercept=slope     # BUG: wrong assignment!
    )
This code has a subtle bug: the slope and intercept are swapped. Each field individually is a valid float, so Pydantic validation passes. But any concentration calculated from this curve will be wildly wrong.
Our Pydantic validation passes because each field is individually valid. Our Hypothesis tests might not catch this because they test properties at the data structure level, not scientific invariants.
5.1.1. Why Not Pydantic Validators?
You might be thinking: "Can't we add a @model_validator to Pydantic that checks if r_squared matches the fit?" Technically, yes:
class CalibrationCurve(BaseModel):
    # ... fields ...

    @model_validator(mode='after')
    def validate_r_squared_consistency(self) -> Self:
        # Check that r_squared matches the actual fit
        calculated_r2 = compute_r_squared(self.readings, self.slope, self.intercept)
        if abs(self.r_squared - calculated_r2) > 0.001:
            raise ValueError("R² doesn't match the fit")
        return self
But this approach has a significant drawback: custom validators don't serialize to standard schema formats¹.
In data engineering, your Pydantic models often need to export schemas for:
- Avro (schema registries for Kafka)
- JSON Schema (API documentation, OpenAPI specs)
- Protobuf (gRPC services)
- Database DDL (SQLAlchemy models, migrations)
These formats support type constraints and basic validation (nullable, enums, numeric ranges), but they have no way to represent arbitrary Python code like "R² must be computed from readings using least-squares regression."
Embedding complex validation logic in your model validators means:
- The schema your consumers see is incomplete – it shows the fields but not the invariants
- Other systems can't validate data independently – they must call your Python code
- Schema evolution becomes fragile – changes to validation logic don't appear in schema diffs
By keeping Pydantic models "schema-clean" (only expressing constraints that can be serialized) and putting cross-field rules into runtime contracts (see below), you at least keep your serialized schema honest about what it does and doesn't enforce.
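You can see that opacity directly by exporting the schema. A minimal sketch using Pydantic's model_json_schema() on the CalibrationCurve above – the cross-field R² rule leaves no trace in the export:

import json

schema = CalibrationCurve.model_json_schema()
print(json.dumps(schema, indent=2))

# Field names and types are there; numeric ranges appear only if declared via
# Field(ge=..., le=...). The model_validator's "R² must match the fit" rule is
# nowhere in the output.
assert "validate_r_squared_consistency" not in json.dumps(schema)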
This is where Design by Contract comes in.
6. Design by Contract with icontract
icontract brings Design by Contract (DbC) to Python. DbC is a methodology where you specify:
- Preconditions: What must be true before a function runs
- Postconditions: What must be true after a function runs
- Invariants: What must always be true about a class
If any condition is violated at runtime, you get an immediate, informative error.
6.1. Preconditions with @require
Preconditions specify what callers must guarantee:
import icontract

@icontract.require(lambda curve: len(curve.readings) >= 2,
                   "Need at least 2 points to fit a curve")
@icontract.require(lambda new_reading: new_reading.concentration >= 0,
                   "Concentration must be non-negative")
def recalculate_curve(curve: CalibrationCurve, new_reading: CalibrationPoint) -> CalibrationCurve:
    """Add a new calibration point and recalculate the curve."""
    all_readings = curve.readings + [new_reading]
    slope, intercept, r_squared = fit_linear_regression(all_readings)
    return CalibrationCurve(
        readings=all_readings,
        r_squared=r_squared,
        slope=slope,
        intercept=intercept
    )
If someone calls recalculate_curve with only one reading, they get an immediate ViolationError explaining which precondition failed.
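What that looks like at the call site – a minimal sketch, assuming the plain CalibrationCurve and CalibrationPoint models from earlier:

import icontract

one_point_curve = CalibrationCurve(
    readings=[CalibrationPoint(concentration=1.0, response=2.5)],
    r_squared=1.0, slope=2.5, intercept=0.0,
)

try:
    recalculate_curve(one_point_curve, CalibrationPoint(concentration=2.0, response=5.0))
except icontract.ViolationError as err:
    print(err)  # message names the failed precondition: "Need at least 2 points to fit a curve"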
6.2. Postconditions with @ensure
Postconditions specify what the function guarantees to return:
import icontract

@icontract.ensure(lambda result: 0 <= result.r_squared <= 1,
                  "R² must be between 0 and 1")
@icontract.ensure(
    lambda curve, result: len(result.readings) == len(curve.readings) + 1,
    "Result must have exactly one more reading"
)
@icontract.ensure(
    lambda result: result.slope != 0 or all(abs(r.response - result.intercept) < 1e-9 for r in result.readings),
    "Zero slope only valid if all responses equal intercept"
)
def recalculate_curve(curve: CalibrationCurve, new_reading: CalibrationPoint) -> CalibrationCurve:
    """Add a new calibration point and recalculate the curve."""
    all_readings = curve.readings + [new_reading]
    slope, intercept, r_squared = fit_linear_regression(all_readings)
    return CalibrationCurve(
        readings=all_readings,
        r_squared=r_squared,
        slope=slope,
        intercept=intercept
    )
Now if our function produces an invalid result – even if it passes Pydantic validation – we catch it immediately.
6.3. Class Invariants with @invariant
For data models, class invariants are particularly powerful. They specify properties that must always hold:
import icontract
from pydantic import BaseModel
import numpy as np

def r_squared_matches_fit(self) -> bool:
    """Invariant: R² must be consistent with actual readings and coefficients."""
    if len(self.readings) < 2:
        return True  # Can't verify with insufficient data

    concentrations = np.array([r.concentration for r in self.readings])
    responses = np.array([r.response for r in self.readings])
    predicted = self.slope * concentrations + self.intercept

    ss_res = np.sum((responses - predicted) ** 2)
    ss_tot = np.sum((responses - np.mean(responses)) ** 2)
    calculated_r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 1.0

    return abs(self.r_squared - calculated_r2) < 0.001  # Allow for floating point

@icontract.invariant(lambda self: len(self.readings) >= 2,
                     "Calibration needs at least 2 points")
@icontract.invariant(r_squared_matches_fit,
                     "R² must match actual fit quality")
class CalibrationCurve(BaseModel):
    readings: list[CalibrationPoint]
    r_squared: float
    slope: float
    intercept: float

    class Config:
        arbitrary_types_allowed = True
Now any CalibrationCurve instance that violates our scientific invariant will raise an error immediately – whether it's created directly, returned from a function, or modified anywhere in the system.
6.4. A Complete Example
Let's put it all together with a realistic example from a quality control workflow:
import icontract
from pydantic import BaseModel, field_validator
from typing import Optional
from enum import Enum
from datetime import datetime
import numpy as np

class QCStatus(str, Enum):
    PENDING = "pending"
    VALIDATED = "validated"
    FLAGGED = "flagged"
    REJECTED = "rejected"
    APPROVED = "approved"

class CalibrationPoint(BaseModel):
    concentration: float
    response: float
    replicate: int = 1

    @field_validator('concentration')
    @classmethod
    def validate_concentration(cls, v):
        if v < 0:
            raise ValueError('Concentration must be non-negative')
        return v

def r_squared_is_consistent(self) -> bool:
    """Invariant: R² must match the actual fit."""
    if len(self.readings) < 2:
        return True
    conc = np.array([r.concentration for r in self.readings])
    resp = np.array([r.response for r in self.readings])
    pred = self.slope * conc + self.intercept
    ss_res = np.sum((resp - pred) ** 2)
    ss_tot = np.sum((resp - np.mean(resp)) ** 2)
    calc_r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 1.0
    return abs(self.r_squared - calc_r2) < 0.001

def approved_has_good_r_squared(self) -> bool:
    """Invariant: approved curves must have R² >= 0.99."""
    if self.status == QCStatus.APPROVED:
        return self.r_squared >= 0.99
    return True

@icontract.invariant(lambda self: len(self.readings) >= 2, "Need at least 2 calibration points")
@icontract.invariant(r_squared_is_consistent, "R² must match actual fit quality")
@icontract.invariant(approved_has_good_r_squared, "Approved curves need R² >= 0.99")
class CalibrationCurve(BaseModel):
    curve_id: str
    analyst_id: str
    readings: list[CalibrationPoint]
    slope: float
    intercept: float
    r_squared: float
    status: QCStatus = QCStatus.PENDING
    reviewer_notes: Optional[str] = None
    created_at: datetime

    class Config:
        arbitrary_types_allowed = True

# Function with contracts
@icontract.require(lambda curve: curve.status == QCStatus.PENDING,
                   "Can only validate pending curves")
@icontract.ensure(lambda result: result.status in [QCStatus.VALIDATED, QCStatus.FLAGGED],
                  "Validation must result in validated or flagged status")
def validate_curve(curve: CalibrationCurve) -> CalibrationCurve:
    """Validate a calibration curve based on R² threshold."""
    new_status = QCStatus.VALIDATED if curve.r_squared >= 0.99 else QCStatus.FLAGGED
    return CalibrationCurve(
        curve_id=curve.curve_id,
        analyst_id=curve.analyst_id,
        readings=curve.readings,
        slope=curve.slope,
        intercept=curve.intercept,
        r_squared=curve.r_squared,
        status=new_status,
        reviewer_notes=curve.reviewer_notes,
        created_at=curve.created_at
    )

@icontract.require(lambda curve: curve.status == QCStatus.VALIDATED,
                   "Can only approve validated curves")
@icontract.require(lambda reviewer_notes: reviewer_notes and reviewer_notes.strip(),
                   "Reviewer notes required for approval")
@icontract.ensure(lambda result: result.status == QCStatus.APPROVED,
                  "Curve must be approved after approval")
@icontract.ensure(lambda result: result.reviewer_notes is not None,
                  "Reviewer notes must be set")
def approve_curve(curve: CalibrationCurve, reviewer_notes: str) -> CalibrationCurve:
    """Approve a validated calibration curve."""
    return CalibrationCurve(
        curve_id=curve.curve_id,
        analyst_id=curve.analyst_id,
        readings=curve.readings,
        slope=curve.slope,
        intercept=curve.intercept,
        r_squared=curve.r_squared,
        status=QCStatus.APPROVED,
        reviewer_notes=reviewer_notes,
        created_at=curve.created_at
    )
With this setup:
- You cannot create a CalibrationCurve that violates any invariant
- You cannot call validate_curve on a non-pending curve
- You cannot call approve_curve without reviewer notes
- If any function returns an invalid CalibrationCurve, you get an immediate error
6.5. Combining Everything
The real power comes from combining all three approaches. Here's a complete test file that demonstrates all three techniques working together:
""" Integration tests demonstrating Polyfactory, Hypothesis, and icontract together. This file is tangled from post-data-model-testing.org and can be run with: pytest test_data_model_integration.py -v """ from typing import Optional from enum import Enum from datetime import datetime import numpy as np import icontract import pytest from pydantic import BaseModel, field_validator from polyfactory.factories.pydantic_factory import ModelFactory from polyfactory import Use from hypothesis import given, strategies as st, settings # ============================================================================= # DOMAIN MODELS (with icontract invariants) # ============================================================================= class QCStatus(str, Enum): PENDING = "pending" VALIDATED = "validated" FLAGGED = "flagged" REJECTED = "rejected" APPROVED = "approved" class CalibrationPoint(BaseModel): concentration: float response: float replicate: int = 1 @field_validator('concentration') @classmethod def validate_concentration(cls, v): if v < 0: raise ValueError('Concentration must be non-negative') return v def r_squared_is_consistent(self) -> bool: """Invariant: R² must match the actual fit.""" if len(self.readings) < 2: return True conc = np.array([r.concentration for r in self.readings]) resp = np.array([r.response for r in self.readings]) pred = self.slope * conc + self.intercept ss_res = np.sum((resp - pred) ** 2) ss_tot = np.sum((resp - np.mean(resp)) ** 2) calc_r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 1.0 return abs(self.r_squared - calc_r2) < 0.001 def approved_has_good_r_squared(self) -> bool: """Invariant: approved curves must have R² >= 0.99.""" if self.status == QCStatus.APPROVED: return self.r_squared >= 0.99 return True @icontract.invariant(lambda self: len(self.readings) >= 2, "Need at least 2 calibration points") @icontract.invariant(r_squared_is_consistent, "R² must match actual fit quality") @icontract.invariant(approved_has_good_r_squared, "Approved curves need R² >= 0.99") class CalibrationCurve(BaseModel): curve_id: str analyst_id: str readings: list[CalibrationPoint] slope: float intercept: float r_squared: float status: QCStatus = QCStatus.PENDING reviewer_notes: Optional[str] = None created_at: datetime class Config: arbitrary_types_allowed = True # ============================================================================= # DOMAIN FUNCTIONS (with icontract pre/post conditions) # ============================================================================= @icontract.require(lambda curve: curve.status == QCStatus.PENDING, "Can only validate pending curves") @icontract.ensure(lambda result: result.status in [QCStatus.VALIDATED, QCStatus.FLAGGED], "Validation must result in validated or flagged status") def validate_curve(curve: CalibrationCurve) -> CalibrationCurve: """Validate a calibration curve based on R² threshold.""" new_status = QCStatus.VALIDATED if curve.r_squared >= 0.99 else QCStatus.FLAGGED return CalibrationCurve( curve_id=curve.curve_id, analyst_id=curve.analyst_id, readings=curve.readings, slope=curve.slope, intercept=curve.intercept, r_squared=curve.r_squared, status=new_status, reviewer_notes=curve.reviewer_notes, created_at=curve.created_at ) @icontract.require(lambda curve: curve.status == QCStatus.VALIDATED, "Can only approve validated curves") @icontract.require(lambda reviewer_notes: reviewer_notes and reviewer_notes.strip(), "Reviewer notes required for approval") @icontract.ensure(lambda result: result.status == QCStatus.APPROVED, "Curve 
must be approved after approval") def approve_curve(curve: CalibrationCurve, reviewer_notes: str) -> CalibrationCurve: """Approve a validated calibration curve.""" return CalibrationCurve( curve_id=curve.curve_id, analyst_id=curve.analyst_id, readings=curve.readings, slope=curve.slope, intercept=curve.intercept, r_squared=curve.r_squared, status=QCStatus.APPROVED, reviewer_notes=reviewer_notes, created_at=curve.created_at ) # ============================================================================= # POLYFACTORY FACTORIES # ============================================================================= class CalibrationPointFactory(ModelFactory): __model__ = CalibrationPoint # Constrain concentration to non-negative values (matching Pydantic validator) concentration = Use(lambda: ModelFactory.__random__.uniform(0, 1000)) class CalibrationCurveFactory(ModelFactory): __model__ = CalibrationCurve @classmethod def build(cls, **kwargs): # Generate readings that produce a valid fit readings = kwargs.get('readings') or [ CalibrationPointFactory.build(concentration=float(i), response=float(i * 2.5 + 1.0)) for i in range(5) ] # Calculate actual fit parameters conc = np.array([r.concentration for r in readings]) resp = np.array([r.response for r in readings]) slope, intercept = np.polyfit(conc, resp, 1) pred = slope * conc + intercept ss_res = np.sum((resp - pred) ** 2) ss_tot = np.sum((resp - np.mean(resp)) ** 2) r_squared = 1 - (ss_res / ss_tot) if ss_tot > 0 else 1.0 return super().build( readings=readings, slope=slope, intercept=intercept, r_squared=r_squared, **{k: v for k, v in kwargs.items() if k not in ['readings', 'slope', 'intercept', 'r_squared']} ) # ============================================================================= # TESTS # ============================================================================= class TestPolyfactoryCoverage: """Tests using Polyfactory's systematic coverage.""" def test_qc_workflow_all_combinations(self): """Test QC workflow with polyfactory coverage - structural edge cases. Note: We iterate over QCStatus values manually because CalibrationCurve has complex invariants (R² consistency, minimum readings) that coverage() can't satisfy automatically. This demonstrates intentional structural coverage of the state machine. 
""" tested_statuses = [] for status in QCStatus: # Build a valid curve with this status curve = CalibrationCurveFactory.build(status=status) tested_statuses.append(status) if curve.status == QCStatus.PENDING: validated = validate_curve(curve) assert validated.status in [QCStatus.VALIDATED, QCStatus.FLAGGED] if validated.status == QCStatus.VALIDATED: approved = approve_curve(validated, "Meets all QC criteria") assert approved.status == QCStatus.APPROVED assert approved.reviewer_notes is not None # Verify we tested all status values assert set(tested_statuses) == set(QCStatus) class TestHypothesisProperties: """Property-based tests using Hypothesis.""" @given(st.builds( CalibrationPoint, concentration=st.floats(min_value=0, max_value=1000, allow_nan=False), response=st.floats(min_value=0, max_value=10000, allow_nan=False), replicate=st.integers(min_value=1, max_value=10) )) @settings(max_examples=50) def test_calibration_point_concentration_non_negative(self, point: CalibrationPoint): """Hypothesis: concentration must be non-negative.""" assert point.concentration >= 0 @given(st.builds( CalibrationPoint, concentration=st.floats(min_value=0, max_value=1000, allow_nan=False), response=st.floats(min_value=0, max_value=10000, allow_nan=False), replicate=st.integers(min_value=1, max_value=10) )) @settings(max_examples=50) def test_calibration_point_response_is_finite(self, point: CalibrationPoint): """Hypothesis: response values are finite numbers.""" assert np.isfinite(point.response) class TestIcontractInvariants: """Tests verifying icontract catches invalid states.""" def test_contracts_catch_invalid_r_squared(self): """Verify contracts catch scientifically invalid R² values.""" readings = [ CalibrationPoint(concentration=1.0, response=2.5), CalibrationPoint(concentration=2.0, response=5.0), ] # Try to create a curve with fake R² that doesn't match the data with pytest.raises(icontract.ViolationError) as exc_info: CalibrationCurve( curve_id="cal-001", analyst_id="analyst-1", readings=readings, slope=2.5, intercept=0.0, r_squared=0.5, # Wrong! Actual R² is ~1.0 created_at=datetime.now() ) assert "R² must match actual fit quality" in str(exc_info.value) def test_contracts_require_minimum_readings(self): """Verify contracts require at least 2 calibration points.""" with pytest.raises(icontract.ViolationError) as exc_info: CalibrationCurve( curve_id="cal-002", analyst_id="analyst-1", readings=[CalibrationPoint(concentration=1.0, response=2.5)], # Only 1! slope=2.5, intercept=0.0, r_squared=1.0, created_at=datetime.now() ) assert "Need at least 2 calibration points" in str(exc_info.value) def test_validate_requires_pending_status(self): """Verify validate_curve requires pending status.""" readings = [ CalibrationPointFactory.build(concentration=float(i), response=float(i * 2.5 + 1.0)) for i in range(5) ] conc = np.array([r.concentration for r in readings]) resp = np.array([r.response for r in readings]) slope, intercept = np.polyfit(conc, resp, 1) curve = CalibrationCurve( curve_id="cal-003", analyst_id="analyst-1", readings=readings, slope=slope, intercept=intercept, r_squared=1.0, status=QCStatus.VALIDATED, # Not pending! created_at=datetime.now() ) with pytest.raises(icontract.ViolationError) as exc_info: validate_curve(curve) assert "Can only validate pending curves" in str(exc_info.value)
Now we run the tests with pytest:
cd ~/projects/lab-data && poetry run pytest test_data_model_integration.py -vvvv -q --disable-warnings --tb=short 2>&1
============================= test session starts ==============================
platform darwin -- Python 3.11.6, pytest-9.0.2, pluggy-1.6.0 -- ~/projects/lab-data/.venv/bin/python
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: ~/projects/lab-data
configfile: pyproject.toml
plugins: Faker-37.11.0, hypothesis-6.142.3
collecting ... collected 6 items

test_data_model_integration.py::TestPolyfactoryCoverage::test_qc_workflow_all_combinations PASSED [ 16%]
test_data_model_integration.py::TestHypothesisProperties::test_calibration_point_concentration_non_negative PASSED [ 33%]
test_data_model_integration.py::TestHypothesisProperties::test_calibration_point_response_is_finite PASSED [ 50%]
test_data_model_integration.py::TestIcontractInvariants::test_contracts_catch_invalid_r_squared PASSED [ 66%]
test_data_model_integration.py::TestIcontractInvariants::test_contracts_require_minimum_readings PASSED [ 83%]
test_data_model_integration.py::TestIcontractInvariants::test_validate_requires_pending_status PASSED [100%]

======================== 6 passed, 3 warnings in 1.40s =========================
6.6. Caveats with icontract
An honest caveat: icontract has the same opacity problem as @model_validator. An @invariant is arbitrary Python – it does not serialize to Avro, JSON Schema, or Protobuf either. Moving the logic out of Pydantic doesn't make it visible to downstream consumers; it just stops it from polluting the schema export. If a downstream service in another language needs to enforce "R² matches the fit", you still have to re-implement it there (or push the check into a shared validation service). This is a real limit of any in-process invariant approach – the alternative is consumer-driven contract testing (Pact and friends), which is a different toolchain entirely and out of scope here.
And a runtime cost. @invariant checks run on every method call on the instance, not just construction. The r_squared_is_consistent check above recomputes R² with numpy on every invocation; on a hot path (a Kafka consumer processing thousands of messages per second) this is a real cost. Note that icontract does not piggyback on assert – it raises ViolationError – but whether python -O strips the checks depends on each contract's enabled argument, which defaults to __debug__. The explicit knobs are that enabled argument, icontract.SLOW (driven by the ICONTRACT_SLOW environment variable) for gating expensive checks, or building two configurations of your application. Either way: if you switch contracts off in prod, your "runtime safety net" is only a test-time safety net, and you need to be honest with yourself about that.
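A minimal sketch of gating the expensive invariant behind icontract.SLOW while keeping the cheap structural check on everywhere (assuming the r_squared_is_consistent helper from above; fields as in the earlier definition):

import icontract
from pydantic import BaseModel

@icontract.invariant(lambda self: len(self.readings) >= 2,
                     "Need at least 2 calibration points")    # cheap: always on
@icontract.invariant(r_squared_is_consistent,
                     "R² must match actual fit quality",
                     enabled=icontract.SLOW)                   # expensive: only when ICONTRACT_SLOW is set
class CalibrationCurve(BaseModel):
    # ... fields as in the earlier definition ...
    ...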
6.7. Alternatives to runtime contract enforcement
Before reaching for runtime contracts, three other approaches deserve a serious look:
- Schema-first with codegen. Treat Avro/Protobuf/JSON Schema as the source of truth and generate Pydantic (or your language's equivalent) from it. Cross-system drift becomes structurally impossible because there's only one definition. This is the right answer when your data crosses many language boundaries. Its weakness is exactly the one this post is about: schema languages can't express cross-field invariants either, so you still need something for "R² must match the fit."
- Consumer-driven contract testing (Pact and friends). Push the enforcement to the boundary between services rather than inside any one of them. This is the right answer when the question is "do producer and consumer agree?" It's the wrong answer when the question is "is this single object internally coherent?", which is what we have here.
- In-process invariants (icontract, @model_validator, plain assertions in __init__). Cheapest to add, lives next to the data, and – as discussed above – invisible to anything outside your Python process.
These are not mutually exclusive. A mature system typically has schema-first definitions at the wire, CDC tests at the service boundaries, and in-process invariants for the rules that live entirely inside one bounded context. This post is about that last layer.
7. Conclusion
The example here is a calibration curve, but the pattern is not domain-specific. Anywhere a data model carries a rule the type system can't express, the same three layers apply:
- An Order where discount <= subtotal, tax is a function of subtotal - discount, and total is the sum. Pydantic accepts any three floats; only a cross-field check rejects the inconsistent invoice.
- An OAuthToken where the granted scopes must be a subset of the client's allowed_scopes, and expires_at > issued_at. Each field is structurally fine in isolation.
- An inventory StockMovement where on_hand_after = on_hand_before + delta. Off-by-one in a service layer produces a "valid" object that silently corrupts every downstream report.
In each case the bug looks the same as the calibration-curve bug: every field passes its own validator, the JSON serialises cleanly, and the wrongness only shows up when you ask whether the fields agree with each other. That is the bug class this stack is for.
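To show the pattern travelling outside the lab, here is a minimal sketch of the Order case with icontract (the field names are the hypothetical ones from the list above):

import icontract
from pydantic import BaseModel

def totals_are_consistent(self) -> bool:
    """total must equal (subtotal - discount) + tax, within a float tolerance."""
    return abs(self.total - ((self.subtotal - self.discount) + self.tax)) < 0.01

@icontract.invariant(lambda self: 0 <= self.discount <= self.subtotal,
                     "Discount cannot exceed subtotal")
@icontract.invariant(totals_are_consistent,
                     "Total must agree with subtotal, discount and tax")
class Order(BaseModel):
    subtotal: float
    discount: float
    tax: float
    total: float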
The three layers, each catching a different class:
| Technique | Catches | When |
|---|---|---|
| Polyfactory | Structural combinations | Test generation |
| Hypothesis | Value-level edge cases | Test execution |
| icontract | Cross-field invariants | Runtime |
Three independent failure modes, three independent tools. Start with polyfactory's coverage() for structural completeness. Add Hypothesis for value-level probing. Use icontract for invariants that can't be expressed in types – the swapped slope and intercept, the discount > subtotal, the StockMovement that doesn't add up.
8. TLDR
TLDR: A modest Pydantic model – a dozen fields with a few enums and optionals – has 7,680 valid structural shapes. Your tests probably cover four of them. This post is a three-layer pattern for closing that gap – and an honest accounting of where each layer does and doesn't earn its keep.
Polyfactory's coverage() automates 1-way structural partition coverage so you stop hand-writing fixtures for "every enum value × every nullable state". Hypothesis adds value-level probing – boundary floats, NaN, unicode – and shrinks failing cases to minimal examples. icontract enforces cross-field invariants (like "R² must match the actual fit") that no type system or serializable schema can express.
The worked example is a scientific calibration curve, with all three tools running in a real pytest file you can copy. The post is also explicit about the limits: equivalence partitioning has been standard since the 1970s, @invariant has the same schema-opacity problem as @model_validator, and runtime contracts cost real CPU on hot paths.
Footnotes:
¹ Data standards are the connective tissue of cross-system integration, and in my experience the right default is to use them all the way down – as the source of truth, not as a downstream artefact. You don't necessarily write the schema files by hand; they can be generated from Python (e.g. from Pydantic models). But if you go that route, it is essential that every constraint your in-memory model enforces also appears in the exported schema. Anything that doesn't survive serialisation is a constraint your downstream consumers cannot see.