Part 3. Data Validation with Pytest
Continuing the data validation from the prior part, this time using pytest. Before reaching for pytest, I had written a standard Python script to check my data. It works well, actually, but I wanted to see how the same checks feel in pytest.
Here are some of my learnings.
Key Syntax & Concepts
Fixtures (@pytest.fixture)
Fixtures are setup functions; I use them to avoid in-place data updates, which happen a lot in DS work.
- scope="session": This is critical for performance. It tells pytest to load the dataframe once and keep it in memory for the entire test run. Without this, pytest would reload the Parquet file 50 times if I have 50 tests.
- autouse=True: Automatically applies the fixture to every test without needing to request it as an argument. I used this for the Polars display config so tables always look nice in logs (sketch below).
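The display fixture is tiny; the full annotated version appears later in this post, but in isolation it is just:

```python
import polars as pl
import pytest

@pytest.fixture(scope="session", autouse=True)
def set_polars_display_settings():
    # applied to every test automatically; no test needs to request it
    pl.Config.set_ascii_tables(True)
    pl.Config.set_tbl_rows(20)
```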
```python
@pytest.fixture(scope="session")
def df():
    # only reads disk once!
    return pl.read_parquet("sample_data.parquet")
```
Parametrization (@pytest.mark.parametrize)
This is the engine of the validation suite. It allows us to write one test function but run it multiple times with different inputs.
How I used it:
- Generate the list of cases: I used list comprehensions to extract rules from the YAML schema.
```python
# creates a list like: [('age', 0), ('salary', 20000)]
min_val_rules = get_rules("min")
```
- Decorate the test:
@pytest.mark.parametrize("col, min_val", min_val_rules)def test_min_val(df, col, min_val): # this function runs once for every item in min_val_rules ...Syntax Deep Dive: @pytest.mark.parametrize
The syntax can be confusing because it links strings to function arguments.
There are two main arguments you must provide:
- argnames (string): A comma-separated string identifying the variable names.
- argvalues (list): A list of data.
```python
import pytest

#                       (1) THE NAMES             (2) THE DATA
#                             │                        │
#                             ▼                        ▼
@pytest.mark.parametrize("param_name, another_param", [
    (value1_a, value1_b),
    (value2_a, value2_b),
    # ... more test cases
])
def test_function(param_name, another_param):
    #                 ▲              ▲
    #                 │              │
    #          (3) MUST MATCH EXACTLY
    assert param_name + another_param > 0
```
Best Practices for Data Validation
- Fail Fast vs. Collect Errors: pytest runs every test and collects all failures by default, which is what you want for data validation (you see every broken column in one run); pass -x (--exitfirst) when you'd rather stop at the first failure.
- Schema as Code: Keeping rules in schema.yaml makes them readable for non-coders (like PMs or stakeholders); a sketch of the file's shape follows this list.
- Sanity Check the Fixture: My df fixture includes a try/except block. If the input file doesn't exist, pytest.fail stops the whole suite immediately, saving time.
- Display Settings: Setting pl.Config.set_tbl_rows(20) in the autouse fixture ensures that if I print a dataframe in a failed test for debugging, I can actually see the data in the CI/CD logs.
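For reference, here is a plausible shape for schema.yaml, reverse-engineered from how the tests below read it. The age/salary min values come from the earlier example; everything else (dtype names, the status column) is an illustrative guess, not the real file:

```yaml
# Assumed shape of schema.yaml, inferred from how the tests consume it.
columns:
  age:
    dtype: int
    min: 0
    nullable: false
  salary:
    dtype: float
    min: 20000
  status:
    dtype: str
    allowed_values: [active, inactive]
```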
Next Step: “Source of Truth” Comparison (The Anti-Join Pattern)
When validating a dataset against a “Gold Source” (or previous production’s run), we want to avoid slow Python loops. In Polars, the efficient way to do “record-by-record” comparison is using Anti-Joins.
Concept: An anti-join returns the rows from the left dataframe that have no matching row in the right dataframe.
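A toy example of the behavior (the dataframes here are made up purely for illustration):

```python
import polars as pl

left = pl.DataFrame({"id": [1, 2, 3]})
right = pl.DataFrame({"id": [1, 2]})

# keeps only the left rows whose "id" has no match in right: the row with id 3
print(left.join(right, on="id", how="anti"))
```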
The Strategy
- Check 1 (completeness): Do I have keys that shouldn't be there? (Anti-join on the primary key.)
  - Sometimes this flips into: "I only want to test these 10k observations, identified by a common key"; see the semi-join sketch after this list.
- Check 2 (correctness): Do I have rows where the data doesn't match the source? (Anti-join on all columns, or sometimes a subset of them.)
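For the "10k observations" case, a semi-join (the keep-matches counterpart of the anti-join) is one way to cut the data down first. This is only a sketch; keys_to_test and its file name are hypothetical:

```python
import polars as pl

df = pl.read_parquet("sample_data.parquet")

# hypothetical file holding the ~10k keys we actually want to validate
keys_to_test = pl.read_parquet("keys_to_test.parquet")

# semi-join keeps the df rows whose key appears in keys_to_test, drops the rest
df_subset = df.join(keys_to_test, on=["id", "date"], how="semi")
```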
Code Implementation
```python
@pytest.fixture(scope="session")
def source_df():
    # load the "gold source" or "truth" file
    return pl.read_parquet("source_of_truth.parquet")

def test_reconciliation_exact_match(df, source_df):
    """
    Validates that rows in 'df' match 'source_df' exactly for common keys.
    """
    # define your primary keys (what makes a row unique?)
    primary_keys = ["id", "date"]

    # define value columns (what data are we comparing?)
    # taking the intersection of columns ensures we only compare what exists in both
    value_cols = [
        c for c in df.columns
        if c in source_df.columns and c not in primary_keys
    ]

    # ---------------------------------------------------------
    # CHECK 1: Unexpected Records (Phantom Keys)
    # Are there IDs in my new data that don't exist in the source?
    # ---------------------------------------------------------
    unexpected_rows = df.join(source_df, on=primary_keys, how="anti")

    assert unexpected_rows.height == 0, \
        f"Found {unexpected_rows.height} records with IDs not present in Source of Truth.\n{unexpected_rows.head()}"

    # ---------------------------------------------------------
    # CHECK 2: Value Mismatches
    # Join on EVERYTHING. If a row remains, it means the combination
    # of (Key + Values) in 'df' was not found in 'source_df'.
    # ---------------------------------------------------------
    # Note: this finds rows that are "different", but implies the key exists
    # (since we passed Check 1).
    comparison_cols = primary_keys + value_cols

    mismatched_rows = df.join(source_df, on=comparison_cols, how="anti")

    # if this fails, it prints the rows from 'df' that are wrong
    assert mismatched_rows.height == 0, \
        f"Found {mismatched_rows.height} records where values differ from Source.\n{mismatched_rows.head()}"
```
Full Code Annotation & Design Patterns
Summary of Approach
- Config-Driven Testing: The logic is decoupled from the data. Adding a new column check requires editing schema.yaml, not the Python code.
- Performance First: The df fixture uses scope="session" to load the Parquet file exactly once, rather than reloading it for every single test case (which would be slow for large datasets). To be fair, this is partly just the pytest way of doing things; in practice, a module-level global variable often works fine too.
- Dynamic Parametrization: We generate test cases programmatically. Pytest sees the list of rules before it even runs the tests, allowing it to report "50 tests passed" rather than "1 loop passed".
Annotated Implementation
```python
import pytest
import polars as pl
import yaml

# NOTE: Good UX tweak.
# 'autouse=True' means I don't need to pass this into every test function.
# Setting ASCII tables ensures that if a test fails in a CI/CD pipeline (like
# GitHub Actions), the dataframe printout remains readable in the text logs.
@pytest.fixture(scope="session", autouse=True)
def set_polars_display_settings():
    pl.Config.set_ascii_tables(True)
    pl.Config.set_tbl_rows(20)

# NOTE: Global load.
# We load the schema outside fixtures so 'parametrize' decorators can access it
# during the "collection phase" (before tests run).
with open("schema.yaml") as f:
    SCHEMA = yaml.safe_load(f)

@pytest.fixture(scope="session")
def df():
    try:
        return pl.read_parquet("sample_data.parquet")
    except Exception as e:
        # TRICK: Fail fast.
        # If the file is missing, don't just error out; explicitly fail the suite
        # with a clear message. This stops pytest from running 100 tests on a NoneType.
        pytest.fail(f"test fail due to {e}")

# ... (get_pl_dtype helper) ...
```
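The helper itself is elided above; a minimal sketch of what it might look like, assuming the schema stores dtype names as short strings (the mapping below is my guess, not the original code):

```python
import polars as pl

# hypothetical mapping from schema dtype names to acceptable Polars dtypes;
# each entry is a list so a test can accept several concrete widths
DTYPE_MAP = {
    "int": [pl.Int32, pl.Int64],
    "float": [pl.Float32, pl.Float64],
    "str": [pl.Utf8],
}

def get_pl_dtype(name: str) -> list:
    return DTYPE_MAP[name]
```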
```python
# NOTE: Generator Pattern.
# We create a list of tuples [(col, props), ...] here.
# Pytest uses this list to generate individual test cases.
dtype_cases = [(col, props) for col, props in SCHEMA["columns"].items()]

@pytest.mark.parametrize("col, props", dtype_cases)
def test_column_presence_type(col, props, df):
    # Check 1: Existence
    assert col in df.columns, f"column missing: {col}"

    # Check 2: Type Safety
    # We use a list for expected_dtype (e.g., [Int32, Int64]) to allow flexibility,
    # because Polars might infer Int32 while we are okay with Int64.
    expected_dtype = get_pl_dtype(props["dtype"])
    actual_dtype = df[col].dtype

    assert actual_dtype in expected_dtype, \
        f"Type mismatch: {col}. Got {actual_dtype}, expected one of {expected_dtype}"
```
```python
# ... (get_rules helper) ...
```
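This one is also elided; assuming the rule values live directly under each column entry in SCHEMA, get_rules could plausibly be:

```python
def get_rules(rule: str) -> list[tuple]:
    # hypothetical reconstruction: collect (column, value) pairs for every
    # column whose schema entry defines this rule,
    # e.g. get_rules("min") -> [('age', 0), ('salary', 20000)]
    return [
        (col, props[rule])
        for col, props in SCHEMA["columns"].items()
        if rule in props
    ]
```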
```python
@pytest.mark.parametrize("col, min_val", get_rules("min"))
def test_min_val(df, col, min_val):
    # PERFORMANCE TIP:
    # use Polars' native .min() (Rust engine) instead of iterating rows in Python
    actual = df[col].min()
    assert actual >= min_val, f"Column '{col}' min check failed."
```
```python
# ... (max check) ...
```
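The max check is elided in the original; an assumed mirror of the min test would be:

```python
@pytest.mark.parametrize("col, max_val", get_rules("max"))
def test_max_val(df, col, max_val):
    # assumed symmetric counterpart to test_min_val above
    actual = df[col].max()
    assert actual <= max_val, f"Column '{col}' max check failed."
```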
```python
# NOTE: Explicit is better than implicit.
# We explicitly look for 'nullable: False' rather than assuming the default is True.
nullable_cases = [
    (col, props["nullable"])
    for col, props in SCHEMA["columns"].items()
    if "nullable" in props and props["nullable"] is False
]

@pytest.mark.parametrize("col, is_nullable", nullable_cases)
def test_no_nulls_allowed(df, col, is_nullable):
    # Polars' .null_count() is essentially instant vs. looping in Python
    null_count = df[col].null_count()
    assert null_count == 0, f"Column '{col}' has {null_count} nulls."
```
```python
@pytest.mark.parametrize("col, allowed_values", get_rules("allowed_values"))
def test_allowed_values(df, col, allowed_values):
    # ALGORITHM: Set Difference.
    # Instead of checking "if x in allowed", we do Set(Actual) - Set(Allowed).
    # Any remainder implies an illegal value.
    expected_values = set(allowed_values)
    actual_values = set(df[col].unique())
    illegal = actual_values - expected_values

    assert len(illegal) == 0, \
        f"Column '{col}' has illegal values: {illegal}"
```
Attachment
Files used: