Part 2. Continuing the Data Validation Pipeline with YAML and Polars


Context

In addition to the data validation checks, sometimes we want to perform data conversions so that the downstream process can actually work.

The idea is to add an additional step that checks for aliases of a variable, if they're specified in the .yaml schema file.

Updated YAML schema.yaml

schema:
  u_id:
    type: integer
    # aliases: ["id", "user_id"]
    range: [1, 200]
  mssubclass_v1:
    type: integer
    aliases: ["mssubclass"]
    allowed_value: [60, 20, 70, 80]

A few things:

  • By default, if there’s an alias, the rename and data type conversion will be performed.
    • Simply comment it out, and the conversion won’t happen.
  • Obviously, there can be another step here to validate the .yaml schema file itself.
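To make the alias behavior concrete, here is a small sketch (not from the original post) of how the schema above reads back after `yaml.safe_load`: a commented-out `aliases` line is simply absent from the parsed dict, so a `.get("aliases", [])` lookup gives a safe default and no rename is attempted.

```python
import yaml

# inline copy of the schema above (u_id's aliases line is commented out)
schema_text = """
schema:
  u_id:
    type: integer
    # aliases: ["id", "user_id"]
    range: [1, 200]
  mssubclass_v1:
    type: integer
    aliases: ["mssubclass"]
    allowed_value: [60, 20, 70, 80]
"""

schema = yaml.safe_load(schema_text)["schema"]

# commented-out aliases never reach the parsed dict,
# so .get() with a default keeps the lookup safe
print(schema["u_id"].get("aliases", []))       # []
print(schema["mssubclass_v1"].get("aliases"))  # ['mssubclass']
```

This is also why "commenting it off" is enough to disable the conversion: the downstream code only acts when the key is present.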

A detailed yaml file with a few other validation scenarios can be found: here

Updated validate.py

In the validation script, we will add a class to perform the column renaming and type conversion.

class DataHarmonizer:
    def __init__(self, schema_path: str):
        self.schema = self._load_schema(schema_path).get("schema", {})
        # when the schema says it's a `string`,
        # the column will be checked against `pl.String`
        self.type_map = {
            "string": pl.String,
            "float": pl.Float64,
            "integer": pl.Int64,
            "date": pl.Date,
            "boolean": pl.Boolean,
        }
    ...
    def _load_schema(self, schema_path):
        with open(schema_path) as f:
            return yaml.safe_load(f)

    def harmonize(self, df: pl.DataFrame) -> pl.DataFrame:
        ...
        rename_map = {}
        df_cols = set(df.columns)
        for col, props in self.schema.items():
            # this only happens when the column is not found
            if col not in df_cols:
                aliases = props.get("aliases", [])
                for alias in aliases:
                    # if any of the aliases is found in the df columns, the rename will happen
                    if alias in df_cols:
                        rename_map[alias] = col
                        break  # stop after finding the first match
        if rename_map:
            df = df.rename(rename_map)
        # here the type conversion happens
        for col, props in self.schema.items():
            if col in df.columns:
                # find the pl type
                target_type_str = props.get("type")
                target_pl_type = self.type_map.get(target_type_str)
                # compare whether the current type and the target pl type are the same
                if target_pl_type:
                    current_type = df[col].dtype
                    if current_type != target_pl_type:
                        # before the conversion happens, track the null count as our failure criterion
                        nulls_before = df[col].null_count()
                        # strict=False makes sure the cast runs without crashing;
                        # values that fail to cast become null
                        df = df.with_columns(pl.col(col).cast(target_pl_type, strict=False))
                        nulls_after = df[col].null_count()
                        # check the failure count
                        failed_rows = nulls_after - nulls_before
                        if failed_rows > 0:
                            ...  # log here
        return df

The new class looks for columns that need to be renamed, based on the aliases attribute in the schema file, and then casts them to their target types.

After this, the validator class stays mostly the same. The DataHarmonizer class actually makes it simpler.

The full Python file can be found: here

Next Steps

  • Use a utility to check the validity of the schema.yaml file itself
  • Bring in argparse if needed