Enforcement¶
dfguard enforces schema annotations at the function call site. There are two ways to enable this. Pick one; you do not need both.
arm() / disarm()¶
arm() is the preferred approach for packages. Call it once from your entry
point and every annotated function in the entire package is enforced
automatically. No decorator on each function.
disarm() silences all enforcement globally. Useful in tests where you want
to exercise transform logic without schema-valid fixtures.
# my_pipeline/__init__.py
import dfguard.pyspark as dfg
dfg.arm() # subset=True: extra columns fine
# dfg.arm(subset=False) # strict: exact match everywhere
- dfguard.pyspark._enforcement.arm(module=None, *, package=None, subset=True)[source]¶
Arm the entire calling package and set the global subset default.
Call once from your entry point, __init__.py, or settings.py (Kedro):

import dfguard.pyspark as dfg
dfg.arm()              # subset=True (default): extra columns are fine
dfg.arm(subset=False)  # exact match: no extra columns allowed anywhere

The subset value becomes the global default. Individual functions decorated with @dfg.enforce(subset=...) override it for that function only.

If called when already armed, re-enables enforcement (sets _ENABLED = True) without re-walking the package.

Specific module object:
dfg.arm(my_module)
Explicit package name:
dfg.arm(package="my_pipeline.nodes")
# my_pipeline/__init__.py
import dfguard.pandas as dfg
dfg.arm()
- dfguard.pandas._enforcement.arm(module=None, *, package=None, subset=True)¶
# my_pipeline/__init__.py
import dfguard.polars as dfg
dfg.arm()
- dfguard.polars._enforcement.arm(module=None, *, package=None, subset=True)¶
arm() has no effect and emits a warning when called from __main__
(a file run directly as a script). Use @enforce there instead.
@enforce¶
A per-function decorator for scripts and notebooks. Only checks parameters annotated with a schema type; all other arguments pass through untouched.
@dfg.enforce
def enrich(df: OrderSchema, label: str, limit: int = 10):
# only df is checked; label and limit are not touched
return df.withColumn("revenue", F.col("amount") * F.col("quantity"))
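The parameter-selection behavior described above can be sketched in plain Python: the decorator inspects each parameter's annotation and validates only those carrying a schema. This is a hypothetical mock, not dfguard's code; the `__columns__` attribute, the dict-as-DataFrame stand-in, and the error message are all assumptions for illustration.

```python
import inspect

# Hypothetical schema type: a class exposing its declared columns.
class OrderSchema:
    __columns__ = {"amount", "quantity"}


def enforce(func):
    """Toy @enforce: validate only schema-annotated parameters."""
    sig = inspect.signature(func)

    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            ann = sig.parameters[name].annotation
            cols = getattr(ann, "__columns__", None)
            if cols is not None:  # schema-annotated parameter
                missing = cols - set(value.keys())
                if missing:
                    raise TypeError(f"{name} missing columns: {sorted(missing)}")
        return func(*args, **kwargs)
    return wrapper


@enforce
def enrich(df: OrderSchema, label: str, limit: int = 10):
    # only df is checked; label and limit pass through untouched
    return {**df, "revenue": df["amount"] * df["quantity"]}


# a dict stands in for a DataFrame in this sketch
print(enrich({"amount": 5, "quantity": 3}, "x"))
```

Note that `label: str` and `limit: int` have annotations too, but because neither carries schema information they are never inspected, matching the "all other arguments pass through untouched" guarantee.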
- dfguard.pyspark._enforcement.enforce(func=None, *, subset=<object object>)[source]¶
- Overloads:
func (F) → F
func (None), subset (bool) → Callable[[F], F]
- Parameters:
func (F | None)
subset (Any)
- Return type:
F | Callable[[F], F]
Validate schema annotations on DataFrame arguments.
Only intercepts parameters annotated with a dfg.schema_of type or a dfg.SparkSchema subclass. All other arguments are left completely alone.

Default: inherits the global subset set by dfg.arm():

@dfg.enforce
def process(df: OrderSchema, label: str): ...
subset=True: extra columns in the DataFrame are fine (overrides global):

@dfg.enforce(subset=True)
def process(df: OrderSchema): ...

subset=False: DataFrame must match the schema exactly (overrides global):

@dfg.enforce(subset=False)
def process(df: OrderSchema): ...
@dfg.enforce
def enrich(df: OrderSchema, label: str):
return df.assign(revenue=df["amount"] * df["quantity"])
@dfg.enforce
def enrich(df: OrderSchema, label: str):
return df.with_columns(revenue=pl.col("amount") * pl.col("quantity"))
The subset flag¶
Both arm() and @enforce accept a subset parameter.
- subset=True (default): declared columns must be present with correct types; extra columns in the DataFrame are fine.
- subset=False: exact match required; extra columns are an error.
arm(subset=False) sets the global default. @enforce(subset=True) overrides
it for that function only. Function level always wins.
schema_of(df) types always use exact matching, regardless of subset.
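The two subset modes reduce to a simple set comparison over column names. The sketch below is an assumption about the shape of the check, not dfguard's implementation: it models a schema as a plain dict of column name to dtype string and omits the dtype comparison that the real library performs.

```python
# Sketch of the subset check; a schema is modeled as {column: dtype}.
# Dtype validation (which dfguard also does) is omitted for brevity.
def check(schema, df_columns, subset=True):
    declared = set(schema)
    actual = set(df_columns)

    missing = declared - actual
    if missing:  # declared columns must always be present
        raise TypeError(f"missing columns: {sorted(missing)}")

    extra = actual - declared
    if extra and not subset:  # exact-match mode rejects extras
        raise TypeError(f"unexpected extra columns: {sorted(extra)}")


schema = {"amount": "double", "quantity": "int"}

check(schema, ["amount", "quantity", "note"], subset=True)  # extra column ok
try:
    check(schema, ["amount", "quantity", "note"], subset=False)
except TypeError as e:
    print(e)  # unexpected extra columns: ['note']
```

Missing columns fail in both modes; only the handling of extra columns differs, which is why subset=False is the stricter setting.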