Schemas¶
Two ways to define a schema:

- schema_of(df): captures the schema of a live DataFrame as a Python type. By default the resulting type uses subset matching (extra columns are fine); pass subset=False to the @enforce decorator or arm() call to require an exact match.
- Schema subclass (SparkSchema / PandasSchema / PolarsSchema): declare the contract upfront without a DataFrame. Subset matching by default: extra columns are fine. Child classes inherit all parent fields.
Note
Optional[dtype] is documentation: it signals that nulls are expected in
that column. dfguard checks the column dtype but never inspects the data for
null presence. A column passes whether it has zero nulls or all nulls. In
PySpark strict mode (subset=False), the nullable flag in the schema
metadata is compared, but no Spark job is triggered and no data is read.
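The dtype-only behavior described in the note can be seen with plain pandas, independent of dfguard: a float64 column reports the same dtype whether it holds real values or only nulls, so a dtype check alone cannot distinguish the two.

```python
import numpy as np
import pandas as pd

full = pd.DataFrame({"amount": [1.5, 2.5]})
nulls = pd.DataFrame({"amount": [np.nan, np.nan]})

# Both columns report the same dtype; only the data differs.
assert full["amount"].dtype == np.dtype("float64")
assert nulls["amount"].dtype == np.dtype("float64")
```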
schema_of¶
- dfguard.pyspark.dataset.schema_of(df)[source]¶
Capture a DataFrame’s schema as a Python type class.
Returns a class whose isinstance check does exact schema matching: same column names, same types, nothing extra. Assign in PascalCase and use as a type annotation:

RawSchema = schema_of(raw_df)

@enforce
def enrich(df: RawSchema):
    ...  # wrong schema → raises at call site
Capture a new type at each stage that changes the schema:
EnrichedSchema = schema_of(enriched_df)
- dfguard.pandas.dataset.schema_of(df)[source]¶
Capture a DataFrame’s schema as a Python type class.
Returns a class whose isinstance check does exact schema matching: same column names, same dtypes, nothing extra. Assign in PascalCase and use as a type annotation:

RawSchema = schema_of(raw_df)

@enforce
def enrich(df: RawSchema):
    ...  # wrong schema → raises at call site
Capture a new type at each stage that changes the schema:
EnrichedSchema = schema_of(enriched_df)
- dfguard.polars.dataset.schema_of(df)[source]¶
Capture a DataFrame’s schema as a Python type class. Also accepts pl.LazyFrame.
Schema class¶
from pyspark.sql import types as T

import dfguard.pyspark as dfg
from dfguard.pyspark import Optional

class OrderSchema(dfg.SparkSchema):
    order_id = T.LongType()
    amount = T.DoubleType()
    tags = T.ArrayType(T.StringType())
    zip = Optional[T.StringType()]  # nullable

class EnrichedSchema(OrderSchema):  # inherits all fields
    revenue = T.DoubleType()
- class dfguard.pyspark.schema.SparkSchema[source]¶
Declare a DataFrame’s expected shape as a Python class.
Use this when you want to write down a schema without a live DataFrame.
SparkSchema uses subset matching: the DataFrame must have every declared field, but extra columns are fine. This is the opposite of schema_of(df), which requires an exact match (no extra columns).

from pyspark.sql import types as T
from dfguard.pyspark import SparkSchema, Optional, enforce

class OrderSchema(SparkSchema):
    order_id: T.LongType()
    amount: T.DoubleType()
    quantity: T.IntegerType()
    zip: Optional[T.StringType()]  # nullable field

@enforce
def process(df: OrderSchema):
    ...

# A DataFrame with only these columns passes.
# A DataFrame with extra columns also passes (subset matching).
# A DataFrame missing 'order_id' raises immediately.
Child classes inherit all parent fields:
class EnrichedSchema(OrderSchema):
    revenue: T.DoubleType()  # adds revenue, keeps order_id/amount/quantity/zip
Note: use Optional[T.XxxType()] for nullable fields. PySpark DataType instances do not support the X | None union syntax. SubSchema | None works when the field type is a nested SparkSchema subclass (a Python class, not a DataType instance).

- classmethod to_struct()[source]¶
Return the StructType for this schema. Result is cached after the first call.
- classmethod from_struct(struct, name='GeneratedSchema')[source]¶
Create a new SparkSchema subclass from a live StructType.

Nested StructTypes are recursively converted to their own SparkSchema subclasses. Useful for round-tripping or generating typed wrappers for DataFrames whose schema is only known at runtime.
- classmethod validate(df_or_schema, subset=True)[source]¶
Compare a DataFrame or StructType against this schema.
Returns a list of errors; an empty list means the schema is valid. When subset=True (default), extra columns beyond the declared schema are fine. When subset=False, extra columns are also reported as errors.
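The subset vs. strict rule can be sketched in plain Python. The `check` helper below is hypothetical, not dfguard's actual implementation; it only illustrates the comparison contract validate() describes.

```python
# Hypothetical sketch of subset vs. exact column checking.
def check(declared: dict, actual: dict, subset: bool = True) -> list[str]:
    errors = [f"missing column: {c}" for c in declared if c not in actual]
    errors += [
        f"dtype mismatch on {c}: {actual[c]!r} != {declared[c]!r}"
        for c in declared if c in actual and actual[c] != declared[c]
    ]
    if not subset:  # strict mode: undeclared extras are errors too
        errors += [f"unexpected column: {c}" for c in actual if c not in declared]
    return errors

declared = {"order_id": "bigint", "amount": "double"}
actual = {"order_id": "bigint", "amount": "double", "extra": "string"}

assert check(declared, actual) == []  # subset: 'extra' is ignored
assert check(declared, actual, subset=False) == ["unexpected column: extra"]
```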
- classmethod assert_valid(df_or_schema, subset=True, history=None)[source]¶
Like validate but raises SchemaValidationError on the first failure.
- classmethod empty(spark)[source]¶
Return an empty typed dataset instance with this schema and zero rows.
- classmethod to_code()[source]¶
Generate valid Python source code for this schema class.
- classmethod diff(other)[source]¶
Return a human-readable diff between two SparkSchema classes.
- Parameters:
other (type[SparkSchema])
import numpy as np
import pandas as pd

import dfguard.pandas as dfg
from dfguard.pandas import Optional

class OrderSchema(dfg.PandasSchema):
    order_id = np.dtype("int64")
    amount = np.dtype("float64")
    label = pd.StringDtype()
    zip = Optional[pd.StringDtype()]

class EnrichedSchema(OrderSchema):
    revenue = np.dtype("float64")
- class dfguard.pandas.schema.PandasSchema[source]¶
Declare a pandas DataFrame’s expected shape as a Python class.
Use this when you want to write down a schema without a live DataFrame.
PandasSchema uses subset matching: the DataFrame must have every declared column, but extra columns are fine. This mirrors SparkSchema’s contract: declare what matters, ignore what doesn’t.

Annotation form (standard):

import numpy as np
import pandas as pd
from dfguard.pandas import PandasSchema, Optional, enforce

class OrderSchema(PandasSchema):
    order_id: np.dtype("int64")
    amount: np.dtype("float64")
    name: pd.StringDtype()
    active: pd.BooleanDtype()
    tags: list[str]  # object dtype, holds str lists
    zip_code: Optional[pd.StringDtype()]  # nullable

@enforce
def process(df: OrderSchema):
    ...
Assignment form (avoids Pylance reportInvalidTypeForm warnings):

class OrderSchema(PandasSchema):
    order_id = np.dtype("int64")
    amount = np.dtype("float64")
    name = pd.StringDtype()
Column-name aliasing (when the column name is not a valid identifier):
import numpy as np
from dfguard.pandas import PandasSchema, alias

class OrderSchema(PandasSchema):
    first_name = alias("First Name", np.dtype("object"))
    order_id = alias("order-id", np.dtype("int64"))
    revenue = np.dtype("float64")  # no alias needed
Child classes inherit all parent fields:
class EnrichedSchema(OrderSchema):
    revenue: np.dtype("float64")  # adds revenue, keeps all parent fields
Nullable columns¶
Use Optional[dtype] from dfguard.pandas for dtype instances (e.g. pd.StringDtype()), since typing.Optional rejects non-type arguments. For numpy scalar types (e.g. np.int64) both work: Optional[np.int64] and the native np.int64 | None syntax.

Pandas extension dtypes (pd.Int64Dtype(), pd.StringDtype(), etc.) are inherently nullable and do not require wrapping in Optional.

Supported annotations¶
- numpy dtype instances: np.dtype("int64"), np.dtype("float64")
- numpy scalar types: np.int64, np.float32, np.bool_
- pandas extension dtypes: pd.StringDtype(), pd.Int64Dtype(), pd.BooleanDtype()
- Python builtins: int, float, bool, str
- Python generic types: list[str], dict[str, Any] (object dtype)
- Datetime: datetime.datetime, pd.Timestamp
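Several of the annotations above correspond directly to pandas column dtypes, which can be checked with plain pandas and no dfguard involved:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "order_id": pd.Series([1, 2], dtype=np.dtype("int64")),
    "name": pd.Series(["a", "b"], dtype=pd.StringDtype()),
    "active": pd.Series([True, None], dtype=pd.BooleanDtype()),
    "tags": pd.Series([["x"], ["y", "z"]]),  # list[str] -> object dtype
})

assert df["order_id"].dtype == np.dtype("int64")
assert df["name"].dtype == pd.StringDtype()
assert df["active"].dtype == pd.BooleanDtype()
assert df["tags"].dtype == np.dtype("object")
```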
Note
Requires pandas >= 2.0.
- classmethod to_dtype_dict()[source]¶
Return {column_name: dtype} for this schema. Cached after first call.
- classmethod to_struct()¶
Return {column_name: dtype} for this schema. Cached after first call.
- classmethod from_struct(dtypes, name='GeneratedSchema')¶
Create a PandasSchema subclass from a {col: dtype} dict.

Column names that are not valid Python identifiers are automatically sanitized and wrapped with alias().
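The sanitization step might look roughly like the sketch below. The `sanitize` name and exact rules are illustrative assumptions, not dfguard's actual implementation:

```python
import re

# Hypothetical sketch: replace non-identifier characters and guard against
# names that start with a digit.
def sanitize(column: str) -> str:
    candidate = re.sub(r"\W", "_", column)  # non-identifier chars -> "_"
    if not candidate.isidentifier():        # e.g. starts with a digit
        candidate = "_" + candidate
    return candidate

assert sanitize("order-id") == "order_id"
assert sanitize("First Name") == "First_Name"
assert sanitize("2024 sales") == "_2024_sales"
```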
- classmethod validate(df, subset=True)[source]¶
Compare a DataFrame against this schema.
Returns a list of errors; an empty list means the schema is valid. When subset=True (default), extra columns are fine. When subset=False, extra columns are also reported as errors.
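What subset matching means against a real DataFrame can be shown with plain pandas. The comparison below mirrors the rule in ordinary code; it is not dfguard's implementation:

```python
import numpy as np
import pandas as pd

declared = {"order_id": np.dtype("int64"), "amount": np.dtype("float64")}
df = pd.DataFrame({"order_id": [1], "amount": [9.5], "extra": ["x"]})

# Subset rule: every declared column must exist with the declared dtype;
# the undeclared 'extra' column is simply ignored.
errors = [
    c for c in declared
    if c not in df.columns or df[c].dtype != declared[c]
]
assert errors == []
```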
- classmethod assert_valid(df, subset=True, history=None)[source]¶
Like validate but raises SchemaValidationError on failure.
- classmethod to_code()[source]¶
Generate valid Python source code for this schema class.
- classmethod diff(other)[source]¶
Return a human-readable diff between two PandasSchema classes.
- Parameters:
other (type[PandasSchema])
import polars as pl

import dfguard.polars as dfg
from dfguard.polars import Optional

class OrderSchema(dfg.PolarsSchema):
    order_id = pl.Int64
    amount = pl.Float64
    tags = pl.List(pl.String)
    zip = Optional[pl.String]

class EnrichedSchema(OrderSchema):
    revenue = pl.Float64
- class dfguard.polars.schema.PolarsSchema[source]¶
Declare a Polars DataFrame’s expected shape as a Python class.
Uses subset matching: the DataFrame must have every declared column with matching dtype, but extra columns are fine. Use schema_of(df) for exact matching (no extra columns allowed).

Annotation form (standard):

import polars as pl
from dfguard.polars import PolarsSchema, Optional, enforce

class OrderSchema(PolarsSchema):
    order_id: pl.Int64
    amount: pl.Float64
    tags: pl.List(pl.String)  # first-class nested type
    name: Optional[pl.String]  # explicitly nullable

@enforce
def process(df: OrderSchema):
    ...
Assignment form (avoids Pylance reportInvalidTypeForm on instance-form dtypes):

class OrderSchema(PolarsSchema):
    order_id = pl.Int64
    tags = pl.List(pl.String)
Column-name aliasing:
from dfguard.polars import PolarsSchema, alias

class OrderSchema(PolarsSchema):
    first_name = alias("First Name", pl.String)
    order_id = alias("order-id", pl.Int64)
Child classes inherit all parent fields:
class EnrichedSchema(OrderSchema):
    revenue: pl.Float64
Python builtins and generics¶
- int -> pl.Int64
- float -> pl.Float64
- str -> pl.String
- bool -> pl.Boolean
- bytes -> pl.Binary
- list[T] -> pl.List(T)
- dict -> pl.Object

Nullability¶
All Polars columns are physically nullable. Use Optional[T] to declare that nulls are expected. X | None (the native union syntax, Python 3.10+) also works.
Note
Requires Polars >= 1.0.
- classmethod to_struct()[source]¶
Return {col: polars_dtype} for this schema. Cached after first call.
- classmethod from_struct(struct, name='GeneratedSchema')[source]¶
Create a PolarsSchema subclass from a {col: polars_dtype} dict.

Column names that are not valid Python identifiers are automatically sanitized and wrapped with alias().
- classmethod validate(df, subset=True)[source]¶
Compare a DataFrame or LazyFrame against this schema.
Returns a list of errors; an empty list means valid. subset=True (default): extra columns are fine. subset=False: extra columns are also errors.
- classmethod assert_valid(df, subset=True, history=None)[source]¶
Like validate but raises SchemaValidationError on failure.
- classmethod empty()[source]¶
Return an empty Polars DataFrame with this schema and zero rows.
- classmethod to_code()[source]¶
Generate valid Python source code for this schema class.
- classmethod diff(other)[source]¶
Return a human-readable diff between two PolarsSchema classes.
- Parameters:
other (type[PolarsSchema])