Schemas¶
Two ways to define a schema:

- schema_of(df): captures the schema of a live DataFrame as a Python type. By default the resulting type uses subset matching (extra columns are fine); pass subset=False to the @enforce decorator or arm() call to require an exact match.
- Schema subclass (SparkSchema / PandasSchema / PolarsSchema): declare the contract upfront without a DataFrame. Subset matching by default: extra columns are fine. Child classes inherit all parent fields.
Note
Optional[dtype] is documentation: it signals that nulls are expected in
that column. dfguard checks the column dtype but never inspects the data for
null presence. A column passes whether it has zero nulls or all nulls. In
PySpark strict mode (subset=False), the nullable flag in the schema
metadata is compared, but no Spark job is triggered and no data is read.
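The dtype-only behavior described in the note can be seen with plain pandas, independent of dfguard: a float64 column reports the same dtype whether it holds real values or only nulls, so a dtype check alone cannot distinguish the two.

```python
import numpy as np
import pandas as pd

full = pd.DataFrame({"amount": [1.5, 2.5]})
nulls = pd.DataFrame({"amount": [np.nan, np.nan]})

# Both columns report the same dtype; only the data differs.
assert full["amount"].dtype == np.dtype("float64")
assert nulls["amount"].dtype == np.dtype("float64")
```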
schema_of¶
- dfguard.pyspark.dataset.schema_of(df)[source]¶
Capture a DataFrame’s schema as a Python type class.
Returns a class whose isinstance check does exact schema matching: same column names, same types, nothing extra. Assign in PascalCase and use as a type annotation:

RawSchema = schema_of(raw_df)

@enforce
def enrich(df: RawSchema):
    ...  # wrong schema → raises at call site
Capture a new type at each stage that changes the schema:
EnrichedSchema = schema_of(enriched_df)
- dfguard.pandas.dataset.schema_of(df)[source]¶
Capture a DataFrame’s schema as a Python type class.
Returns a class whose isinstance check does exact schema matching: same column names, same dtypes, nothing extra. Assign in PascalCase and use as a type annotation:

RawSchema = schema_of(raw_df)

@enforce
def enrich(df: RawSchema):
    ...  # wrong schema → raises at call site
Capture a new type at each stage that changes the schema:
EnrichedSchema = schema_of(enriched_df)
- dfguard.polars.dataset.schema_of(df)[source]¶
Capture a DataFrame’s schema as a Python type class. Also accepts pl.LazyFrame.
Schema class¶
from pyspark.sql import types as T

import dfguard.pyspark as dfg
from dfguard.pyspark import Optional

class OrderSchema(dfg.SparkSchema):
    order_id = T.LongType()
    amount = T.DoubleType()
    tags = T.ArrayType(T.StringType())
    zip = Optional[T.StringType()]  # nullable

class EnrichedSchema(OrderSchema):  # inherits all fields
    revenue = T.DoubleType()
- class dfguard.pyspark.schema.SparkSchema[source]¶
Declare a DataFrame’s expected shape as a Python class.
Use this when you want to write down a schema without a live DataFrame.
SparkSchema uses subset matching: the DataFrame must have every declared field, but extra columns are fine. This is the opposite of schema_of(df), which requires an exact match (no extra columns).

from pyspark.sql import types as T
from dfguard.pyspark import SparkSchema, Optional, enforce

class OrderSchema(SparkSchema):
    order_id: T.LongType()
    amount: T.DoubleType()
    quantity: T.IntegerType()
    zip: Optional[T.StringType()]  # nullable field

@enforce
def process(df: OrderSchema):
    ...

# A DataFrame with only these columns passes.
# A DataFrame with extra columns also passes (subset matching).
# A DataFrame missing 'order_id' raises immediately.
Child classes inherit all parent fields:
class EnrichedSchema(OrderSchema):
    revenue: T.DoubleType()  # adds revenue, keeps order_id/amount/quantity/zip
Note: use Optional[T.XxxType()] for nullable fields. PySpark DataType instances do not support the X | None union syntax. SubSchema | None works when the field type is a nested SparkSchema subclass (a Python class, not a DataType instance).

- classmethod to_struct()[source]¶
Return the StructType for this schema. Result is cached after the first call.
- classmethod from_struct(struct, name='GeneratedSchema')[source]¶
Create a new SparkSchema subclass from a live StructType.

Nested StructTypes are recursively converted to their own SparkSchema subclasses. Useful for round-tripping or generating typed wrappers for DataFrames whose schema is only known at runtime.
- classmethod validate(df_or_schema, subset=True)[source]¶
Compare a DataFrame or StructType against this schema.
Returns a list of errors; an empty list means the schema is valid. When subset=True (default), extra columns beyond the declared schema are fine. When subset=False, extra columns are also reported as errors.
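The subset vs. strict rule can be sketched in plain Python. The `check` helper below is hypothetical, not dfguard's actual implementation; it only illustrates the comparison contract validate() describes.

```python
# Hypothetical sketch of subset vs. exact column checking.
def check(declared: dict, actual: dict, subset: bool = True) -> list[str]:
    errors = [f"missing column: {c}" for c in declared if c not in actual]
    errors += [
        f"dtype mismatch on {c}: {actual[c]!r} != {declared[c]!r}"
        for c in declared if c in actual and actual[c] != declared[c]
    ]
    if not subset:  # strict mode: undeclared extras are errors too
        errors += [f"unexpected column: {c}" for c in actual if c not in declared]
    return errors

declared = {"order_id": "bigint", "amount": "double"}
actual = {"order_id": "bigint", "amount": "double", "extra": "string"}

assert check(declared, actual) == []  # subset: 'extra' is ignored
assert check(declared, actual, subset=False) == ["unexpected column: extra"]
```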
- classmethod assert_valid(df_or_schema, subset=True, history=None)[source]¶
Like validate but raises SchemaValidationError on the first failure.
- classmethod empty(spark)[source]¶
Return an empty typed dataset instance with this schema and zero rows.
- classmethod to_code()[source]¶
Generate valid Python source code for this schema class.
- classmethod diff(other)[source]¶
Return a human-readable diff between two SparkSchema classes.
- Parameters:
other (type[SparkSchema])
import numpy as np
import pandas as pd

import dfguard.pandas as dfg
from dfguard.pandas import Optional

class OrderSchema(dfg.PandasSchema):
    order_id = np.dtype("int64")
    amount = np.dtype("float64")
    label = pd.StringDtype()
    zip = Optional[pd.StringDtype()]

class EnrichedSchema(OrderSchema):
    revenue = np.dtype("float64")
- class dfguard.pandas.schema.PandasSchema[source]¶
Declare a pandas DataFrame’s expected shape as a Python class.
Use this when you want to write down a schema without a live DataFrame.
PandasSchema uses subset matching: the DataFrame must have every declared column, but extra columns are fine. This mirrors SparkSchema’s contract: declare what matters, ignore what doesn’t.

Annotation form (standard):

import numpy as np
import pandas as pd
from dfguard.pandas import PandasSchema, Optional, enforce

class OrderSchema(PandasSchema):
    order_id: np.dtype("int64")
    amount: np.dtype("float64")
    name: pd.StringDtype()
    active: pd.BooleanDtype()
    tags: list[str]  # object dtype, holds str lists
    zip_code: Optional[pd.StringDtype()]  # nullable

@enforce
def process(df: OrderSchema):
    ...
Assignment form (avoids Pylance reportInvalidTypeForm warnings):

class OrderSchema(PandasSchema):
    order_id = np.dtype("int64")
    amount = np.dtype("float64")
    name = pd.StringDtype()
Column-name aliasing (when the column name is not a valid identifier):
import numpy as np
from dfguard.pandas import PandasSchema, alias

class OrderSchema(PandasSchema):
    first_name = alias("First Name", np.dtype("object"))
    order_id = alias("order-id", np.dtype("int64"))
    revenue = np.dtype("float64")  # no alias needed
Child classes inherit all parent fields:
class EnrichedSchema(OrderSchema):
    revenue: np.dtype("float64")  # adds revenue, keeps all parent fields
Nullable columns¶
Use Optional[dtype] from dfguard.pandas for dtype instances (e.g. pd.StringDtype()), since typing.Optional rejects non-type arguments. For numpy scalar types (e.g. np.int64) both work: Optional[np.int64] and the native np.int64 | None syntax.

Pandas extension dtypes (pd.Int64Dtype(), pd.StringDtype(), etc.) are inherently nullable and do not require wrapping in Optional.

Supported annotations¶
- numpy dtype instances: np.dtype("int64"), np.dtype("float64")
- numpy scalar types: np.int64, np.float32, np.bool_
- pandas extension dtypes: pd.StringDtype(), pd.Int64Dtype(), pd.BooleanDtype()
- Python builtins: int, float, bool, str
- Python generic types: list[str], dict[str, Any] (object dtype)
- Datetime: datetime.datetime, pd.Timestamp
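Several of the annotations above correspond directly to pandas column dtypes, which can be checked with plain pandas and no dfguard involved:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "order_id": pd.Series([1, 2], dtype=np.dtype("int64")),
    "name": pd.Series(["a", "b"], dtype=pd.StringDtype()),
    "active": pd.Series([True, None], dtype=pd.BooleanDtype()),
    "tags": pd.Series([["x"], ["y", "z"]]),  # list[str] -> object dtype
})

assert df["order_id"].dtype == np.dtype("int64")
assert df["name"].dtype == pd.StringDtype()
assert df["active"].dtype == pd.BooleanDtype()
assert df["tags"].dtype == np.dtype("object")
```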
Note
Requires pandas >= 2.0.
- classmethod to_dtype_dict()[source]¶
Return {column_name: dtype} for this schema. Cached after first call.
- classmethod to_struct()¶
Return {column_name: dtype} for this schema. Cached after first call.
- classmethod from_struct(dtypes, name='GeneratedSchema')¶
Create a PandasSchema subclass from a {col: dtype} dict.

Column names that are not valid Python identifiers are automatically sanitized and wrapped with alias().
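The sanitization step might look roughly like the sketch below. The `sanitize` name and exact rules are illustrative assumptions, not dfguard's actual implementation:

```python
import re

# Hypothetical sketch: replace non-identifier characters and guard against
# names that start with a digit.
def sanitize(column: str) -> str:
    candidate = re.sub(r"\W", "_", column)  # non-identifier chars -> "_"
    if not candidate.isidentifier():        # e.g. starts with a digit
        candidate = "_" + candidate
    return candidate

assert sanitize("order-id") == "order_id"
assert sanitize("First Name") == "First_Name"
assert sanitize("2024 sales") == "_2024_sales"
```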
- classmethod validate(df, subset=True)[source]¶
Compare a DataFrame against this schema.
Returns a list of errors; an empty list means the schema is valid. When subset=True (default), extra columns are fine. When subset=False, extra columns are also reported as errors.
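What subset matching means against a real DataFrame can be shown with plain pandas. The comparison below mirrors the rule in ordinary code; it is not dfguard's implementation:

```python
import numpy as np
import pandas as pd

declared = {"order_id": np.dtype("int64"), "amount": np.dtype("float64")}
df = pd.DataFrame({"order_id": [1], "amount": [9.5], "extra": ["x"]})

# Subset rule: every declared column must exist with the declared dtype;
# the undeclared 'extra' column is simply ignored.
errors = [
    c for c in declared
    if c not in df.columns or df[c].dtype != declared[c]
]
assert errors == []
```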
- classmethod assert_valid(df, subset=True, history=None)[source]¶
Like validate but raises SchemaValidationError on failure.
- classmethod to_code()[source]¶
Generate valid Python source code for this schema class.
- classmethod diff(other)[source]¶
Return a human-readable diff between two PandasSchema classes.
- Parameters:
other (type[PandasSchema])
import polars as pl

import dfguard.polars as dfg
from dfguard.polars import Optional

class OrderSchema(dfg.PolarsSchema):
    order_id = pl.Int64
    amount = pl.Float64
    tags = pl.List(pl.String)
    zip = Optional[pl.String]

class EnrichedSchema(OrderSchema):
    revenue = pl.Float64
- class dfguard.polars.schema.PolarsSchema[source]¶
Declare a Polars DataFrame’s expected shape as a Python class.
Uses subset matching: the DataFrame must have every declared column with matching dtype, but extra columns are fine. Use schema_of(df) for exact matching (no extra columns allowed).

Annotation form (standard):

import polars as pl
from dfguard.polars import PolarsSchema, Optional, enforce

class OrderSchema(PolarsSchema):
    order_id: pl.Int64
    amount: pl.Float64
    tags: pl.List(pl.String)  # first-class nested type
    name: Optional[pl.String]  # explicitly nullable

@enforce
def process(df: OrderSchema):
    ...
Assignment form (avoids Pylance reportInvalidTypeForm on instance-form dtypes):

class OrderSchema(PolarsSchema):
    order_id = pl.Int64
    tags = pl.List(pl.String)
Column-name aliasing:
from dfguard.polars import PolarsSchema, alias

class OrderSchema(PolarsSchema):
    first_name = alias("First Name", pl.String)
    order_id = alias("order-id", pl.Int64)
Child classes inherit all parent fields:
class EnrichedSchema(OrderSchema):
    revenue: pl.Float64
Python builtins and generics¶
- int -> pl.Int64
- float -> pl.Float64
- str -> pl.String
- bool -> pl.Boolean
- bytes -> pl.Binary
- list[T] -> pl.List(T)
- dict -> pl.Object

Nullability¶
All Polars columns are physically nullable. Use Optional[T] to declare that nulls are expected. X | None (the native union syntax, Python 3.10+) also works.
Note
Requires Polars >= 1.0.
- classmethod to_struct()[source]¶
Return {col: polars_dtype} for this schema. Cached after first call.
- classmethod from_struct(struct, name='GeneratedSchema')[source]¶
Create a PolarsSchema subclass from a {col: polars_dtype} dict.

Column names that are not valid Python identifiers are automatically sanitized and wrapped with alias().
- classmethod validate(df, subset=True)[source]¶
Compare a DataFrame or LazyFrame against this schema.
Returns a list of errors; an empty list means valid. subset=True (default): extra columns are fine. subset=False: extra columns are also errors.
- classmethod assert_valid(df, subset=True, history=None)[source]¶
Like validate but raises SchemaValidationError on failure.
- classmethod empty()[source]¶
Return an empty Polars DataFrame with this schema and zero rows.
- classmethod to_code()[source]¶
Generate valid Python source code for this schema class.
- classmethod diff(other)[source]¶
Return a human-readable diff between two PolarsSchema classes.
- Parameters:
other (type[PolarsSchema])