Supported Types¶
dfguard accepts whatever types your library accepts as annotations. There is no closed list compiled into the library; type dispatch is structural. This means any new type added in a future library release works automatically without a dfguard update. The scalar type tests in the test suite are generated at runtime from the installed library version to verify this claim.
PySpark¶
All annotations are instances, not classes (`T.LongType()`, not `T.LongType`). Complex types take arguments and enforce inner types recursively at the schema level.
Numeric

| Annotation | Spark SQL type | Notes |
|---|---|---|
| `T.ByteType()` | `ByteType` | 8-bit signed integer |
| `T.ShortType()` | `ShortType` | 16-bit signed integer |
| `T.IntegerType()` | `IntegerType` | 32-bit signed integer |
| `T.LongType()` | `LongType` | 64-bit signed integer |
| `T.FloatType()` | `FloatType` | 32-bit floating point |
| `T.DoubleType()` | `DoubleType` | 64-bit floating point |
| `T.DecimalType(10, 2)` | `DecimalType` | arbitrary precision |
String and binary

| Annotation | Spark SQL type | Notes |
|---|---|---|
| `T.StringType()` | `StringType` | |
| `T.CharType(10)` | `CharType(n)` | fixed-length (>= 3.3) |
| `T.VarcharType(10)` | `VarcharType(n)` | variable-length with max (>= 3.3) |
| `T.BinaryType()` | `BinaryType` | raw bytes |
Boolean and temporal

| Annotation | Spark SQL type | Notes |
|---|---|---|
| `T.BooleanType()` | `BooleanType` | |
| `T.DateType()` | `DateType` | |
| `T.TimestampType()` | `TimestampType` | with timezone (legacy Spark behaviour) |
| `T.TimestampNTZType()` | `TimestampNTZType` | no timezone (>= 3.4) |
| `T.DayTimeIntervalType()` | `DayTimeIntervalType` | day-time interval (>= 3.3) |
| `T.YearMonthIntervalType()` | `YearMonthIntervalType` | year-month interval (>= 3.3) |
Complex and nested

| Annotation | Spark SQL type | Notes |
|---|---|---|
| `T.ArrayType(T.LongType())` | `ArrayType` | inner type enforced |
| `T.MapType(T.StringType(), T.LongType())` | `MapType` | key and value types enforced |
| `T.StructType([T.StructField("a", T.LongType())])` | `StructType` | fully recursive |
| a nested schema class | `StructType` | converted to StructType automatically |
Nullability

| Annotation | Meaning |
|---|---|
| `T.LongType()` | declares `nullable=False` |
| `Optional[T.LongType()]` | declares `nullable=True` |
pandas¶
Note
Use pd.ArrowDtype for nested types. pd.ArrowDtype(pa.list_(pa.struct([...]))) gives pandas full inner-type enforcement at every depth, the same as PySpark and Polars. This is where PyArrow-backed pandas surpasses every other pandas dtype.
dfguard dispatches on the kind of annotation, not a hard-coded list. Any dtype that is an instance of np.dtype or pd.api.extensions.ExtensionDtype is accepted automatically, including third-party extension types.
NumPy dtypes

| Annotation | pandas dtype | Notes |
|---|---|---|
| `np.int8` | `int8` | |
| `np.int16` | `int16` | |
| `np.int32` | `int32` | |
| `np.int64` | `int64` | |
| `np.uint8` | `uint8` | |
| `np.uint16` | `uint16` | |
| `np.uint32` | `uint32` | |
| `np.uint64` | `uint64` | |
| `np.float16` | `float16` | |
| `np.float32` | `float32` | |
| `np.float64` | `float64` | |
| `np.bool_` | `bool` | |
| `np.datetime64` | `datetime64[ns]` | |
| `np.timedelta64` | `timedelta64[ns]` | |
| `np.object_` | `object` | Python objects (see note below) |
| `np.str_` | `object` | |
pandas nullable extension dtypes

| Annotation | pandas dtype | Notes |
|---|---|---|
| `pd.Int8Dtype()` | `Int8` | nullable |
| `pd.Int16Dtype()` | `Int16` | nullable |
| `pd.Int32Dtype()` | `Int32` | nullable |
| `pd.Int64Dtype()` | `Int64` | nullable |
| `pd.UInt8Dtype()` | `UInt8` | nullable |
| `pd.UInt16Dtype()` | `UInt16` | nullable |
| `pd.UInt32Dtype()` | `UInt32` | nullable |
| `pd.UInt64Dtype()` | `UInt64` | nullable |
| `pd.Float32Dtype()` | `Float32` | nullable |
| `pd.Float64Dtype()` | `Float64` | nullable |
| `pd.BooleanDtype()` | `boolean` | nullable |
| `pd.StringDtype()` | `string` | nullable |
| `pd.CategoricalDtype()` | `category` | |
| `pd.DatetimeTZDtype(tz="UTC")` | `datetime64[ns, UTC]` | timezone-aware |
| `pd.PeriodDtype("D")` | `period[D]` | |
| `pd.IntervalDtype("int64")` | `interval[int64]` | |
PyArrow-backed dtypes (pandas >= 1.5)
pd.ArrowDtype wraps any pyarrow.DataType and is a subclass of
pd.api.extensions.ExtensionDtype. dfguard accepts it through the same
structural path as every other extension dtype: no special handling, no
hard-coded list. Any pa.* type, including ones not listed here, works
automatically.
Integer and unsigned

| Annotation | pandas dtype | Notes |
|---|---|---|
| `pd.ArrowDtype(pa.int8())` | `int8[pyarrow]` | |
| `pd.ArrowDtype(pa.int16())` | `int16[pyarrow]` | |
| `pd.ArrowDtype(pa.int32())` | `int32[pyarrow]` | |
| `pd.ArrowDtype(pa.int64())` | `int64[pyarrow]` | |
| `pd.ArrowDtype(pa.uint8())` | `uint8[pyarrow]` | |
| `pd.ArrowDtype(pa.uint16())` | `uint16[pyarrow]` | |
| `pd.ArrowDtype(pa.uint32())` | `uint32[pyarrow]` | |
| `pd.ArrowDtype(pa.uint64())` | `uint64[pyarrow]` | |
Float and decimal

| Annotation | pandas dtype | Notes |
|---|---|---|
| `pd.ArrowDtype(pa.float16())` | `halffloat[pyarrow]` | |
| `pd.ArrowDtype(pa.float32())` | `float[pyarrow]` | |
| `pd.ArrowDtype(pa.float64())` | `double[pyarrow]` | |
| `pd.ArrowDtype(pa.decimal128(10, 2))` | `decimal128(10, 2)[pyarrow]` | arbitrary precision |
Boolean, string, and binary

| Annotation | pandas dtype | Notes |
|---|---|---|
| `pd.ArrowDtype(pa.bool_())` | `bool[pyarrow]` | |
| `pd.ArrowDtype(pa.string())` | `string[pyarrow]` | UTF-8 variable-length |
| `pd.ArrowDtype(pa.large_string())` | `large_string[pyarrow]` | 64-bit offsets |
| `pd.ArrowDtype(pa.binary())` | `binary[pyarrow]` | variable-length bytes |
| `pd.ArrowDtype(pa.large_binary())` | `large_binary[pyarrow]` | 64-bit offsets |
| `pd.ArrowDtype(pa.binary(16))` | `fixed_size_binary[16][pyarrow]` | fixed-width bytes (e.g. UUIDs) |
Temporal

| Annotation | pandas dtype | Notes |
|---|---|---|
| `pd.ArrowDtype(pa.date32())` | `date32[day][pyarrow]` | days since epoch |
| `pd.ArrowDtype(pa.date64())` | `date64[ms][pyarrow]` | milliseconds since epoch |
| `pd.ArrowDtype(pa.time32("s"))` | `time32[s][pyarrow]` | |
| `pd.ArrowDtype(pa.time64("us"))` | `time64[us][pyarrow]` | |
| `pd.ArrowDtype(pa.timestamp("us"))` | `timestamp[us][pyarrow]` | optional tz: `pa.timestamp("us", tz="UTC")` |
| `pd.ArrowDtype(pa.duration("us"))` | `duration[us][pyarrow]` | |
Complex and nested
This is where PyArrow-backed pandas surpasses every other pandas dtype. Inner types are enforced at full depth, exactly like PySpark and Polars.

| Annotation | pandas dtype | Notes |
|---|---|---|
| `pd.ArrowDtype(pa.list_(pa.int64()))` | `list<item: int64>[pyarrow]` | inner type enforced |
| `pd.ArrowDtype(pa.large_list(pa.int64()))` | `large_list<item: int64>[pyarrow]` | 64-bit offsets |
| `pd.ArrowDtype(pa.list_(pa.list_(pa.int64())))` | `list<item: list<item: int64>>[pyarrow]` | nested list |
| `pd.ArrowDtype(pa.struct([("a", pa.int64())]))` | `struct<a: int64>[pyarrow]` | struct with named fields, fully recursive |
| `pd.ArrowDtype(pa.list_(pa.struct([("a", pa.int64())])))` | `list<item: struct<a: int64>>[pyarrow]` | list of dicts |
| `pd.ArrowDtype(pa.map_(pa.string(), pa.int64()))` | `map<string, int64>[pyarrow]` | key and value types enforced |
| `pd.ArrowDtype(pa.map_(pa.string(), pa.list_(pa.int64())))` | `map<string, list<item: int64>>[pyarrow]` | map of string to list |
| `pd.ArrowDtype(pa.dictionary(pa.int32(), pa.string()))` | `dictionary<values=string, indices=int32, ordered=0>[pyarrow]` | dictionary-encoded (categorical) |
| `pd.ArrowDtype(pa.list_(pa.float32(), 128))` | `fixed_size_list<item: float>[128][pyarrow]` | fixed-width list (e.g. embeddings) |
Deeply nested example

```python
import pandas as pd
import pyarrow as pa

import dfguard as dfg

# list of structs, where one field is itself a list of floats
embedding_type = pd.ArrowDtype(
    pa.list_(
        pa.struct([
            pa.field("label", pa.string()),
            pa.field("scores", pa.list_(pa.float32())),
        ])
    )
)


class ModelOutput(dfg.PandasSchema):
    doc_id = pd.ArrowDtype(pa.int64())
    results = embedding_type
```
Any pa.* type, including ones not shown here, is accepted without a
dfguard update. The dispatch is structural, not a lookup table.
Note
pd.ArrowDtype gives pandas columns the same nested-type precision as
Polars and PySpark. Use pd.ArrowDtype(pa.list_(pa.int64())) instead of
list[int] when inner-type enforcement matters. The object dtype
limitation does not apply to PyArrow-backed columns.
Python builtins and generics

| Annotation | pandas dtype | Notes |
|---|---|---|
| `int` | `int64` | |
| `float` | `float64` | |
| `bool` | `bool` | |
| `str` | `object` | |
| `bytes` | `object` | |
| `datetime.datetime` | `datetime64[ns]` | |
| `list[int]` | `object` | inner type not enforced (use ArrowDtype instead) |
Nullability

| Annotation | Meaning |
|---|---|
| `int` | non-nullable (NaN collapses to float) |
| `pd.Int64Dtype()` | nullable integer (no NaN collapse) |
| `Optional[int]` | marks nullable intent; use `pd.Int64Dtype()` for actual null support |
| `int \| None` | native Python union syntax, also accepted |
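The NaN-collapse behaviour in the first two rows can be seen directly in pandas:

```python
import pandas as pd

# With the default NumPy-backed int64, one missing value forces the
# whole column to float64 (the classic NaN collapse):
collapsed = pd.Series([1, 2, None])
assert str(collapsed.dtype) == "float64"

# The nullable extension dtype keeps integers and represents the
# missing value as pd.NA instead:
kept = pd.Series([1, 2, None], dtype="Int64")
assert str(kept.dtype) == "Int64"
assert kept.isna().sum() == 1
```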
pandas dtype reference | pd.ArrowDtype reference | pandas PyArrow integration guide
Polars¶
Polars dtypes work as both classes (pl.Int64) and instances (pl.Datetime("ms", "UTC")). Both are accepted. Complex types enforce inner types at the schema level.
Integer

| Annotation | Polars dtype | Notes |
|---|---|---|
| `pl.Int8` | `Int8` | 8-bit signed |
| `pl.Int16` | `Int16` | 16-bit signed |
| `pl.Int32` | `Int32` | 32-bit signed |
| `pl.Int64` | `Int64` | 64-bit signed |
| `pl.UInt8` | `UInt8` | unsigned |
| `pl.UInt16` | `UInt16` | unsigned |
| `pl.UInt32` | `UInt32` | unsigned |
| `pl.UInt64` | `UInt64` | unsigned |
Float and numeric

| Annotation | Polars dtype | Notes |
|---|---|---|
| `pl.Float32` | `Float32` | |
| `pl.Float64` | `Float64` | |
| `pl.Decimal` | `Decimal` | arbitrary precision |
String, binary, and boolean

| Annotation | Polars dtype | Notes |
|---|---|---|
| `pl.String` | `String` | |
| `pl.Binary` | `Binary` | raw bytes |
| `pl.Boolean` | `Boolean` | |
| `pl.Categorical` | `Categorical` | |
| `pl.Enum(["low", "high"])` | `Enum` | fixed set of strings |
Temporal

| Annotation | Polars dtype | Notes |
|---|---|---|
| `pl.Date` | `Date` | |
| `pl.Datetime("us", "UTC")` | `Datetime` | optional time unit + timezone |
| `pl.Time` | `Time` | |
| `pl.Duration` | `Duration` | |
Complex and nested

| Annotation | Polars dtype | Notes |
|---|---|---|
| `pl.List(pl.Int64)` | `List` | inner type enforced |
| `pl.Array(pl.Float32, 128)` | `Array` | fixed-width, inner type enforced |
| `pl.Struct({"a": pl.Int64})` | `Struct` | recursive, all field types enforced |
| `pl.Object` | `Object` | arbitrary Python objects |
| `pl.Null` | `Null` | all-null column |
Python builtins and generics

| Annotation | Polars dtype | Notes |
|---|---|---|
| `int` | `Int64` | |
| `float` | `Float64` | |
| `bool` | `Boolean` | |
| `str` | `String` | |
| `bytes` | `Binary` | |
| `list[int]` | `List(Int64)` | inner type preserved |
| `datetime.datetime` | `Datetime` | |
| `datetime.date` | `Date` | |
| `datetime.timedelta` | `Duration` | |
Nullability

| Annotation | Meaning |
|---|---|
| `pl.Int64` | physically nullable (all Polars columns are) |
| `Optional[pl.Int64]` | declares that nulls are intentional in this column |
| `pl.Int64 \| None` | native Python union syntax, also accepted |
Runtime type coverage¶
Scalar types for PySpark and Polars are tested via runtime discovery: the test
suite walks T.DataType.__subclasses__() and pl.DataType.__subclasses__()
recursively at test time and runs every concrete, no-argument-constructible type
through the conversion pipeline. New types added in future library releases are
covered automatically. Complex nested types are tested with multi-level
constructions (three-level nested struct, array of structs containing maps, etc.)
to verify that inner types are enforced at every depth.
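The discovery pattern can be sketched generically. `walk_concrete` below is a hypothetical helper, shown on a toy class hierarchy rather than on the real `T.DataType` or `pl.DataType` trees, but the recursion is the same:

```python
def walk_concrete(base):
    """Recursively collect subclasses of `base` that can be
    constructed with no arguments."""
    found = []
    for sub in base.__subclasses__():
        try:
            sub()  # keep only no-argument-constructible types
        except TypeError:
            pass
        else:
            found.append(sub)
        found.extend(walk_concrete(sub))  # recurse into grandchildren
    return found


# Toy stand-in for a dtype hierarchy:
class DataType: ...
class IntegerType(DataType): ...
class Int8Type(IntegerType): ...
class DecimalType(DataType):  # requires arguments -> skipped
    def __init__(self, precision, scale):
        self.precision, self.scale = precision, scale


print([cls.__name__ for cls in walk_concrete(DataType)])
# → ['IntegerType', 'Int8Type']
```

Running the same walk over a library's real base class at test time is what keeps the scalar-type coverage in sync with the installed library version.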