Dataset

Warning

dataset() is an internal utility and is not part of the stable public API. It may change or be removed in future releases.

dfg.dataset(df) wraps a DataFrame and records schema-changing operations in schema_history. Schema tracking is limited to a fixed set of explicitly wrapped methods listed below. Calling any other method on the wrapper passes through to the underlying DataFrame but breaks the tracking chain: the returned object is a plain DataFrame, not a tracked dataset.
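The tracking boundary can be sketched with a minimal wrapper in plain Python (hypothetical names and a toy stand-in DataFrame; dfguard's actual implementation differs): whitelisted methods re-wrap their result and extend the history, while everything else falls through `__getattr__` and returns the raw, untracked object.

```python
class TrackedDataset:
    """Minimal sketch of a method-whitelist wrapper (illustrative only)."""

    TRACKED = {"select", "drop"}  # hypothetical subset of the wrapped methods

    def __init__(self, df, history=()):
        self._df = df
        self.schema_history = history

    def __getattr__(self, name):
        attr = getattr(self._df, name)
        if name in self.TRACKED and callable(attr):
            # Tracked call: re-wrap the result and extend the history.
            def wrapped(*args, **kwargs):
                result = attr(*args, **kwargs)
                return TrackedDataset(result, self.schema_history + (name,))
            return wrapped
        # Untracked call: passes through -- the result is NOT re-wrapped.
        return attr


class _FakeDF:
    """Stand-in for a DataFrame, just for the demo."""
    def select(self, *cols):
        return _FakeDF()
    def describe(self):
        return "plain result"


ds = TrackedDataset(_FakeDF())
tracked = ds.select("a")    # re-wrapped: still a TrackedDataset
untracked = ds.describe()   # pass-through: plain object, tracking chain broken
```

Once a pass-through call like `describe()` happens, the chain must be restarted with `dfg.dataset(...)` to resume tracking.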

dfguard.pyspark.dataset._make_dataset(df, *, history=None, strict=False)[source]

Create a _TypedDatasetBase instance from a DataFrame.

This is the low-level factory used internally (by _wrap, decorators, and SparkSchema.empty). End-users should call dataset(df) to get a tracked instance, or schema_of(df) to get the schema type class.

Return type:

_TypedDatasetBase

Parameters:
  • df (Any)

  • history (SchemaHistory | None)

  • strict (bool)

Tracked methods: withColumn, withColumns, withColumnRenamed, withColumnsRenamed, withMetadata, drop, select, selectExpr, toDF, filter, where, limit, sample, distinct, dropDuplicates, orderBy, repartition, repartitionByRange, coalesce, union, unionByName, intersect, intersectAll, subtract, join, crossJoin, groupBy, rollup, cube, na, stat, transform, unpivot, agg, count, mean, avg, sum, min, max, pivot, apply, applyInPandas

dfguard.pandas.dataset._make_dataset(df, *, history=None)[source]

Low-level factory used internally and by the public dataset() alias.

Return type:

_PandasDatasetBase

Parameters:
  • df (Any)

  • history (SchemaHistory | None)

Tracked methods: assign, rename, drop, select, astype, filter, query, head, tail, sample, drop_duplicates, sort_values, reset_index, merge, join, groupby, melt, pivot_table, explode, agg, sum, mean, count, min, max, first

dfguard.polars.dataset.dataset(df)

Works with both pl.DataFrame and pl.LazyFrame.

Return type:

_PolarsDatasetBase

Parameters:
  • df (Any)

Tracked methods: with_columns, rename, drop, select, filter, sort, unique, join, group_by, agg

schema_history

Every dataset wrapper exposes schema_history, an immutable record of all schema-changing operations since the DataFrame was wrapped.

ds = dfg.dataset(raw_df)
ds = ds.withColumn("revenue", F.col("amount") * F.col("quantity"))
ds = ds.drop("tags")

ds.schema_history.print()
# Schema Evolution
#   [ 0] input
#   [ 1] withColumn('revenue')  -- added: revenue:double
#   [ 2] drop(['tags'])         -- dropped: tags

class dfguard.pyspark.history.SchemaHistory(changes)[source]

An immutable sequence of SchemaChange records. Appending never mutates in place; each append returns a new SchemaHistory instance.

Parameters:

changes (tuple[SchemaChange, ...])

print()[source]

Print the schema evolution to stdout.

Return type:

None
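The append-returns-new-instance behavior can be sketched with a stripped-down version (a hypothetical `MiniHistory` holding plain strings; the real SchemaHistory stores full SchemaChange records):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniHistory:
    """Illustrative immutable history; real SchemaHistory stores SchemaChange objects."""
    changes: tuple = ()

    def append(self, change):
        # Never mutates: builds and returns a fresh instance.
        return MiniHistory(self.changes + (change,))


h0 = MiniHistory()
h1 = h0.append("withColumn('revenue')")
h2 = h1.append("drop(['tags'])")
```

Because earlier instances are never modified, intermediate datasets keep their own history snapshots even as the chain grows.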

class dfguard.pyspark.history.SchemaChange(operation, schema_after, added=<factory>, dropped=<factory>, type_changed=<factory>, nullable_changed=<factory>)[source]

One step in the schema evolution chain.

classmethod compute(operation, before, after)[source]

Diff two Spark StructTypes and record exactly what changed, including changes inside nested types.

Return type:

SchemaChange

Parameters:
  • operation

  • before (StructType)

  • after (StructType)

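Conceptually, the diff can be sketched over plain name-to-type mappings (a simplification; the real method walks Spark StructType fields, including nested structs):

```python
def diff_schemas(operation, before, after):
    """Toy schema diff over {column: dtype} dicts (illustrative only)."""
    added = {c: t for c, t in after.items() if c not in before}
    dropped = [c for c in before if c not in after]
    type_changed = {
        c: (before[c], after[c])
        for c in before
        if c in after and before[c] != after[c]
    }
    return {
        "operation": operation,
        "added": added,
        "dropped": dropped,
        "type_changed": type_changed,
    }


change = diff_schemas(
    "withColumn('revenue')",
    {"amount": "double", "quantity": "long", "tags": "array<string>"},
    {"amount": "double", "quantity": "long",
     "tags": "array<string>", "revenue": "double"},
)
```

This mirrors the history entry shown earlier: a withColumn step records one added column and nothing dropped or retyped.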
dfguard.pandas.history.PandasSchemaHistory

alias of SchemaHistory

dfguard.pandas.history.PandasSchemaChange

alias of DictSchemaChange

dfguard.polars.history.PolarsSchemaHistory

alias of SchemaHistory

dfguard.polars.history.PolarsSchemaChange

alias of DictSchemaChange