Dataset

Warning

dataset() is an internal utility and is not part of the stable public API. It may change or be removed in future releases.

dfg.dataset(df) wraps a DataFrame and records schema-changing operations in schema_history. Schema tracking is limited to a fixed set of explicitly wrapped methods listed below. Calling any other method on the wrapper passes through to the underlying DataFrame but breaks the tracking chain: the returned object is a plain DataFrame, not a tracked dataset.
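The tracking boundary can be sketched with a minimal wrapper in plain Python (hypothetical names and a toy stand-in DataFrame; dfguard's actual implementation differs): whitelisted methods re-wrap their result and extend the history, while everything else falls through `__getattr__` and returns the raw, untracked object.

```python
class TrackedDataset:
    """Minimal sketch of a method-whitelist wrapper (illustrative only)."""

    TRACKED = {"select", "drop"}  # hypothetical subset of the wrapped methods

    def __init__(self, df, history=()):
        self._df = df
        self.schema_history = history

    def __getattr__(self, name):
        attr = getattr(self._df, name)
        if name in self.TRACKED and callable(attr):
            # Tracked call: re-wrap the result and extend the history.
            def wrapped(*args, **kwargs):
                result = attr(*args, **kwargs)
                return TrackedDataset(result, self.schema_history + (name,))
            return wrapped
        # Untracked call: passes through -- the result is NOT re-wrapped.
        return attr


class _FakeDF:
    """Stand-in for a DataFrame, just for the demo."""
    def select(self, *cols):
        return _FakeDF()
    def describe(self):
        return "plain result"


ds = TrackedDataset(_FakeDF())
tracked = ds.select("a")    # re-wrapped: still a TrackedDataset
untracked = ds.describe()   # pass-through: plain object, tracking chain broken
```

Once a pass-through call like `describe()` happens, the chain must be restarted with `dfg.dataset(...)` to resume tracking.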

dfguard.pyspark.dataset._make_dataset(df, *, history=None, strict=False)[source]

Create a _TypedDatasetBase instance from a DataFrame.

This is the low-level factory used internally (by _wrap, decorators, and SparkSchema.empty). End-users should call dataset(df) to get a tracked instance, or schema_of(df) to get the schema type class.

Return type:

_TypedDatasetBase

Parameters:
  • df (Any)

  • history (SchemaHistory | None)

  • strict (bool)

Tracked methods: withColumn, withColumns, withColumnRenamed, withColumnsRenamed, withMetadata, drop, select, selectExpr, toDF, filter, where, limit, sample, distinct, dropDuplicates, orderBy, repartition, repartitionByRange, coalesce, union, unionByName, intersect, intersectAll, subtract, join, crossJoin, groupBy, rollup, cube, na, stat, transform, unpivot, agg, count, mean, avg, sum, min, max, pivot, apply, applyInPandas

dfguard.pandas.dataset._make_dataset(df, *, history=None)[source]

Low-level factory used internally and by the public dataset() alias.

Return type:

_PandasDatasetBase

Parameters:
  • df (Any)

  • history (SchemaHistory | None)

Tracked methods: assign, rename, drop, select, astype, filter, query, head, tail, sample, drop_duplicates, sort_values, reset_index, merge, join, groupby, melt, pivot_table, explode, agg, sum, mean, count, min, max, first

dfguard.polars.dataset.dataset(df)

Works with both pl.DataFrame and pl.LazyFrame.

Return type:

_PolarsDatasetBase

Parameters:
  • df (Any)

Tracked methods: with_columns, rename, drop, select, filter, sort, unique, join, group_by, agg

schema_history

Every dataset wrapper exposes schema_history, an immutable record of all schema-changing operations since the DataFrame was wrapped.

ds = dfg.dataset(raw_df)
ds = ds.withColumn("revenue", F.col("amount") * F.col("quantity"))
ds = ds.drop("tags")

ds.schema_history.print()
# Schema Evolution
#   [ 0] input
#   [ 1] withColumn('revenue')  -- added: revenue:double
#   [ 2] drop(['tags'])         -- dropped: tags

class dfguard.pyspark.history.SchemaHistory(changes)[source]

An immutable sequence of SchemaChange records. Appending never mutates in place; each append returns a new SchemaHistory instance.

Parameters:

changes (tuple[SchemaChange, ...])

print()[source]

Print the schema evolution to stdout.

Return type:

None
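The append-returns-new-instance behavior can be sketched with a stripped-down version (a hypothetical `MiniHistory` holding plain strings; the real SchemaHistory stores full SchemaChange records):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniHistory:
    """Illustrative immutable history; real SchemaHistory stores SchemaChange objects."""
    changes: tuple = ()

    def append(self, change):
        # Never mutates: builds and returns a fresh instance.
        return MiniHistory(self.changes + (change,))


h0 = MiniHistory()
h1 = h0.append("withColumn('revenue')")
h2 = h1.append("drop(['tags'])")
```

Because earlier instances are never modified, intermediate datasets keep their own history snapshots even as the chain grows.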

class dfguard.pyspark.history.SchemaChange(operation, schema_after, added=<factory>, dropped=<factory>, type_changed=<factory>, nullable_changed=<factory>)[source]

One step in the schema evolution chain.

classmethod compute(operation, before, after)[source]

Diff two Spark StructTypes and record exactly what changed, including changes inside nested types.

Return type:

SchemaChange

Parameters:
  • operation

  • before (StructType)

  • after (StructType)

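Conceptually, the diff can be sketched over plain name-to-type mappings (a simplification; the real method walks Spark StructType fields, including nested structs):

```python
def diff_schemas(operation, before, after):
    """Toy schema diff over {column: dtype} dicts (illustrative only)."""
    added = {c: t for c, t in after.items() if c not in before}
    dropped = [c for c in before if c not in after]
    type_changed = {
        c: (before[c], after[c])
        for c in before
        if c in after and before[c] != after[c]
    }
    return {
        "operation": operation,
        "added": added,
        "dropped": dropped,
        "type_changed": type_changed,
    }


change = diff_schemas(
    "withColumn('revenue')",
    {"amount": "double", "quantity": "long", "tags": "array<string>"},
    {"amount": "double", "quantity": "long",
     "tags": "array<string>", "revenue": "double"},
)
```

This mirrors the history entry shown earlier: a withColumn step records one added column and nothing dropped or retyped.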
dfguard.pandas.history.PandasSchemaHistory

alias of SchemaHistory

dfguard.pandas.history.PandasSchemaChange

alias of DictSchemaChange

dfguard.polars.history.PolarsSchemaHistory

alias of SchemaHistory

dfguard.polars.history.PolarsSchemaChange

alias of DictSchemaChange