Dataset
Warning
dataset() is an internal utility and is not part of the stable public API.
It may change or be removed in future releases.
dfg.dataset(df) wraps a DataFrame and records schema-changing operations
in schema_history. Schema tracking is limited to a fixed set of explicitly
wrapped methods listed below. Calling any other method on the wrapper passes
through to the underlying DataFrame but breaks the tracking chain: the returned
object is a plain DataFrame, not a tracked dataset.
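This pass-through behavior can be illustrated with a minimal, hypothetical sketch. None of this is dfguard's actual implementation: `TrackedDataset`, `FakeDF`, and the two-method `TRACKED` set are invented for illustration. The point it demonstrates is the one above: a method in the tracked set re-wraps its result and extends the history, while any other method forwards to the wrapped object and returns the result unwrapped, breaking the chain.

```python
class TrackedDataset:
    """Illustrative wrapper: only methods in TRACKED keep the tracking chain."""
    TRACKED = {"drop", "select"}  # stand-in for the full tracked-method list

    def __init__(self, df, history=()):
        self._df = df
        self.schema_history = history

    def __getattr__(self, name):
        attr = getattr(self._df, name)
        if name not in self.TRACKED or not callable(attr):
            # Pass-through: the caller gets whatever the underlying
            # DataFrame returns, so the tracking chain is broken here.
            return attr

        def tracked(*args, **kwargs):
            result = attr(*args, **kwargs)
            # Re-wrap the result and record the operation name.
            return TrackedDataset(result, self.schema_history + (name,))

        return tracked


class FakeDF:
    """Stand-in DataFrame with one tracked and one untracked method."""
    def drop(self, col):
        return FakeDF()

    def describe(self):
        return FakeDF()


ds = TrackedDataset(FakeDF())
still_tracked = ds.drop("tags")  # TrackedDataset; history records 'drop'
plain = ds.describe()            # plain FakeDF: chain broken from here on
```

The same shape applies to the real wrapper: once an untracked call returns a bare DataFrame, later tracked-method calls on that object no longer update any history.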
- dfguard.pyspark.dataset._make_dataset(df, *, history=None, strict=False)[source]

  Create a _TypedDatasetBase instance from a DataFrame. This is the low-level
  factory used internally (by _wrap, decorators, and SparkSchema.empty).
  End-users should call dataset(df) to get a tracked instance, or
  schema_of(df) to get the schema type class.
Tracked methods: withColumn, withColumns, withColumnRenamed,
withColumnsRenamed, withMetadata, drop, select, selectExpr,
toDF, filter, where, limit, sample, distinct,
dropDuplicates, orderBy, repartition, repartitionByRange,
coalesce, union, unionByName, intersect, intersectAll,
subtract, join, crossJoin, groupBy, rollup, cube,
na, stat, transform, unpivot, agg, count, mean,
avg, sum, min, max, pivot, apply, applyInPandas
- dfguard.pandas.dataset._make_dataset(df, *, history=None)[source]

  Low-level factory used internally and by the public dataset() alias.

  - Parameters:
    df (Any)
    history (SchemaHistory | None)
  - Return type: _PandasDatasetBase
Tracked methods: assign, rename, drop, select, astype,
filter, query, head, tail, sample, drop_duplicates,
sort_values, reset_index, merge, join, groupby, melt,
pivot_table, explode, agg, sum, mean, count, min,
max, first
schema_history
Every dataset wrapper exposes schema_history, an immutable record of all
schema-changing operations since the DataFrame was wrapped.
ds = dfg.dataset(raw_df)
ds = ds.withColumn("revenue", F.col("amount") * F.col("quantity"))
ds = ds.drop("tags")
ds.schema_history.print()
# Schema Evolution
# [ 0] input
# [ 1] withColumn('revenue') -- added: revenue:double
# [ 2] drop(['tags']) -- dropped: tags
- class dfguard.pyspark.history.SchemaHistory(changes)[source]

  Immutable sequence of SchemaChange entries. Each append returns a new
  instance; the original history is never modified.

  - Parameters:
    changes (tuple[SchemaChange, ...])
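The append-returns-a-new-instance behavior can be sketched with frozen dataclasses. This is a hypothetical reimplementation, not dfguard's code, and the `SchemaChange` fields (`op`, `detail`) are assumed for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaChange:
    op: str       # e.g. "withColumn" (field names assumed)
    detail: str   # e.g. "added: revenue:double"


@dataclass(frozen=True)
class SchemaHistory:
    changes: tuple = ()

    def append(self, change):
        # Never mutates: builds a new history with the change appended,
        # so earlier snapshots of the history remain valid.
        return SchemaHistory(self.changes + (change,))


h0 = SchemaHistory()
h1 = h0.append(SchemaChange("withColumn", "added: revenue:double"))
h2 = h1.append(SchemaChange("drop", "dropped: tags"))
```

Because each step returns a fresh instance, `h0` and `h1` are unchanged after `h2` is created, which is what makes it safe for every dataset wrapper in a chain to hold its own history snapshot.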