Dataset
=======

.. warning::

   ``dataset()`` is an internal utility and is not part of the stable public API.
   It may change or be removed in future releases.

``dfg.dataset(df)`` wraps a DataFrame and records schema-changing operations in
``schema_history``.

Schema tracking is limited to a fixed set of explicitly wrapped methods listed below.
Calling any other method on the wrapper passes through to the underlying DataFrame
but breaks the tracking chain: the returned object is a plain DataFrame, not a
tracked dataset.

.. tab-set::

   .. tab-item:: PySpark
      :sync: pyspark

      .. autofunction:: dfguard.pyspark.dataset._make_dataset

      **Tracked methods:** ``withColumn``, ``withColumns``, ``withColumnRenamed``,
      ``withColumnsRenamed``, ``withMetadata``, ``drop``, ``select``, ``selectExpr``,
      ``toDF``, ``filter``, ``where``, ``limit``, ``sample``, ``distinct``,
      ``dropDuplicates``, ``orderBy``, ``repartition``, ``repartitionByRange``,
      ``coalesce``, ``union``, ``unionByName``, ``intersect``, ``intersectAll``,
      ``subtract``, ``join``, ``crossJoin``, ``groupBy``, ``rollup``, ``cube``,
      ``na``, ``stat``, ``transform``, ``unpivot``, ``agg``, ``count``, ``mean``,
      ``avg``, ``sum``, ``min``, ``max``, ``pivot``, ``apply``, ``applyInPandas``

   .. tab-item:: pandas
      :sync: pandas

      .. autofunction:: dfguard.pandas.dataset._make_dataset

      **Tracked methods:** ``assign``, ``rename``, ``drop``, ``select``, ``astype``,
      ``filter``, ``query``, ``head``, ``tail``, ``sample``, ``drop_duplicates``,
      ``sort_values``, ``reset_index``, ``merge``, ``join``, ``groupby``, ``melt``,
      ``pivot_table``, ``explode``, ``agg``, ``sum``, ``mean``, ``count``, ``min``,
      ``max``, ``first``

   .. tab-item:: Polars
      :sync: polars

      Works with both ``pl.DataFrame`` and ``pl.LazyFrame``.

      .. autofunction:: dfguard.polars.dataset.dataset

      **Tracked methods:** ``with_columns``, ``rename``, ``drop``, ``select``,
      ``filter``, ``sort``, ``unique``, ``join``, ``group_by``, ``agg``

schema_history
--------------

Every dataset wrapper exposes ``schema_history``, an immutable record of all
schema-changing operations since the DataFrame was wrapped.

.. code-block:: python

   ds = dfg.dataset(raw_df)
   ds = ds.withColumn("revenue", F.col("amount") * F.col("quantity"))
   ds = ds.drop("tags")

   ds.schema_history.print()
   # Schema Evolution
   # [ 0] input
   # [ 1] withColumn('revenue') -- added: revenue:double
   # [ 2] drop(['tags']) -- dropped: tags

.. tab-set::

   .. tab-item:: PySpark
      :sync: pyspark

      .. autoclass:: dfguard.pyspark.history.SchemaHistory
         :members:

      .. autoclass:: dfguard.pyspark.history.SchemaChange
         :members:

   .. tab-item:: pandas
      :sync: pandas

      .. autoclass:: dfguard.pandas.history.PandasSchemaHistory
         :members:

      .. autoclass:: dfguard.pandas.history.PandasSchemaChange
         :members:

   .. tab-item:: Polars
      :sync: polars

      .. autoclass:: dfguard.polars.history.PolarsSchemaHistory
         :members:

      .. autoclass:: dfguard.polars.history.PolarsSchemaChange
         :members:
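
The pass-through rule described at the top of this page (wrapped methods keep
the history, any other method returns a plain DataFrame) can be illustrated in
plain Python. The sketch below is a hypothetical illustration and not dfguard's
implementation: ``ToyFrame``, ``TrackedFrame``, and ``reversed_columns`` are
invented names, the "DataFrame" is only a tuple of column names, and the history
is a simple tuple of strings.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass(frozen=True)
   class ToyFrame:
       """Stand-in for a real DataFrame: an ordered tuple of column names."""

       columns: tuple

       def drop(self, name):
           return ToyFrame(tuple(c for c in self.columns if c != name))

       def reversed_columns(self):
           return ToyFrame(self.columns[::-1])


   class TrackedFrame:
       """Hypothetical wrapper: only methods defined here extend the history."""

       def __init__(self, frame, history=("input",)):
           self._frame = frame
           self.schema_history = tuple(history)  # tuples cannot be mutated

       def drop(self, name):
           # Explicitly wrapped: re-wrap the result and append a history entry.
           entry = f"drop({name!r}) -- dropped: {name}"
           return TrackedFrame(self._frame.drop(name), self.schema_history + (entry,))

       def __getattr__(self, attr):
           # Invoked only when normal lookup on the wrapper fails.
           if attr.startswith("_"):
               raise AttributeError(attr)
           # Delegate to the underlying frame; the result is not re-wrapped.
           return getattr(self._frame, attr)


   ds = TrackedFrame(ToyFrame(("id", "amount", "tags")))
   tracked = ds.drop("tags")             # wrapped method: still a TrackedFrame
   plain = tracked.reversed_columns()    # pass-through: a bare ToyFrame

   print(type(tracked).__name__, tracked.schema_history)
   print(type(plain).__name__, hasattr(plain, "schema_history"))

In this sketch, ``tracked`` still carries a two-entry history, while ``plain``
has no ``schema_history`` attribute at all. Under these assumptions, that is the
sense in which an unwrapped call "breaks the tracking chain".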