Schema

class dataframely.Schema[source]

Base class for all custom data frame schema definitions.

A custom schema should only define its columns via simple assignment:

class MySchema(Schema):
    a = dataframely.Int64()
    b = dataframely.String()

All attribute definitions that are not column types are ignored.

Schemas can also be subclassed (arbitrarily deeply): in this case, the columns defined in the subclass are simply appended to the columns of the superclass(es).
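For example, a minimal sketch extending MySchema from above (Float64 is assumed to be available alongside the column types shown):

class MyExtendedSchema(MySchema):
    # Appends column "c" to the inherited columns "a" and "b".
    c = dataframely.Float64()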

Methods:

  • cast – Cast a data frame to match the schema.

  • column_names – The column names of this schema.

  • columns – The column definitions of this schema.

  • create_empty – Create an empty data or lazy frame from this schema.

  • create_empty_if_none – Impute None input with an empty, schema-compliant lazy or eager data frame, or return the input as a lazy or eager frame.

  • filter – Filter the data frame by the rules of this schema.

  • is_valid – Check whether a data frame satisfies the schema.

  • matches – Check whether this schema semantically matches another schema.

  • primary_key – The primary key columns in this schema (possibly empty).

  • read_delta – Read a Delta Lake table into a typed data frame with this schema.

  • read_parquet – Read a parquet file into a typed data frame with this schema.

  • sample – Create a random data frame with a predefined number of rows.

  • scan_delta – Lazily read a Delta Lake table into a typed data frame with this schema.

  • scan_parquet – Lazily read a parquet file into a typed data frame with this schema.

  • serialize – Serialize this schema to a JSON string.

  • sink_parquet – Stream a typed lazy frame with this schema to a parquet file.

  • to_polars_schema – Obtain the polars schema for this schema.

  • to_pyarrow_schema – Obtain the pyarrow schema for this schema.

  • to_sqlalchemy_columns – Obtain the SQLAlchemy column definitions for a particular dialect for this schema.

  • validate – Validate that a data frame satisfies the schema.

  • write_delta – Write a typed data frame with this schema to a Delta Lake table.

  • write_parquet – Write a typed data frame with this schema to a parquet file.

classmethod cast(df: DataFrame | LazyFrame, /) → DataFrame[Self] | LazyFrame[Self][source]

Cast a data frame to match the schema.

This method removes superfluous columns and casts all schema columns to the correct dtypes. However, it does not introspect the data frame contents.

Hence, this method should be used with care, and validate() should generally be preferred. It is advised to only use this method if df is known with certainty to adhere to the schema.

Returns:

The input data frame, wrapped in a generic version of the input’s data frame type to reflect schema adherence.

Note

If you only require a generic data frame for the type checker, consider using typing.cast() instead of this method.

Attention

For lazy frames, casting is not performed eagerly. This avoids collecting the lazy frame’s schema, but it also means that a call to polars.LazyFrame.collect() further down the line might fail because of the cast and/or missing columns.
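A minimal sketch using MySchema from above, assuming the input is already known to be schema-compliant:

import polars as pl

raw = pl.DataFrame({"a": [1, 2], "b": ["x", "y"], "extra": [0, 0]})
# Drops the superfluous "extra" column and casts "a" and "b" to the schema dtypes.
typed = MySchema.cast(raw)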

classmethod column_names() → list[str][source]

The column names of this schema.

classmethod columns() → dict[str, Column][source]

The column definitions of this schema.
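As a brief sketch for MySchema from the top of this page (the reprs in the comments are illustrative):

MySchema.column_names()  # ["a", "b"]
MySchema.columns()       # {"a": <Int64 column>, "b": <String column>}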

classmethod create_empty(*, lazy: bool = False) → DataFrame[Self] | LazyFrame[Self][source]

Create an empty data or lazy frame from this schema.

Parameters:

lazy – Whether to create a lazy data frame. If True, returns a lazy frame with this Schema. Otherwise, returns an eager frame.

Returns:

An instance of polars.DataFrame or polars.LazyFrame with this schema’s defined columns and their data types.

classmethod create_empty_if_none(
    df: DataFrame[Self] | LazyFrame[Self] | None,
    *,
    lazy: bool = False,
) → DataFrame[Self] | LazyFrame[Self][source]

Impute None input with an empty, schema-compliant lazy or eager data frame, or return the input as a lazy or eager frame.

Parameters:
  • df – The data frame to check for None. If it is not None, it is returned as a lazy or eager frame. Otherwise, a schema-compliant data or lazy frame with no rows is returned.

  • lazy – Whether to return a lazy data frame. If True, returns a lazy frame with this Schema. Otherwise, returns an eager frame.

Returns:

The given data frame df as a lazy or eager frame, if it is not None. Otherwise, an instance of polars.DataFrame or polars.LazyFrame with this schema’s defined columns and their data types, but no rows.
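A short sketch of both helpers, again using MySchema from above:

empty = MySchema.create_empty()  # eager frame with zero rows
lazy_empty = MySchema.create_empty_if_none(None, lazy=True)  # empty LazyFrame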

classmethod filter(
    df: DataFrame | LazyFrame,
    /,
    *,
    cast: bool = False,
    eager: bool = True,
) → FilterResult[Self] | LazyFilterResult[Self][source]

Filter the data frame by the rules of this schema.

This method can be thought of as a “soft alternative” to validate(). While validate() raises an exception when a row does not adhere to the rules defined in the schema, this method simply filters out these rows and succeeds.

Parameters:
  • df – The data frame to filter for valid rows. The data frame is collected within this method, regardless of whether a DataFrame or LazyFrame is passed.

  • cast – Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type if possible. Rows for which the cast fails for any column are filtered out.

  • eager – Whether the filter operation should be performed eagerly. If False, the returned lazy frame will fail to collect if the validation does not pass.

Returns:

A tuple of the validated rows in the input data frame (potentially empty) and a simple dataclass carrying information about the rows of the data frame which could not be validated successfully. Just like in polars’ native filter(), the order of rows in the returned data frame is maintained.

Raises:

ValidationError – If the columns of the input data frame are invalid. This happens only if the data frame is missing a column defined in the schema or a column has an invalid dtype while cast is set to False.

Note

This method preserves the ordering of the input data frame.
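A sketch of the soft-validation workflow; the counts() accessor on the failure object is an assumption and may differ by version:

import polars as pl

df = pl.DataFrame({"a": [1, 2], "b": ["x", "y"]})
good, failure = MySchema.filter(df, cast=True)
print(len(good))         # rows that passed all rules
print(failure.counts())  # assumed accessor: number of failures per rule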

classmethod is_valid(df: DataFrame | LazyFrame, /, *, cast: bool = False) → bool[source]

Check whether a data frame satisfies the schema.

This method has two major differences from validate():

  • It always collects the input to eagerly evaluate validity and return a boolean value.

  • It does not raise any of the documented exceptions for validate() and instead returns a value of False. Note that it still raises an exception if a lazy frame is provided as input and any logic prior to the validation causes an exception.

Parameters:
  • df – The data frame to check for validity.

  • cast – Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type before running validation. If set to False, a wrong data type will result in a return value of False.

Returns:

Whether the provided data frame can be validated with this schema.

Notes

If you want to customize the engine being used for collecting the result within this method, consider wrapping the call in a context manager that sets the engine_affinity in the polars.Config.
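For instance, a minimal sketch that pins the engine (assuming a polars version where Config exposes engine_affinity):

import polars as pl

with pl.Config(engine_affinity="streaming"):
    ok = MySchema.is_valid(lf)  # lf: the LazyFrame to check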

classmethod matches(other: type[Schema]) → bool[source]

Check whether this schema semantically matches another schema.

This method checks whether the schemas have the same columns (with the same data types and constraints) as well as the same rules.

Parameters:

other – The schema to compare with.

Returns:

Whether the schemas are semantically equal.
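A quick illustrative sketch, reusing the column definitions of MySchema from above:

class MySchemaCopy(Schema):
    a = dataframely.Int64()
    b = dataframely.String()

assert MySchema.matches(MySchemaCopy)  # identical columns, dtypes, and rules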

classmethod primary_key() → list[str][source]

The primary key columns in this schema (possibly empty).

classmethod read_delta(
    source: str | Path | deltalake.DeltaTable,
    *,
    validation: Validation = 'warn',
    **kwargs: Any,
) → DataFrame[Self][source]

Read a Delta Lake table into a typed data frame with this schema.

Compared to polars.read_delta(), this method checks the table’s metadata and runs validation if necessary to ensure that the data matches this schema.

Parameters:
  • source – Path or DeltaTable object from which to read the data.

  • validation

    The strategy for running validation when reading the data:

    • "allow": The method tries to read the parquet file’s metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.

    • "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.

    • "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file’s metadata matches this schema.

    • "skip": The method never runs validation and simply reads the parquet file, entrusting the user that the schema is valid. Use this option carefully and consider replacing it with :meth:`polars.read_delta` to convey the purpose better.

  • kwargs – Additional keyword arguments passed directly to polars.read_delta().

Returns:

The data frame with this schema.

Raises:

ValidationRequiredError – If no schema information can be read from the source and validation is set to "forbid".

Attention

Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications not made through dataframely will result in losing the metadata.

Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.

This method suffers from the same limitations as serialize().

classmethod read_parquet(
    source: str | Path | IO[bytes] | bytes | list[str] | list[Path] | list[IO[bytes]] | list[bytes],
    *,
    validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn',
    **kwargs: Any,
) → DataFrame[Self][source]

Read a parquet file into a typed data frame with this schema.

Compared to polars.read_parquet(), this method checks the parquet file’s metadata and runs validation if necessary to ensure that the data matches this schema.

Parameters:
  • source – Path, directory, or file-like object from which to read the data.

  • validation

    The strategy for running validation when reading the data:

    • "allow": The method tries to read the parquet file’s metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.

    • "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.

    • "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file’s metadata matches this schema.

    • "skip": The method never runs validation and simply reads the parquet file, entrusting the user that the schema is valid. Use this option carefully and consider replacing it with :meth:`polars.read_parquet` to convey the purpose better.

  • kwargs – Additional keyword arguments passed directly to polars.read_parquet().

Returns:

The data frame with this schema.

Raises:

ValidationRequiredError – If no schema information can be read from the source and validation is set to "forbid".

Attention

Be aware that this method suffers from the same limitations as serialize().
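A sketch of the intended round trip with MySchema (file path illustrative); write_parquet() embeds the schema metadata that makes validation="forbid" possible here:

df = MySchema.validate(raw_df, cast=True)  # raw_df: any polars DataFrame
MySchema.write_parquet(df, "my_data.parquet")

# Later: reads without re-validation as long as the embedded schema still matches.
df2 = MySchema.read_parquet("my_data.parquet", validation="forbid")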

classmethod sample(
    num_rows: int | None = None,
    *,
    overrides: Mapping[str, Iterable[Any]] | Sequence[Mapping[str, Any]] | None = None,
    generator: Generator | None = None,
) → DataFrame[Self][source]

Create a random data frame with a predefined number of rows.

Generally, this method should only be used for testing. Also, if you want to generate realistic test data, you will inevitably have to implement custom sampling logic (by making use of the Generator class).

In order to allow for sampling random data frames in the presence of custom rules and primary key constraints, this method performs fuzzy sampling: it samples in a loop until it finds a data frame of length num_rows which adheres to the schema. The maximum number of sampling rounds is configured via max_sampling_iterations in the Config class. When this setting is fixed to 1, it is only possible to reliably sample from schemas without custom rules and without primary key constraints.

Parameters:
  • num_rows – The (optional) number of rows to sample for creating the random data frame. Must be provided (only) if no overrides are provided. If this is None, the number of rows in the data frame is determined by the length of the values in overrides.

  • overrides – Fixed values for a subset of the columns of the sampled data frame. Just like when initializing a polars.DataFrame, overrides may either be provided as “column-” or “row-layout”, i.e. via a mapping or a list of mappings, respectively. The number of rows in the result data frame is equal to the length of the values in overrides. If both overrides and num_rows are provided, the length of the values in overrides must be equal to num_rows. The order of the items is guaranteed to match the ordering in the returned data frame. When providing values for a column, no sampling is performed for that column.

  • generator – The (seeded) generator to use for sampling data. If None, a generator with random seed is automatically created.

Returns:

A data frame valid under the current schema with a number of rows that matches the length of the values in overrides or num_rows.

Raises:
  • ValueError – If num_rows is not equal to the length of the values in overrides.

  • ValueError – If overrides are specified as a sequence of mappings and the mappings do not provide the same keys.

  • ValueError – If no valid data frame can be found in the configured maximum number of iterations.

Attention

Be aware that, due to sampling in a loop, the runtime of this method can be significant for complex schemas. Consider passing a seeded generator and evaluating whether the runtime impact in tests is bearable. Alternatively, it can be beneficial to provide custom overrides for columns associated with complex validation rules.
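For example, a sketch that fixes column "a" and samples the rest (the seeded-generator line is hypothetical; check the Generator class for the exact constructor):

df = MySchema.sample(3, overrides={"a": [1, 2, 3]})  # "b" is sampled randomly

# Hypothetical, for reproducibility:
# df = MySchema.sample(3, generator=some_seeded_generator)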

classmethod scan_delta(
    source: str | Path | deltalake.DeltaTable,
    *,
    validation: Validation = 'warn',
    **kwargs: Any,
) → LazyFrame[Self][source]

Lazily read a Delta Lake table into a typed data frame with this schema.

Compared to polars.scan_delta(), this method checks the table’s metadata and runs validation if necessary to ensure that the data matches this schema.

Parameters:
  • source – Path or DeltaTable object from which to read the data.

  • validation

    The strategy for running validation when reading the data:

    • "allow": The method tries to read the parquet file’s metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.

    • "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.

    • "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file’s metadata matches this schema.

    • "skip": The method never runs validation and simply reads the parquet file, entrusting the user that the schema is valid. Use this option carefully and consider replacing it with :meth:`polars.scan_delta` to convey the purpose better.

  • kwargs – Additional keyword arguments passed directly to polars.scan_delta().

Returns:

The lazy data frame with this schema.

Raises:

ValidationRequiredError – If no schema information can be read from the source and validation is set to "forbid".

Attention

Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications not made through dataframely will result in losing the metadata.

This method suffers from the same limitations as serialize().

classmethod scan_parquet(
    source: str | Path | IO[bytes] | bytes | list[str] | list[Path] | list[IO[bytes]] | list[bytes],
    *,
    validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn',
    **kwargs: Any,
) → LazyFrame[Self][source]

Lazily read a parquet file into a typed data frame with this schema.

Compared to polars.scan_parquet(), this method checks the parquet file’s metadata and runs validation if necessary to ensure that the data matches this schema.

Parameters:
  • source – Path, directory, or file-like object from which to read the data.

  • validation

    The strategy for running validation when reading the data:

    • "allow": The method tries to read the parquet file’s metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.

    • "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.

    • "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file’s metadata matches this schema.

    • "skip": The method never runs validation and simply reads the parquet file, entrusting the user that the schema is valid. Use this option carefully and consider replacing it with :meth:`polars.scan_parquet` to convey the purpose better.

  • kwargs – Additional keyword arguments passed directly to polars.scan_parquet().

Returns:

The lazy data frame with this schema.

Raises:

ValidationRequiredError – If no schema information can be read from the source and validation is set to "forbid".

Attention

Be aware that this method suffers from the same limitations as serialize().

classmethod serialize() → str[source]

Serialize this schema to a JSON string.

Returns:

The serialized schema.

Note

Serialization within dataframely itself will remain backwards-compatible at least within a major version. Until further notice, it will also be backwards-compatible across major versions.

Attention

Serialization of polars expressions is not guaranteed to be stable across versions of polars. This affects schemas that define custom rules or columns with custom checks: a schema serialized with one version of polars may not be deserializable with another version of polars.

Attention

This functionality is considered unstable. It may be changed at any time without it being considered a breaking change.

Raises:
  • TypeError – If any column contains metadata that is not JSON-serializable.

  • ValueError – If any column is not a “native” dataframely column type but a custom subclass.
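A minimal sketch of inspecting the serialized form:

import json

blob = MySchema.serialize()
payload = json.loads(blob)  # the schema as a plain JSON structure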

classmethod sink_parquet(
    lf: LazyFrame[Self],
    /,
    file: str | Path | IO[bytes] | PartitioningScheme,
    **kwargs: Any,
) → None[source]

Stream a typed lazy frame with this schema to a parquet file.

This method automatically adds a serialization of this schema to the parquet file as metadata. This metadata can be leveraged by read_parquet() and scan_parquet() for more efficient reading, or by external tools.

Parameters:
  • lf – The lazy frame to write to the parquet file.

  • file – The file path, writable file-like object, or partitioning scheme to which to write the parquet file.

  • kwargs – Additional keyword arguments passed directly to polars.sink_parquet(). metadata may only be provided if it is a dictionary.

Attention

Be aware that this method suffers from the same limitations as serialize().
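For example, a sketch that streams a lazily validated pipeline to disk (path illustrative):

lf = MySchema.validate(raw_lf, cast=True, eager=False)  # raw_lf: any LazyFrame
MySchema.sink_parquet(lf, "my_data.parquet")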

classmethod to_polars_schema() → pl.Schema[source]

Obtain the polars schema for this schema.

Returns:

A polars schema that mirrors the schema defined by this class.

classmethod to_pyarrow_schema() → pa.Schema[source]

Obtain the pyarrow schema for this schema.

Returns:

A pyarrow schema that mirrors the schema defined by this class.
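A brief sketch of both conversions:

pl_schema = MySchema.to_polars_schema()   # polars.Schema instance
pa_schema = MySchema.to_pyarrow_schema()  # pyarrow.Schema instance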

classmethod to_sqlalchemy_columns(dialect: sa.Dialect) → list[sa.Column][source]

Obtain the SQLAlchemy column definitions for a particular dialect for this schema.

Parameters:

dialect – The dialect for which to obtain the SQL schema. Note that column datatypes may differ across dialects.

Returns:

A list of sqlalchemy columns that can be used to create a table with the schema as defined by this class.
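For example, a sketch targeting PostgreSQL (table name illustrative):

import sqlalchemy as sa
from sqlalchemy.dialects import postgresql

columns = MySchema.to_sqlalchemy_columns(postgresql.dialect())
table = sa.Table("my_table", sa.MetaData(), *columns)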

classmethod validate(
    df: DataFrame | LazyFrame,
    /,
    *,
    cast: bool = False,
    eager: bool = True,
) → DataFrame[Self] | LazyFrame[Self][source]

Validate that a data frame satisfies the schema.

If eager=True, validation is performed within this function, regardless of whether an eager or lazy frame is passed. If eager=False, the returned lazy frame is merely extended with the validation logic; the logic will only be executed (and potentially raise an error) once collect() is called on it.

Parameters:
  • df – The data frame to validate.

  • cast – Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type if possible.

  • eager – Whether the validation should be performed eagerly and this method should raise upon failure. If False, the returned lazy frame will fail to collect if the validation does not pass.

Returns:

The input eager or lazy frame, wrapped in a generic version of the input’s data frame type to reflect schema adherence. This operation is guaranteed to maintain input ordering of rows.

Raises:
  • SchemaError – If eager=True and the input data frame is missing columns, or cast=False and any data type mismatches the definition in this schema. Only raised upon collection if eager=False.

  • ValidationError – If eager=True and any rule in the schema is violated, i.e. the data does not pass the validation. When eager=False, a ComputeError is raised upon collecting.

  • InvalidOperationError – If eager=True, cast=True, and the cast fails for any value in the data. Only raised upon collection if eager=False.
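A sketch of eager versus lazy validation with MySchema:

validated = MySchema.validate(raw_df, cast=True)  # raises here on failure

lazy = MySchema.validate(raw_lf, cast=True, eager=False)
result = lazy.collect()  # with eager=False, failures surface here instead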

classmethod write_delta(
    df: DataFrame[Self],
    /,
    target: str | Path | deltalake.DeltaTable,
    **kwargs: Any,
) → None[source]

Write a typed data frame with this schema to a Delta Lake table.

This method automatically adds a serialization of this schema to the Delta Lake table as metadata. The metadata can be leveraged by read_delta() and scan_delta() for efficient reading or by external tools.

Parameters:
  • df – The data frame to write to the Delta Lake table.

  • target – The path or DeltaTable object to which to write the data.

  • kwargs – Additional keyword arguments passed directly to polars.write_delta().

Attention

This method suffers from the same limitations as serialize().

Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications not made through dataframely will result in losing the metadata.

Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
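A sketch of the delta round trip (path illustrative):

MySchema.write_delta(df, "path/to/table")  # df: a DataFrame[MySchema]
lf = MySchema.scan_delta("path/to/table", validation="forbid")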

classmethod write_parquet(
    df: DataFrame[Self],
    /,
    file: str | Path | IO[bytes],
    **kwargs: Any,
) → None[source]

Write a typed data frame with this schema to a parquet file.

This method automatically adds a serialization of this schema to the parquet file as metadata. This metadata can be leveraged by read_parquet() and scan_parquet() for more efficient reading, or by external tools.

Parameters:
  • df – The data frame to write to the parquet file.

  • file – The file path or writable file-like object to which to write the parquet file. This should be a path to a directory if writing a partitioned dataset.

  • kwargs – Additional keyword arguments passed directly to polars.write_parquet(). metadata may only be provided if it is a dictionary.

Attention

Be aware that this method suffers from the same limitations as serialize().