Schema
- class dataframely.Schema[source]
Base class for all custom data frame schema definitions.
A custom schema should only define its columns via simple assignment:

    class MySchema(Schema):
        a = dataframely.Int64()
        b = dataframely.String()
All definitions using non-datatype classes are ignored.
Schemas can also be subclassed (arbitrarily deeply): in this case, the columns defined in the subclass are simply appended to the columns in the superclass(es).
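For illustration, a minimal sketch of such a subclass (assuming dataframely is imported as dy; the primary_key column option is an assumption used purely for illustration):

    import dataframely as dy

    class BaseSchema(dy.Schema):
        # Assumption: `primary_key=True` marks this column as (part of) the
        # schema's primary key, matching the primary_key() method below.
        id = dy.Int64(primary_key=True)

    class ExtendedSchema(BaseSchema):
        # Appended to the `id` column inherited from BaseSchema.
        name = dy.String()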
Methods:
- cast() – Cast a data frame to match the schema.
- column_names() – The column names of this schema.
- columns() – The column definitions of this schema.
- create_empty() – Create an empty data or lazy frame from this schema.
- create_empty_if_none() – Impute a None input with an empty, schema-compliant lazy or eager data frame, or return the input as a lazy or eager frame.
- filter() – Filter the data frame by the rules of this schema.
- is_valid() – Check whether a data frame satisfies the schema.
- matches() – Check whether this schema semantically matches another schema.
- primary_key() – The primary key columns in this schema (possibly empty).
- read_delta() – Read a Delta Lake table into a typed data frame with this schema.
- read_parquet() – Read a parquet file into a typed data frame with this schema.
- sample() – Create a random data frame with a predefined number of rows.
- scan_delta() – Lazily read a Delta Lake table into a typed data frame with this schema.
- scan_parquet() – Lazily read a parquet file into a typed data frame with this schema.
- serialize() – Serialize this schema to a JSON string.
- sink_parquet() – Stream a typed lazy frame with this schema to a parquet file.
- to_polars_schema() – Obtain the polars schema for this schema.
- to_pyarrow_schema() – Obtain the pyarrow schema for this schema.
- to_sqlalchemy_columns() – Obtain the SQLAlchemy column definitions for a particular dialect for this schema.
- validate() – Validate that a data frame satisfies the schema.
- write_delta() – Write a typed data frame with this schema to a Delta Lake table.
- write_parquet() – Write a typed data frame with this schema to a parquet file.
- classmethod cast(df: DataFrame | LazyFrame, /) → DataFrame[Self] | LazyFrame[Self][source]
Cast a data frame to match the schema.
This method removes superfluous columns and casts all schema columns to the correct dtypes. However, it does not introspect the data frame contents.
Hence, this method should be used with care and validate() should generally be preferred. It is advised to only use this method if df is surely known to adhere to the schema.
- Returns:
The input data frame, wrapped in a generic version of the input’s data frame type to reflect schema adherence.
Note
If you only require a generic data frame for the type checker, consider using typing.cast() instead of this method.
Attention
For lazy frames, casting is not performed eagerly. This avoids having to collect the lazy frame's schema, but it also means that a call to polars.LazyFrame.collect() further down the line might fail because of the cast and/or missing columns.
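As a usage sketch, consider the MySchema defined at the top of this page:

    import polars as pl

    df = pl.DataFrame({"a": [1, 2], "b": ["x", "y"], "extra": [0.0, 1.0]})
    typed = MySchema.cast(df)  # drops "extra" and casts "a"/"b" to the schema dtypes
    # Contents are not checked: prefer MySchema.validate(df) unless adherence is certain.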
- classmethod create_empty(*, lazy: bool = False) → DataFrame[Self] | LazyFrame[Self][source]
Create an empty data or lazy frame from this schema.
- Parameters:
lazy – Whether to create a lazy data frame. If True, returns a lazy frame with this Schema. Otherwise, returns an eager frame.
- Returns:
An instance of polars.DataFrame or polars.LazyFrame with this schema's defined columns and their data types.
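For example, with the MySchema from the top of this page:

    empty = MySchema.create_empty()                # eager polars.DataFrame, zero rows
    lazy_empty = MySchema.create_empty(lazy=True)  # polars.LazyFrame, same columns
    assert empty.height == 0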
- classmethod create_empty_if_none(…) → DataFrame[Self] | LazyFrame[Self][source]
Impute a None input with an empty, schema-compliant lazy or eager data frame, or return the input as a lazy or eager frame.
- Parameters:
df – The data frame to check for None. If it is not None, it is returned as a lazy or eager frame. Otherwise, a schema-compliant data or lazy frame with no rows is returned.
lazy – Whether to return a lazy data frame. If True, returns a lazy frame with this Schema. Otherwise, returns an eager frame.
- Returns:
The given data frame df as a lazy or eager frame, if it is not None. Otherwise, an instance of polars.DataFrame or polars.LazyFrame with this schema's defined columns and their data types, but no rows.
- classmethod filter(…) → FilterResult[Self] | LazyFilterResult[Self][source]
Filter the data frame by the rules of this schema.
This method can be thought of as a “soft alternative” to validate(). While validate() raises an exception when a row does not adhere to the rules defined in the schema, this method simply filters out these rows and succeeds.
- Parameters:
df – The data frame to filter for valid rows. The data frame is collected within this method, regardless of whether a DataFrame or LazyFrame is passed.
cast – Whether columns with a wrong data type in the input data frame are cast to the schema's defined data type if possible. Rows for which the cast fails for any column are filtered out.
eager – Whether the filter operation should be performed eagerly. If False, the returned lazy frame will fail to collect if the validation does not pass.
- Returns:
A tuple of the validated rows in the input data frame (potentially empty) and a simple dataclass carrying information about the rows of the data frame which could not be validated successfully. Just like in polars' native filter(), the order of rows in the returned data frame is maintained.
- Raises:
ValidationError – If the columns of the input data frame are invalid. This happens only if the data frame misses a column defined in the schema or a column has an invalid dtype while cast is set to False.
Note
This method preserves the ordering of the input data frame.
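A sketch of the intended usage with the MySchema from the top of this page; as described above, the result unpacks into the valid rows and the failure information, and rows whose cast fails are treated as failures:

    import polars as pl

    df = pl.DataFrame({"a": ["1", "oops"], "b": ["x", "y"]})
    good, failure = MySchema.filter(df, cast=True)
    # "1" casts cleanly to Int64, so the first row is kept in `good`; the cast
    # fails for "oops", so the second row is reported via `failure` instead.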
- classmethod is_valid(df: DataFrame | LazyFrame, /, *, cast: bool = False) → bool[source]
Check whether a data frame satisfies the schema.
This method has two major differences from validate():
- It always collects the input to eagerly evaluate validity and return a boolean value.
- It does not raise any of the documented exceptions for validate() and instead returns a value of False. Note that it still raises an exception if a lazy frame is provided as input and any logic prior to the validation causes an exception.
- Parameters:
df – The data frame to check for validity.
cast – Whether columns with a wrong data type in the input data frame are cast to the schema's defined data type before running validation. If set to False, a wrong data type will result in a return value of False.
- Returns:
Whether the provided data frame can be validated with this schema.
Notes
If you want to customize the engine being used for collecting the result within this method, consider wrapping the call in a context manager that sets the engine_affinity in the polars.Config.
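For example, with the MySchema from the top of this page:

    import polars as pl

    df = pl.DataFrame({"a": [1, 2], "b": ["x", "y"]})
    assert MySchema.is_valid(df)
    assert not MySchema.is_valid(df.drop("b"))  # missing column: False, no exception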
- classmethod matches(other: type[Schema]) → bool[source]
Check whether this schema semantically matches another schema.
This method checks whether the schemas have the same columns (with the same data types and constraints) as well as the same rules.
- Parameters:
other – The schema to compare with.
- Returns:
Whether the schemas are semantically equal.
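For instance, two independently defined schemas with identical columns match, while a dtype difference breaks the match:

    import dataframely as dy

    class SchemaA(dy.Schema):
        a = dy.Int64()

    class SchemaB(dy.Schema):
        a = dy.Int64()

    class SchemaC(dy.Schema):
        a = dy.String()

    assert SchemaA.matches(SchemaB)
    assert not SchemaA.matches(SchemaC)  # same column name, different dtype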
- classmethod primary_key() → list[str][source]
The primary key columns in this schema (possibly empty).
- classmethod read_delta(…) → DataFrame[Self][source]
Read a Delta Lake table into a typed data frame with this schema.
Compared to polars.read_delta(), this method checks the table's metadata and runs validation if necessary to ensure that the data matches this schema.
- Parameters:
source – Path or DeltaTable object from which to read the data.
validation –
The strategy for running validation when reading the data:
- "allow": The method tries to read the table's metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.
- "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
- "forbid": The method never runs validation automatically and only returns if the schema stored in the table's metadata matches this schema.
- "skip": The method never runs validation and simply reads the table, entrusting the user that the schema is valid. Use this option carefully and consider replacing it with polars.read_delta() to convey the purpose better.
kwargs – Additional keyword arguments passed directly to
polars.read_delta().
- Returns:
The data frame with this schema.
- Raises:
ValidationRequiredError – If no schema information can be read from the source and validation is set to "forbid".
Attention
Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not made through dataframely will result in losing the metadata.
Be aware that appending to an existing table via mode="append" may result in violations of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
This method suffers from the same limitations as
serialize().
- classmethod read_parquet(
- source: str | Path | IO[bytes] | bytes | list[str] | list[Path] | list[IO[bytes]] | list[bytes],
- *,
- validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn',
- **kwargs: Any,
- ) → DataFrame[Self][source]
Read a parquet file into a typed data frame with this schema.
Compared to polars.read_parquet(), this method checks the parquet file's metadata and runs validation if necessary to ensure that the data matches this schema.
- Parameters:
source – Path, directory, or file-like object from which to read the data.
validation –
The strategy for running validation when reading the data:
- "allow": The method tries to read the parquet file's metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.
- "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
- "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file's metadata matches this schema.
- "skip": The method never runs validation and simply reads the parquet file, entrusting the user that the schema is valid. Use this option carefully and consider replacing it with polars.read_parquet() to convey the purpose better.
kwargs – Additional keyword arguments passed directly to
polars.read_parquet().
- Returns:
The data frame with this schema.
- Raises:
ValidationRequiredError – If no schema information can be read from the source and validation is set to "forbid".
Attention
Be aware that this method suffers from the same limitations as
serialize().
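A minimal round trip with the MySchema from the top of this page; write_parquet() (documented below) embeds the schema metadata that this method checks (the file name is illustrative):

    import polars as pl

    df = MySchema.validate(pl.DataFrame({"a": [1], "b": ["x"]}), cast=True)
    MySchema.write_parquet(df, "data.parquet")   # embeds schema metadata
    out = MySchema.read_parquet("data.parquet")  # stored schema matches: no validation run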
- classmethod sample(
- num_rows: int | None = None,
- *,
- overrides: Mapping[str, Iterable[Any]] | Sequence[Mapping[str, Any]] | None = None,
- generator: Generator | None = None,
- ) → DataFrame[Self][source]
Create a random data frame with a predefined number of rows.
Generally, this method should only be used for testing. Also, if you want to generate realistic test data, you will inevitably need to implement custom sampling logic (by making use of the Generator class).
In order to allow for sampling random data frames in the presence of custom rules and primary key constraints, this method performs fuzzy sampling: it samples in a loop until it finds a data frame of length num_rows which adheres to the schema. The maximum number of sampling rounds is configured via max_sampling_iterations in the Config class. If this setting is fixed to 1, it is only possible to reliably sample from schemas without custom rules and without primary key constraints.
- Parameters:
num_rows – The (optional) number of rows to sample for creating the random data frame. Must be provided (only) if no overrides are provided. If this is None, the number of rows in the data frame is determined by the length of the values in overrides.
overrides – Fixed values for a subset of the columns of the sampled data frame. Just like when initializing a polars.DataFrame, overrides may either be provided in “column” or “row” layout, i.e. via a mapping or a list of mappings, respectively. The number of rows in the result data frame is equal to the length of the values in overrides. If both overrides and num_rows are provided, the length of the values in overrides must be equal to num_rows. The order of the items is guaranteed to match the ordering in the returned data frame. When providing values for a column, no sampling is performed for that column.
generator – The (seeded) generator to use for sampling data. If None, a generator with a random seed is automatically created.
- Returns:
A data frame valid under the current schema with a number of rows that matches the length of the values in overrides or num_rows.
- Raises:
ValueError – If num_rows is not equal to the length of the values in overrides.
ValueError – If overrides are specified as a sequence of mappings and the mappings do not provide the same keys.
ValueError – If no valid data frame can be found in the configured maximum number of iterations.
Attention
Be aware that, due to sampling in a loop, the runtime of this method can be significant for complex schemas. Consider passing a seeded generator and evaluating whether the runtime impact in the tests is bearable. Alternatively, it can be beneficial to provide custom column overrides for columns associated with complex validation rules.
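Two sketches of typical calls with the MySchema from the top of this page:

    df = MySchema.sample(3)                        # three random, schema-compliant rows
    df = MySchema.sample(overrides={"a": [1, 2]})  # "a" is fixed, "b" is still sampled
    # Row layout is equivalent: MySchema.sample(overrides=[{"a": 1}, {"a": 2}])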
- classmethod scan_delta(…) → LazyFrame[Self][source]
Lazily read a Delta Lake table into a typed data frame with this schema.
Compared to polars.scan_delta(), this method checks the table's metadata and runs validation if necessary to ensure that the data matches this schema.
- Parameters:
source – Path or DeltaTable object from which to read the data.
validation –
The strategy for running validation when reading the data:
- "allow": The method tries to read the table's metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.
- "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
- "forbid": The method never runs validation automatically and only returns if the schema stored in the table's metadata matches this schema.
- "skip": The method never runs validation and simply reads the table, entrusting the user that the schema is valid. Use this option carefully and consider replacing it with polars.scan_delta() to convey the purpose better.
kwargs – Additional keyword arguments passed directly to
polars.scan_delta().
- Returns:
The lazy data frame with this schema.
- Raises:
ValidationRequiredError – If no schema information can be read from the source and validation is set to "forbid".
Attention
Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not made through dataframely will result in losing the metadata.
This method suffers from the same limitations as
serialize().
- classmethod scan_parquet(
- source: str | Path | IO[bytes] | bytes | list[str] | list[Path] | list[IO[bytes]] | list[bytes],
- *,
- validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn',
- **kwargs: Any,
- ) → LazyFrame[Self][source]
Lazily read a parquet file into a typed data frame with this schema.
Compared to polars.scan_parquet(), this method checks the parquet file's metadata and runs validation if necessary to ensure that the data matches this schema.
- Parameters:
source – Path, directory, or file-like object from which to read the data.
validation –
The strategy for running validation when reading the data:
- "allow": The method tries to read the parquet file's metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.
- "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
- "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file's metadata matches this schema.
- "skip": The method never runs validation and simply reads the parquet file, entrusting the user that the schema is valid. Use this option carefully and consider replacing it with polars.scan_parquet() to convey the purpose better.
kwargs – Additional keyword arguments passed directly to
polars.scan_parquet().
- Returns:
The lazy data frame with this schema.
- Raises:
ValidationRequiredError – If no schema information can be read from the source and validation is set to "forbid".
Attention
Be aware that this method suffers from the same limitations as
serialize().
- classmethod serialize() → str[source]
Serialize this schema to a JSON string.
- Returns:
The serialized schema.
Note
Serialization within dataframely itself will remain backwards-compatible at least within a major version. Until further notice, it will also be backwards-compatible across major versions.
Attention
Serialization of polars expressions is not guaranteed to be stable across versions of polars. This affects schemas that define custom rules or columns with custom checks: a schema serialized with one version of polars may not be deserializable with another version of polars.
Attention
This functionality is considered unstable. It may be changed at any time without it being considered a breaking change.
- Raises:
TypeError – If any column contains metadata that is not JSON-serializable.
ValueError – If any column is not a “native” dataframely column type but a custom subclass.
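Usage is a single call; where the resulting string is stored is up to the caller (the file name below is illustrative):

    from pathlib import Path

    blob = MySchema.serialize()  # JSON string describing the schema
    Path("my_schema.json").write_text(blob)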
- classmethod sink_parquet(…) → None[source]
Stream a typed lazy frame with this schema to a parquet file.
This method automatically adds a serialization of this schema to the parquet file as metadata. This metadata can be leveraged by read_parquet() and scan_parquet() for more efficient reading, or by external tools.
- Parameters:
lf – The lazy frame to write to the parquet file.
file – The file path, writable file-like object, or partitioning scheme to which to write the parquet file.
kwargs – Additional keyword arguments passed directly to polars.LazyFrame.sink_parquet(). metadata may only be provided if it is a dictionary.
Attention
Be aware that this method suffers from the same limitations as
serialize().
- classmethod to_polars_schema() → Schema[source]
Obtain the polars schema for this schema.
- Returns:
A polars schema that mirrors the schema defined by this class.
- classmethod to_pyarrow_schema() → pa.Schema[source]
Obtain the pyarrow schema for this schema.
- Returns:
A pyarrow schema that mirrors the schema defined by this class.
- classmethod to_sqlalchemy_columns(dialect: sa.Dialect) → list[sa.Column][source]
Obtain the SQLAlchemy column definitions for a particular dialect for this schema.
- Parameters:
dialect – The dialect for which to obtain the SQL schema. Note that column datatypes may differ across dialects.
- Returns:
A list of sqlalchemy columns that can be used to create a table with the schema as defined by this class.
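A sketch that turns the schema into a table definition (the table name and the choice of the PostgreSQL dialect are illustrative):

    import sqlalchemy as sa
    from sqlalchemy.dialects import postgresql

    columns = MySchema.to_sqlalchemy_columns(postgresql.dialect())
    table = sa.Table("my_table", sa.MetaData(), *columns)
    # `table` can now be emitted, e.g. via sa.schema.CreateTable(table).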
- classmethod validate(…) → DataFrame[Self] | LazyFrame[Self][source]
Validate that a data frame satisfies the schema.
If an eager data frame is passed as input, validation is performed within this function. If a lazy frame is passed, the lazy frame is simply extended with the validation logic. The logic will only be executed (and potentially raise an error) once collect() is called on it.
- Parameters:
df – The data frame to validate.
cast – Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type if possible.
eager – Whether the validation should be performed eagerly and this method should raise upon failure. If
False, the returned lazy frame will fail to collect if the validation does not pass.
- Returns:
The input eager or lazy frame, wrapped in a generic version of the input’s data frame type to reflect schema adherence. This operation is guaranteed to maintain input ordering of rows.
- Raises:
SchemaError – If eager=True and the input data frame misses columns, or cast=False and any data type mismatches the definition in this schema. Only raised upon collection if eager=False.
ValidationError – If eager=True and any rule in the schema is violated, i.e. the data does not pass the validation. When eager=False, a ComputeError is raised upon collecting.
InvalidOperationError – If eager=True, cast=True, and the cast fails for any value in the data. Only raised upon collection if eager=False.
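A sketch of both modes with the MySchema from the top of this page:

    import polars as pl

    df = pl.DataFrame({"a": [1, 2], "b": ["x", "y"]})
    good = MySchema.validate(df)       # eager input: raises here if invalid

    lf = MySchema.validate(df.lazy())  # lazy input: validation logic is appended...
    out = lf.collect()                 # ...and only runs (and may raise) here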
- classmethod write_delta(…) → None[source]
Write a typed data frame with this schema to a Delta Lake table.
This method automatically adds a serialization of this schema to the Delta Lake table as metadata. The metadata can be leveraged by read_delta() and scan_delta() for efficient reading or by external tools.
- Parameters:
df – The data frame to write to the Delta Lake table.
target – The path or DeltaTable object to which to write the data.
kwargs – Additional keyword arguments passed directly to
polars.write_delta().
Attention
This method suffers from the same limitations as serialize().
Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not made through dataframely will result in losing the metadata.
Be aware that appending to an existing table via mode="append" may result in violations of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
- classmethod write_parquet(…) → None[source]
Write a typed data frame with this schema to a parquet file.
This method automatically adds a serialization of this schema to the parquet file as metadata. This metadata can be leveraged by read_parquet() and scan_parquet() for more efficient reading, or by external tools.
- Parameters:
df – The data frame to write to the parquet file.
file – The file path or writable file-like object to which to write the parquet file. This should be a path to a directory if writing a partitioned dataset.
kwargs – Additional keyword arguments passed directly to polars.write_parquet(). metadata may only be provided if it is a dictionary.
Attention
Be aware that this method suffers from the same limitations as
serialize().