dataframely.rule

dataframely.rule(*, group_by: list[str] | None = None) → Callable[[Callable[[], Expr]], Rule]

Mark a function as a rule to evaluate during validation.

The name of the function is used as the name of the rule. The function should return a boolean expression that indicates whether a row is valid with respect to the rule. A value of true indicates validity.

Rules should be used only in the following two circumstances:

Validation requires accessing multiple columns (e.g. if valid values of column A depend on the value in column B).
Validation must be performed on groups of rows (e.g. if a column A must not contain any duplicate values among rows with the same value in column B).

In all other instances, column-level validation rules should be preferred, as they aid readability and improve error messages.

Parameters: group_by – An optional list of columns to group by for rules that operate on groups of rows. If this list is provided, the returned expression must evaluate to a single boolean value per group, i.e. it must use some kind of aggregation function (e.g. sum, any, …).

Note

You’ll need to explicitly handle null values in your columns when defining rules. By default, any rule that evaluates to null because one of the columns used in the rule is null is interpreted as true, i.e. the row is assumed to be valid.

Attention

The rule logic should return a static result. Other implementations that use arbitrary Python logic work for filtering and validation, but may lead to wrong results in Schema comparisons and (de-)serialization.